* [PATCH v2 01/54] KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 02/54] KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible Mingwei Zhang
` (53 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sean Christopherson <seanjc@google.com>
Set the enable bits for general purpose counters in IA32_PERF_GLOBAL_CTRL
when refreshing the PMU to emulate the MSR's architecturally defined
post-RESET behavior. Per Intel's SDM:
IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
and
Where "n" is the number of general-purpose counters available in the processor.
AMD also documents this behavior for PerfMonV2 CPUs in one of AMD's many
PPRs.
Do not set any PERF_GLOBAL_CTRL bits if there are no general purpose
counters, although a literal reading of the SDM would require the CPU to
set either bits 63:0 or 31:0. The intent of the behavior is to globally
enable all GP counters; honor the intent, if not the letter of the law.
Leaving PERF_GLOBAL_CTRL '0' effectively breaks PMU usage in guests that
haven't been updated to work with PMUs that support PERF_GLOBAL_CTRL.
This bug was recently exposed when KVM added support for AMD's
PerfMonV2, i.e. when KVM started exposing a vPMU with PERF_GLOBAL_CTRL to
guest software that only knew how to program v1 PMUs (that don't support
PERF_GLOBAL_CTRL).
Failure to emulate the post-RESET behavior results in such guests
unknowingly leaving all general purpose counters globally disabled (the
entire reason the post-RESET value sets the GP counter enable bits is to
maintain backwards compatibility).
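For illustration (this snippet is not part of the patch), PMU v1 guest
software typically arms a counter like the sketch below: it programs only
the per-counter event select MSR and never touches PERF_GLOBAL_CTRL, so
it counts nothing unless the GP enable bits are already set out of RESET.

	/* Minimal v1-style setup: note the absence of any PERF_GLOBAL_CTRL write. */
	static void v1_guest_start_cycles_counter(void)
	{
		u64 evtsel = ARCH_PERFMON_EVENTSEL_ENABLE |
			     ARCH_PERFMON_EVENTSEL_USR |
			     ARCH_PERFMON_EVENTSEL_OS |
			     0x3c;	/* UnHalted Core Cycles, umask 0x00 */

		wrmsrl(MSR_ARCH_PERFMON_PERFCTR0, 0);
		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0, evtsel);
	}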
The bug has likely gone unnoticed because PERF_GLOBAL_CTRL has been
supported on Intel CPUs for as long as KVM has existed, i.e. hardly anyone
is running guest software that isn't aware of PERF_GLOBAL_CTRL on Intel
PMUs. And because up until v6.0, KVM _did_ emulate the behavior for Intel
CPUs, although the old behavior was likely dumb luck: (a) that old code
was also broken in its own way (the history of this code is a comedy of
errors), and (b) PERF_GLOBAL_CTRL was documented as having a value of
'0' post-RESET in all SDMs before March 2023.
Initial vPMU support in commit f5132b01386b ("KVM: Expose a version 2
architectural PMU to a guests") *almost* got it right (again likely by
dumb luck), but for some reason only set the bits if the guest PMU was
advertised as v1:
if (pmu->version == 1) {
pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1;
return;
}
Commit f19a0c2c2e6a ("KVM: PMU emulation: GLOBAL_CTRL MSR should be
enabled on reset") then tried to remedy that goof, presumably because
guest PMUs were leaving PERF_GLOBAL_CTRL '0', i.e. weren't enabling
counters.
pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) |
(((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED);
pmu->global_ctrl_mask = ~pmu->global_ctrl;
That was KVM's behavior up until commit c49467a45fe0 ("KVM: x86/pmu:
Don't overwrite the pmu->global_ctrl when refreshing") removed
*everything*. However, it did so based on the behavior defined by the
SDM, which at the time stated that "Global Perf Counter Controls" is
'0' at Power-Up and RESET.
But then the March 2023 SDM (325462-079US) stealthily changed its
"IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT"
table to say:
IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits.
Note, kvm_pmu_refresh() can be invoked multiple times, i.e. it's not a
"pure" RESET flow. But it can only be called prior to the first KVM_RUN,
i.e. the guest will only ever observe the final value.
Note #2, KVM has always cleared global_ctrl during refresh (see commit
f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")),
i.e. there is no danger of breaking existing setups by clobbering a value
set by userspace.
Reported-by: Babu Moger <babu.moger@amd.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240309013641.1413400-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/pmu.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c397b28e3d1b..a593b03c9aed 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -775,8 +775,20 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->pebs_data_cfg_mask = ~0ull;
bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
- if (vcpu->kvm->arch.enable_pmu)
- static_call(kvm_x86_pmu_refresh)(vcpu);
+ if (!vcpu->kvm->arch.enable_pmu)
+ return;
+
+ static_call(kvm_x86_pmu_refresh)(vcpu);
+
+ /*
+ * At RESET, both Intel and AMD CPUs set all enable bits for general
+ * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
+ * was written for v1 PMUs don't unknowingly leave GP counters disabled
+ * in the global controls). Emulate that behavior when refreshing the
+ * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
+ */
+ if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
+ pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
}
void kvm_pmu_init(struct kvm_vcpu *vcpu)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 02/54] KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 01/54] KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET" Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 03/54] KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms Mingwei Zhang
` (52 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sean Christopherson <seanjc@google.com>
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.
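For context, the existing uncached helper re-walks the vCPU's CPUID
entries on every call, roughly as sketched below (paraphrased from memory,
not copied from the tree), whereas the new field reduces the hot-path
query to a single cached bool load:

	static inline bool guest_vendor_is_amd_or_hygon_uncached(struct kvm_vcpu *vcpu)
	{
		struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 0);

		return best &&
		       (is_guest_vendor_amd(best->ebx, best->ecx, best->edx) ||
			is_guest_vendor_hygon(best->ebx, best->ecx, best->edx));
	}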
Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor. One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.
KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior. And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel. And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.
Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior. The Intel code will be
cleaned up in the future.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/cpuid.h | 10 ++++++++++
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/x86.c | 2 +-
5 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 16e07a2eee19..6efd1497b026 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -855,6 +855,7 @@ struct kvm_vcpu_arch {
int cpuid_nent;
struct kvm_cpuid_entry2 *cpuid_entries;
struct kvm_hypervisor_cpuid kvm_cpuid;
+ bool is_amd_compatible;
/*
* FIXME: Drop this macro and use KVM_NR_GOVERNED_FEATURES directly
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bfc0bfcb2bc6..77352a4abd87 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -376,6 +376,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_update_pv_runtime(vcpu);
+ vcpu->arch.is_amd_compatible = guest_cpuid_is_amd_or_hygon(vcpu);
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 856e3037e74f..23dbb9eb277c 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -120,6 +120,16 @@ static inline bool guest_cpuid_is_intel(struct kvm_vcpu *vcpu)
return best && is_guest_vendor_intel(best->ebx, best->ecx, best->edx);
}
+static inline bool guest_cpuid_is_amd_compatible(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.is_amd_compatible;
+}
+
+static inline bool guest_cpuid_is_intel_compatible(struct kvm_vcpu *vcpu)
+{
+ return !guest_cpuid_is_amd_compatible(vcpu);
+}
+
static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
{
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 992e651540e8..bf4de6d7e39c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4935,7 +4935,7 @@ static void reset_guest_rsvds_bits_mask(struct kvm_vcpu *vcpu,
context->cpu_role.base.level, is_efer_nx(context),
guest_can_use(vcpu, X86_FEATURE_GBPAGES),
is_cr4_pse(context),
- guest_cpuid_is_amd_or_hygon(vcpu));
+ guest_cpuid_is_amd_compatible(vcpu));
}
static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 47d9f03b7778..ebcc12d1e1de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3470,7 +3470,7 @@ static bool is_mci_status_msr(u32 msr)
static bool can_set_mci_status(struct kvm_vcpu *vcpu)
{
/* McStatusWrEn enabled? */
- if (guest_cpuid_is_amd_or_hygon(vcpu))
+ if (guest_cpuid_is_amd_compatible(vcpu))
return !!(vcpu->arch.msr_hwcr & BIT_ULL(18));
return false;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 03/54] KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 01/54] KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET" Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 02/54] KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 04/54] x86/msr: Define PerfCntrGlobalStatusSet register Mingwei Zhang
` (51 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.
For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.
This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.
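For reference, a sketch (not part of this patch) of the unmasking the
above refers to: in the guest kernel, x86_pmu_handle_irq() rewrites LVTPC
with the NMI delivery mode, which leaves APIC_LVT_MASKED clear and
re-arms PMIs. A PerfMonV2 guest that never issues an equivalent write is
exactly the case being fixed here.

	static void guest_pmi_reenable_lvtpc(void)
	{
		/* Clears the mask bit that kvm_apic_local_deliver() set on delivery. */
		apic_write(APIC_LVTPC, APIC_DM_NMI);
	}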
Before:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]
After:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]
Fixes: a16eb25b09c0 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: stable@vger.kernel.org
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240405235603.1173076-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/lapic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index cf37586f0466..ebf41023be38 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2776,7 +2776,8 @@ int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;
r = __apic_accept_irq(apic, mode, vector, 1, trig_mode, NULL);
- if (r && lvt_type == APIC_LVTPC)
+ if (r && lvt_type == APIC_LVTPC &&
+ guest_cpuid_is_intel_compatible(apic->vcpu))
kvm_lapic_set_reg(apic, APIC_LVTPC, reg | APIC_LVT_MASKED);
return r;
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 04/54] x86/msr: Define PerfCntrGlobalStatusSet register
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (2 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 03/54] KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 05/54] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET Mingwei Zhang
` (50 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Define PerfCntrGlobalStatusSet (MSR 0xc0000303) as it is required by
the passthrough PMU to set the overflow bits of PerfCntrGlobalStatus
(MSR 0xc0000300).
When using passthrough PMU, it is necessary to restore the guest state
of the overflow bits. Since PerfCntrGlobalStatus is read-only, this is
done by writing to PerfCntrGlobalStatusSet instead.
The register is available on AMD processors where the PerfMonV2 feature
bit of CPUID leaf 0x80000022 EAX is set.
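The intended use looks roughly like the sketch below (illustrative only;
the actual save/restore code lands later in this series and the function
name is made up here):

	/* Replay the guest's saved overflow bits into the read-only status MSR. */
	static void amd_restore_guest_global_status(u64 guest_status)
	{
		wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, guest_status);
	}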
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/include/asm/msr-index.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 05956bd8bacf..ed9e3e8a57d6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -675,6 +675,7 @@
#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS 0xc0000300
#define MSR_AMD64_PERF_CNTR_GLOBAL_CTL 0xc0000301
#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR 0xc0000302
+#define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET 0xc0000303
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 05/54] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (3 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 04/54] x86/msr: Define PerfCntrGlobalStatusSet register Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
` (49 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Dapeng Mi <dapeng1.mi@linux.intel.com>
Add an additional PMU MSR, MSR_CORE_PERF_GLOBAL_STATUS_SET, to allow the
passthrough PMU to operate on the read-only MSR IA32_PERF_GLOBAL_STATUS.
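As with the AMD counterpart in the previous patch, the expected use is to
replay saved guest overflow state into the read-only status register; a
sketch (function name is illustrative, not from this series):

	static void intel_restore_guest_global_status(u64 guest_status)
	{
		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, guest_status);
	}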
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/msr-index.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ed9e3e8a57d6..a1281511807c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1107,6 +1107,7 @@
#define MSR_CORE_PERF_GLOBAL_STATUS 0x0000038e
#define MSR_CORE_PERF_GLOBAL_CTRL 0x0000038f
#define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x00000390
+#define MSR_CORE_PERF_GLOBAL_STATUS_SET 0x00000391
#define MSR_PERF_METRICS 0x00000329
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (4 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 05/54] x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-07 8:31 ` Peter Zijlstra
2024-05-07 8:41 ` Peter Zijlstra
2024-05-06 5:29 ` [PATCH v2 07/54] perf: Add generic exclude_guest support Mingwei Zhang
` (48 subsequent siblings)
54 siblings, 2 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Kan Liang <kan.liang@linux.intel.com>
Currently, the guest and host share the PMU resources when a guest is
running. KVM has to create an extra virtual event to simulate the
guest's event, which brings several issues, e.g., high overhead and
inaccuracy.
A new passthrough PMU method is proposed to address the issue. It requires
that the PMU resources can be fully occupied by the guest while it's
running. Two new interfaces are implemented to fulfill the requirement.
The hypervisor should invoke the interfaces when creating a guest that
wants the passthrough PMU capability.
The PMU resources should only be temporarily occupied as a whole when a
guest is running. When the guest is not running, the PMU resources are still
shared among different users.
The exclude_guest event modifier is used to guarantee the exclusive
occupation of the PMU resources. When creating a guest, the hypervisor
should check whether there are !exclude_guest events in the system.
If yes, the creation should fail, because some PMU resources have been
occupied by other users.
If no, the PMU resources can be safely accessed by the guest directly.
Perf guarantees that no new !exclude_guest events are created when a
guest is running.
Only the passthrough PMU is affected, not other PMUs, e.g., uncore and
SW PMUs. The behavior of those PMUs is not changed. The guest
enter/exit interfaces should only impact the supported PMUs.
Add a new PERF_PMU_CAP_PASSTHROUGH_VPMU flag to indicate the PMUs that
support the feature.
Add nr_include_guest_events to track the !exclude_guest system-wide
events of PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU.
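The expected usage from the hypervisor side looks roughly like the sketch
below (illustrative only; the actual KVM call sites arrive later in this
series):

	/* At VM creation, claim exclusive guest ownership of the core PMU. */
	static int enable_mediated_vpmu_example(struct kvm *kvm)
	{
		int ret = perf_get_mediated_pmu();

		if (ret)	/* -EBUSY: !exclude_guest system-wide events exist */
			return ret;

		/* ... set up the passthrough vPMU ... */
		return 0;
	}

	/* At VM destruction, release it so !exclude_guest events can be created again. */
	static void disable_mediated_vpmu_example(struct kvm *kvm)
	{
		perf_put_mediated_pmu();
	}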
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
include/linux/perf_event.h | 9 +++++
kernel/events/core.c | 67 ++++++++++++++++++++++++++++++++++++++
2 files changed, 76 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d2a15c0c6f8a..dd4920bf3d1b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -291,6 +291,7 @@ struct perf_event_pmu_context;
#define PERF_PMU_CAP_NO_EXCLUDE 0x0040
#define PERF_PMU_CAP_AUX_OUTPUT 0x0080
#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
+#define PERF_PMU_CAP_PASSTHROUGH_VPMU 0x0200
struct perf_output_handle;
@@ -1731,6 +1732,8 @@ extern void perf_event_task_tick(void);
extern int perf_event_account_interrupt(struct perf_event *event);
extern int perf_event_period(struct perf_event *event, u64 value);
extern u64 perf_event_pause(struct perf_event *event, bool reset);
+extern int perf_get_mediated_pmu(void);
+extern void perf_put_mediated_pmu(void);
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1817,6 +1820,12 @@ static inline u64 perf_event_pause(struct perf_event *event, bool reset)
{
return 0;
}
+static inline int perf_get_mediated_pmu(void)
+{
+ return 0;
+}
+
+static inline void perf_put_mediated_pmu(void) { }
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 724e6d7e128f..701b622c670e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -402,6 +402,21 @@ static atomic_t nr_bpf_events __read_mostly;
static atomic_t nr_cgroup_events __read_mostly;
static atomic_t nr_text_poke_events __read_mostly;
static atomic_t nr_build_id_events __read_mostly;
+static atomic_t nr_include_guest_events __read_mostly;
+
+static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);
+static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+
+/* !exclude_guest system wide event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
+static inline bool is_include_guest_event(struct perf_event *event)
+{
+ if ((event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
+ !event->attr.exclude_guest &&
+ !event->attr.task)
+ return true;
+
+ return false;
+}
static LIST_HEAD(pmus);
static DEFINE_MUTEX(pmus_lock);
@@ -5193,6 +5208,9 @@ static void _free_event(struct perf_event *event)
unaccount_event(event);
+ if (is_include_guest_event(event))
+ atomic_dec(&nr_include_guest_events);
+
security_perf_event_free(event);
if (event->rb) {
@@ -5737,6 +5755,42 @@ u64 perf_event_pause(struct perf_event *event, bool reset)
}
EXPORT_SYMBOL_GPL(perf_event_pause);
+/*
+ * Currently invoked at VM creation to
+ * - Check whether there are existing !exclude_guest system wide events
+ * of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU
+ * - Set nr_mediated_pmu_vms to prevent !exclude_guest event creation on
+ * PMUs with PERF_PMU_CAP_PASSTHROUGH_VPMU
+ *
+ * No impact for the PMU without PERF_PMU_CAP_PASSTHROUGH_VPMU. The perf
+ * still owns all the PMU resources.
+ */
+int perf_get_mediated_pmu(void)
+{
+ int ret = 0;
+
+ mutex_lock(&perf_mediated_pmu_mutex);
+ if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
+ goto end;
+
+ if (atomic_read(&nr_include_guest_events)) {
+ ret = -EBUSY;
+ goto end;
+ }
+ refcount_set(&nr_mediated_pmu_vms, 1);
+end:
+ mutex_unlock(&perf_mediated_pmu_mutex);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
+
+void perf_put_mediated_pmu(void)
+{
+ if (!refcount_dec_not_one(&nr_mediated_pmu_vms))
+ refcount_set(&nr_mediated_pmu_vms, 0);
+}
+EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
+
/*
* Holding the top-level event's child_mutex means that any
* descendant process that has inherited this event will block
@@ -12086,11 +12140,24 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (err)
goto err_callchain_buffer;
+ if (is_include_guest_event(event)) {
+ mutex_lock(&perf_mediated_pmu_mutex);
+ if (refcount_read(&nr_mediated_pmu_vms)) {
+ mutex_unlock(&perf_mediated_pmu_mutex);
+ err = -EACCES;
+ goto err_security_alloc;
+ }
+ atomic_inc(&nr_include_guest_events);
+ mutex_unlock(&perf_mediated_pmu_mutex);
+ }
+
/* symmetric to unaccount_event() in _free_event() */
account_event(event);
return event;
+err_security_alloc:
+ security_perf_event_free(event);
err_callchain_buffer:
if (!event->parent) {
if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* Re: [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces
2024-05-06 5:29 ` [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
@ 2024-05-07 8:31 ` Peter Zijlstra
2024-05-08 4:13 ` Zhang, Xiong Y
2024-05-07 8:41 ` Peter Zijlstra
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 8:31 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:31AM +0000, Mingwei Zhang wrote:
> +int perf_get_mediated_pmu(void)
> +{
> + int ret = 0;
> +
> + mutex_lock(&perf_mediated_pmu_mutex);
> + if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
> + goto end;
> +
> + if (atomic_read(&nr_include_guest_events)) {
> + ret = -EBUSY;
> + goto end;
> + }
> + refcount_set(&nr_mediated_pmu_vms, 1);
> +end:
> + mutex_unlock(&perf_mediated_pmu_mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
Blergh... it seems I never got my perf guard patches merged :/, but
could we please write this like:
int perf_get_mediated_pmu(void)
{
guard(mutex)(&perf_mediated_pmu_mutex);
if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
return 0;
if (atomic_read(&nr_include_guest_events))
return -EBUSY;
refcount_set(&nr_mediated_pmu_vms, 1);
return 0;
}
And avoid adding more unlock goto thingies?
* Re: [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces
2024-05-07 8:31 ` Peter Zijlstra
@ 2024-05-08 4:13 ` Zhang, Xiong Y
0 siblings, 0 replies; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-05-08 4:13 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 5/7/2024 4:31 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:31AM +0000, Mingwei Zhang wrote:
>
>
>> +int perf_get_mediated_pmu(void)
>> +{
>> + int ret = 0;
>> +
>> + mutex_lock(&perf_mediated_pmu_mutex);
>> + if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
>> + goto end;
>> +
>> + if (atomic_read(&nr_include_guest_events)) {
>> + ret = -EBUSY;
>> + goto end;
>> + }
>> + refcount_set(&nr_mediated_pmu_vms, 1);
>> +end:
>> + mutex_unlock(&perf_mediated_pmu_mutex);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
>
> Blergh... it seems I never got my perf guard patches merged :/, but
> could we please write this like:
>
> int perf_get_mediated_pmu(void)
> {
> guard(mutex)(&perf_mediated_pmu_mutex);
> if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
> return 0;
>
> if (atomic_read(&nr_include_guest_events))
> return -EBUSY;
>
> refcount_set(&nr_mediated_pmu_vms, 1);
> return 0;
> }
>
> And avoid adding more unlock goto thingies?
yes, this works. And the code is cleaner.
thanks
* Re: [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces
2024-05-06 5:29 ` [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
2024-05-07 8:31 ` Peter Zijlstra
@ 2024-05-07 8:41 ` Peter Zijlstra
2024-05-08 4:54 ` Zhang, Xiong Y
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 8:41 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:31AM +0000, Mingwei Zhang wrote:
> +int perf_get_mediated_pmu(void)
> +{
> + int ret = 0;
> +
> + mutex_lock(&perf_mediated_pmu_mutex);
> + if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
> + goto end;
> +
> + if (atomic_read(&nr_include_guest_events)) {
> + ret = -EBUSY;
> + goto end;
> + }
> + refcount_set(&nr_mediated_pmu_vms, 1);
> +end:
> + mutex_unlock(&perf_mediated_pmu_mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
> +
> +void perf_put_mediated_pmu(void)
> +{
> + if (!refcount_dec_not_one(&nr_mediated_pmu_vms))
> + refcount_set(&nr_mediated_pmu_vms, 0);
I'm sorry, but this made the WTF'o'meter go 'ding'.
Isn't that simply refcount_dec() ?
> +}
> +EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
> +
> /*
> * Holding the top-level event's child_mutex means that any
> * descendant process that has inherited this event will block
> @@ -12086,11 +12140,24 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> if (err)
> goto err_callchain_buffer;
>
> + if (is_include_guest_event(event)) {
> + mutex_lock(&perf_mediated_pmu_mutex);
> + if (refcount_read(&nr_mediated_pmu_vms)) {
> + mutex_unlock(&perf_mediated_pmu_mutex);
> + err = -EACCES;
> + goto err_security_alloc;
> + }
> + atomic_inc(&nr_include_guest_events);
> + mutex_unlock(&perf_mediated_pmu_mutex);
> + }
Wouldn't all that be nicer with a helper function?
if (is_include_guest_event() && !perf_get_guest_event())
goto err_security_alloc;
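A minimal sketch of such a helper (the name follows the suggestion above;
the body is an assumption, not code from this thread; the caller would
still set the error code, -EACCES in the posted patch):

	static bool perf_get_guest_event(void)
	{
		guard(mutex)(&perf_mediated_pmu_mutex);

		if (refcount_read(&nr_mediated_pmu_vms))
			return false;

		atomic_inc(&nr_include_guest_events);
		return true;
	}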
> +
> /* symmetric to unaccount_event() in _free_event() */
> account_event(event);
>
> return event;
>
> +err_security_alloc:
> + security_perf_event_free(event);
> err_callchain_buffer:
> if (!event->parent) {
> if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
* Re: [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces
2024-05-07 8:41 ` Peter Zijlstra
@ 2024-05-08 4:54 ` Zhang, Xiong Y
2024-05-08 8:32 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-05-08 4:54 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 5/7/2024 4:41 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:31AM +0000, Mingwei Zhang wrote:
>
>> +int perf_get_mediated_pmu(void)
>> +{
>> + int ret = 0;
>> +
>> + mutex_lock(&perf_mediated_pmu_mutex);
>> + if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
>> + goto end;
>> +
>> + if (atomic_read(&nr_include_guest_events)) {
>> + ret = -EBUSY;
>> + goto end;
>> + }
>> + refcount_set(&nr_mediated_pmu_vms, 1);
>> +end:
>> + mutex_unlock(&perf_mediated_pmu_mutex);
>> + return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(perf_get_mediated_pmu);
>> +
>> +void perf_put_mediated_pmu(void)
>> +{
>> + if (!refcount_dec_not_one(&nr_mediated_pmu_vms))
>> + refcount_set(&nr_mediated_pmu_vms, 0);
>
> I'm sorry, but this made the WTF'o'meter go 'ding'.
>
> Isn't that simply refcount_dec() ?
When nr_mediated_pmu_vms is 1, refcount_dec(&nr_mediated_pmu_vms) triggers an
error and call trace: refcount_t: decrement hit 0; leaking memory.
Similarly, when nr_mediated_pmu_vms is 0, refcount_inc(&nr_mediated_pmu_vms)
also triggers an error and call trace: refcount_t: addition on 0; use_after_free.
It seems refcount_set() should be used to set a refcount_t to 1 or 0.
>
>> +}
>> +EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>> +
>> /*
>> * Holding the top-level event's child_mutex means that any
>> * descendant process that has inherited this event will block
>> @@ -12086,11 +12140,24 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>> if (err)
>> goto err_callchain_buffer;
>>
>> + if (is_include_guest_event(event)) {
>> + mutex_lock(&perf_mediated_pmu_mutex);
>> + if (refcount_read(&nr_mediated_pmu_vms)) {
>> + mutex_unlock(&perf_mediated_pmu_mutex);
>> + err = -EACCES;
>> + goto err_security_alloc;
>> + }
>> + atomic_inc(&nr_include_guest_events);
>> + mutex_unlock(&perf_mediated_pmu_mutex);
>> + }
>
> Wouldn't all that be nicer with a helper function?
yes, it is nicer.
thanks
>
> if (is_include_guest_event() && !perf_get_guest_event())
> goto err_security_alloc;
>
>> +
>> /* symmetric to unaccount_event() in _free_event() */
>> account_event(event);
>>
>> return event;
>>
>> +err_security_alloc:
>> + security_perf_event_free(event);
>> err_callchain_buffer:
>> if (!event->parent) {
>> if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
>> --
>> 2.45.0.rc1.225.g2a3ae87e7f-goog
>>
>
* Re: [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces
2024-05-08 4:54 ` Zhang, Xiong Y
@ 2024-05-08 8:32 ` Peter Zijlstra
0 siblings, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-08 8:32 UTC (permalink / raw)
To: Zhang, Xiong Y
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Wed, May 08, 2024 at 12:54:31PM +0800, Zhang, Xiong Y wrote:
> On 5/7/2024 4:41 PM, Peter Zijlstra wrote:
> > On Mon, May 06, 2024 at 05:29:31AM +0000, Mingwei Zhang wrote:
> >> +void perf_put_mediated_pmu(void)
> >> +{
> >> + if (!refcount_dec_not_one(&nr_mediated_pmu_vms))
> >> + refcount_set(&nr_mediated_pmu_vms, 0);
> >
> > I'm sorry, but this made the WTF'o'meter go 'ding'.
> >
> > Isn't that simply refcount_dec() ?
> when nr_mediated_pmu_vms is 1, refcount_dec(&nr_mediated_pmu_vms) has an
> error and call trace: refcount_t: decrement hit 0; leaking memory.
>
> Similar when nr_mediated_pmu_vms is 0, refcount_inc(&nr_mediated_pmu_vms)
> has an error and call trace also: refcount_t: addition on 0; use_after_free.
>
> it seems refcount_set() should be used to set 1 or 0 to refcount_t.
Ah, yes, you need refcount_dec_and_test() in order to free. But if this
is the case, then you simply shouldn't be using refcount.
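One possible shape of that suggestion, sketched under the assumption that
a plain counter serialized by the existing mutex is acceptable (the names
follow the posted patch, but the rework itself is not code from this thread):

	static unsigned int nr_mediated_pmu_vms;

	int perf_get_mediated_pmu(void)
	{
		guard(mutex)(&perf_mediated_pmu_mutex);

		if (!nr_mediated_pmu_vms && atomic_read(&nr_include_guest_events))
			return -EBUSY;

		nr_mediated_pmu_vms++;
		return 0;
	}

	void perf_put_mediated_pmu(void)
	{
		guard(mutex)(&perf_mediated_pmu_mutex);
		nr_mediated_pmu_vms--;
	}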
* [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (5 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 06/54] perf: Support get/put passthrough PMU interfaces Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-07 8:58 ` Peter Zijlstra
2024-05-06 5:29 ` [PATCH v2 08/54] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
` (47 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Kan Liang <kan.liang@linux.intel.com>
Currently, perf doesn't explicitly schedule out all exclude_guest events
while the guest is running. That is not a problem for the current emulated
vPMU, because perf owns all the PMU counters. It can mask the counter
which is assigned to an exclude_guest event when a guest is running
(the Intel way), or set the corresponding HOSTONLY bit in the event
selector (the AMD way). The counter doesn't count when a guest is running.
However, neither way works with the passthrough vPMU introduced here.
A guest owns all the PMU counters when it's running, so the host should
not mask any counters: a counter may be in use by the guest, and its
event selector may be overwritten.
Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume the counting when
exiting the guest.
Expose two interfaces to KVM. KVM should notify perf when
entering/exiting a guest.
It's possible that an exclude_guest event is created while a guest is
running. Such a new event should not be scheduled in either.
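The expected pairing on the KVM side looks roughly like the sketch below
(illustrative only; the real call sites appear later in this series):

	/* Runs on the VM-entry path with IRQs already disabled. */
	static void mediated_vpmu_run_example(struct kvm_vcpu *vcpu)
	{
		lockdep_assert_irqs_disabled();

		perf_guest_enter();	/* exclude_guest events scheduled out */

		/* ... load guest PMU state, enter the guest, handle the VM-exit ... */

		perf_guest_exit();	/* exclude_guest events scheduled back in */
	}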
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
include/linux/perf_event.h | 4 ++
kernel/events/core.c | 104 +++++++++++++++++++++++++++++++++++++
2 files changed, 108 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dd4920bf3d1b..acf16676401a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1734,6 +1734,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
extern u64 perf_event_pause(struct perf_event *event, bool reset);
extern int perf_get_mediated_pmu(void);
extern void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1826,6 +1828,8 @@ static inline int perf_get_mediated_pmu(void)
}
static inline void perf_put_mediated_pmu(void) { }
+static inline void perf_guest_enter(void) { }
+static inline void perf_guest_exit(void) { }
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 701b622c670e..4c6daf5cc923 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -406,6 +406,7 @@ static atomic_t nr_include_guest_events __read_mostly;
static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);
static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);
/* !exclude_guest system wide event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
static inline bool is_include_guest_event(struct perf_event *event)
@@ -3854,6 +3855,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
if (!event_filter_match(event))
return 0;
+ /*
+ * Don't schedule in any exclude_guest events of PMU with
+ * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
+ */
+ if (__this_cpu_read(perf_in_guest) &&
+ event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
+ event->attr.exclude_guest)
+ return 0;
+
if (group_can_go_on(event, *can_add_hw)) {
if (!group_sched_in(event, ctx))
list_add_tail(&event->active_list, get_event_list(event));
@@ -5791,6 +5801,100 @@ void perf_put_mediated_pmu(void)
}
EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
+static void perf_sched_out_exclude_guest(struct perf_event_context *ctx)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+
+ update_context_time(ctx);
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ struct perf_event *event, *tmp;
+ struct pmu *pmu = pmu_ctx->pmu;
+
+ if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+ continue;
+
+ perf_pmu_disable(pmu);
+
+ /*
+ * All active events must be exclude_guest events.
+ * See perf_get_mediated_pmu().
+ * Unconditionally remove all active events.
+ */
+ list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
+ group_sched_out(event, pmu_ctx->ctx);
+
+ list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
+ group_sched_out(event, pmu_ctx->ctx);
+
+ pmu_ctx->rotate_necessary = 0;
+
+ perf_pmu_enable(pmu);
+ }
+}
+
+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+ lockdep_assert_irqs_disabled();
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ return;
+ }
+
+ perf_sched_out_exclude_guest(&cpuctx->ctx);
+ if (cpuctx->task_ctx)
+ perf_sched_out_exclude_guest(cpuctx->task_ctx);
+
+ __this_cpu_write(perf_in_guest, true);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
+static void perf_sched_in_exclude_guest(struct perf_event_context *ctx)
+{
+ struct perf_event_pmu_context *pmu_ctx;
+
+ update_context_time(ctx);
+ list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+ struct pmu *pmu = pmu_ctx->pmu;
+
+ if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+ continue;
+
+ perf_pmu_disable(pmu);
+ pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
+ pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
+ perf_pmu_enable(pmu);
+ }
+}
+
+void perf_guest_exit(void)
+{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+ lockdep_assert_irqs_disabled();
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ return;
+ }
+
+ __this_cpu_write(perf_in_guest, false);
+
+ perf_sched_in_exclude_guest(&cpuctx->ctx);
+ if (cpuctx->task_ctx)
+ perf_sched_in_exclude_guest(cpuctx->task_ctx);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
/*
* Holding the top-level event's child_mutex means that any
* descendant process that has inherited this event will block
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-05-06 5:29 ` [PATCH v2 07/54] perf: Add generic exclude_guest support Mingwei Zhang
@ 2024-05-07 8:58 ` Peter Zijlstra
2024-06-10 17:23 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 8:58 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:32AM +0000, Mingwei Zhang wrote:
> @@ -5791,6 +5801,100 @@ void perf_put_mediated_pmu(void)
> }
> EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>
> +static void perf_sched_out_exclude_guest(struct perf_event_context *ctx)
> +{
> + struct perf_event_pmu_context *pmu_ctx;
> +
> + update_context_time(ctx);
> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> + struct perf_event *event, *tmp;
> + struct pmu *pmu = pmu_ctx->pmu;
> +
> + if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
> + continue;
> +
> + perf_pmu_disable(pmu);
> +
> + /*
> + * All active events must be exclude_guest events.
> + * See perf_get_mediated_pmu().
> + * Unconditionally remove all active events.
> + */
> + list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
> + group_sched_out(event, pmu_ctx->ctx);
> +
> + list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
> + group_sched_out(event, pmu_ctx->ctx);
> +
> + pmu_ctx->rotate_necessary = 0;
> +
> + perf_pmu_enable(pmu);
> + }
> +}
> +
> +/* When entering a guest, schedule out all exclude_guest events. */
> +void perf_guest_enter(void)
> +{
> + struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> + lockdep_assert_irqs_disabled();
> +
> + perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> + if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> + return;
> + }
> +
> + perf_sched_out_exclude_guest(&cpuctx->ctx);
> + if (cpuctx->task_ctx)
> + perf_sched_out_exclude_guest(cpuctx->task_ctx);
> +
> + __this_cpu_write(perf_in_guest, true);
> +
> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
> +
> +static void perf_sched_in_exclude_guest(struct perf_event_context *ctx)
> +{
> + struct perf_event_pmu_context *pmu_ctx;
> +
> + update_context_time(ctx);
> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> + struct pmu *pmu = pmu_ctx->pmu;
> +
> + if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
> + continue;
> +
> + perf_pmu_disable(pmu);
> + pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
> + pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
> + perf_pmu_enable(pmu);
> + }
> +}
> +
> +void perf_guest_exit(void)
> +{
> + struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> +
> + lockdep_assert_irqs_disabled();
> +
> + perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> + if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> + return;
> + }
> +
> + __this_cpu_write(perf_in_guest, false);
> +
> + perf_sched_in_exclude_guest(&cpuctx->ctx);
> + if (cpuctx->task_ctx)
> + perf_sched_in_exclude_guest(cpuctx->task_ctx);
> +
> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> +}
Bah, this is a ton of copy-paste from the normal scheduling code with
random changes. Why ?
Why can't this use ctx_sched_{in,out}() ? Surely the whole
CAP_PASSTHROUGHT thing is but a flag away.
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-05-07 8:58 ` Peter Zijlstra
@ 2024-06-10 17:23 ` Liang, Kan
2024-06-11 12:06 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Liang, Kan @ 2024-06-10 17:23 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 2024-05-07 4:58 a.m., Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:32AM +0000, Mingwei Zhang wrote:
>
>> @@ -5791,6 +5801,100 @@ void perf_put_mediated_pmu(void)
>> }
>> EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
>>
>> +static void perf_sched_out_exclude_guest(struct perf_event_context *ctx)
>> +{
>> + struct perf_event_pmu_context *pmu_ctx;
>> +
>> + update_context_time(ctx);
>> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> + struct perf_event *event, *tmp;
>> + struct pmu *pmu = pmu_ctx->pmu;
>> +
>> + if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
>> + continue;
>> +
>> + perf_pmu_disable(pmu);
>> +
>> + /*
>> + * All active events must be exclude_guest events.
>> + * See perf_get_mediated_pmu().
>> + * Unconditionally remove all active events.
>> + */
>> + list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
>> + group_sched_out(event, pmu_ctx->ctx);
>> +
>> + list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
>> + group_sched_out(event, pmu_ctx->ctx);
>> +
>> + pmu_ctx->rotate_necessary = 0;
>> +
>> + perf_pmu_enable(pmu);
>> + }
>> +}
>> +
>> +/* When entering a guest, schedule out all exclude_guest events. */
>> +void perf_guest_enter(void)
>> +{
>> + struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>> +
>> + lockdep_assert_irqs_disabled();
>> +
>> + perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> + if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
>> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> + return;
>> + }
>> +
>> + perf_sched_out_exclude_guest(&cpuctx->ctx);
>> + if (cpuctx->task_ctx)
>> + perf_sched_out_exclude_guest(cpuctx->task_ctx);
>> +
>> + __this_cpu_write(perf_in_guest, true);
>> +
>> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +}
>> +
>> +static void perf_sched_in_exclude_guest(struct perf_event_context *ctx)
>> +{
>> + struct perf_event_pmu_context *pmu_ctx;
>> +
>> + update_context_time(ctx);
>> + list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> + struct pmu *pmu = pmu_ctx->pmu;
>> +
>> + if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
>> + continue;
>> +
>> + perf_pmu_disable(pmu);
>> + pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
>> + pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
>> + perf_pmu_enable(pmu);
>> + }
>> +}
>> +
>> +void perf_guest_exit(void)
>> +{
>> + struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>> +
>> + lockdep_assert_irqs_disabled();
>> +
>> + perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> + if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
>> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> + return;
>> + }
>> +
>> + __this_cpu_write(perf_in_guest, false);
>> +
>> + perf_sched_in_exclude_guest(&cpuctx->ctx);
>> + if (cpuctx->task_ctx)
>> + perf_sched_in_exclude_guest(cpuctx->task_ctx);
>> +
>> + perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> +}
>
> Bah, this is a ton of copy-paste from the normal scheduling code with
> random changes. Why ?
>
> Why can't this use ctx_sched_{in,out}() ? Surely the whole
> CAP_PASSTHROUGHT thing is but a flag away.
>
Not just a flag. The time has to be updated as well, since the ctx->time
is shared among PMUs. Perf shouldn't stop it while other PMUs are still
running.
A timeguest will be introduced to track the start time of a guest.
The event->tstamp of an exclude_guest event should always keep
ctx->timeguest while a guest is running.
When a guest is leaving, update the event->tstamp to now, so the guest
time can be deducted.
The patch below demonstrates how the timeguest works.
(It's an incomplete patch. Just to show the idea.)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 22d3e56682c9..2134e6886e22 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -953,6 +953,7 @@ struct perf_event_context {
u64 time;
u64 timestamp;
u64 timeoffset;
+ u64 timeguest;
/*
* These fields let us detect when two contexts have both
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 14fd881e3e1d..2aed56671a24 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -690,12 +690,31 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *runnin
*running += delta;
}
+static void perf_event_update_time_guest(struct perf_event *event)
+{
+ /*
+ * If a guest is running, use the timestamp while entering the guest.
+ * If the guest is leaving, reset the event timestamp.
+ */
+ if (!__this_cpu_read(perf_in_guest))
+ event->tstamp = event->ctx->time;
+ else
+ event->tstamp = event->ctx->timeguest;
+}
+
static void perf_event_update_time(struct perf_event *event)
{
- u64 now = perf_event_time(event);
+ u64 now;
+
+ /* Never count the time of an active guest into an exclude_guest event. */
+ if (event->ctx->timeguest &&
+ event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU)
+ return perf_event_update_time_guest(event);
+ now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled,
&event->total_time_running);
+
event->tstamp = now;
}
@@ -3398,7 +3417,14 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
cpuctx->task_ctx = NULL;
}
- is_active ^= ctx->is_active; /* changed bits */
+ if (event_type & EVENT_GUEST) {
+ /*
+ * Schedule out all !exclude_guest events of PMU
+ * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+ */
+ is_active = EVENT_ALL;
+ } else
+ is_active ^= ctx->is_active; /* changed bits */
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
if (perf_skip_pmu_ctx(pmu_ctx, event_type))
@@ -3998,7 +4024,20 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
WARN_ON_ONCE(cpuctx->task_ctx != ctx);
}
- is_active ^= ctx->is_active; /* changed bits */
+ if (event_type & EVENT_GUEST) {
+ /*
+ * Schedule in all !exclude_guest events of PMU
+ * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+ */
+ is_active = EVENT_ALL;
+
+ /*
+ * Update ctx time to set the new start time for
+ * the exclude_guest events.
+ */
+ update_context_time(ctx);
+ } else
+ is_active ^= ctx->is_active; /* changed bits */
/*
* First go through the list and put on any pinned groups
@@ -5894,14 +5933,18 @@ void perf_guest_enter(u32 guest_lvtpc)
}
perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
- ctx_sched_out(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
+ ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
+ /* Set the guest start time */
+ cpuctx->ctx.timeguest = cpuctx->ctx.time;
perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
if (cpuctx->task_ctx) {
perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
- task_ctx_sched_out(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
+ task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
+ cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
}
__this_cpu_write(perf_in_guest, true);
@@ -5925,14 +5968,17 @@ void perf_guest_exit(void)
__this_cpu_write(perf_in_guest, false);
perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
- ctx_sched_in(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
+ ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
+ cpuctx->ctx.timeguest = 0;
perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
if (cpuctx->task_ctx) {
perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
- ctx_sched_in(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
+ ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
+ cpuctx->task_ctx->timeguest = 0;
perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
}
Thanks
Kan
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-10 17:23 ` Liang, Kan
@ 2024-06-11 12:06 ` Peter Zijlstra
2024-06-11 13:27 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-06-11 12:06 UTC (permalink / raw)
To: Liang, Kan
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Mon, Jun 10, 2024 at 01:23:04PM -0400, Liang, Kan wrote:
> On 2024-05-07 4:58 a.m., Peter Zijlstra wrote:
> > Bah, this is a ton of copy-paste from the normal scheduling code with
> > random changes. Why ?
> >
> > Why can't this use ctx_sched_{in,out}() ? Surely the whole
> > CAP_PASSTHROUGHT thing is but a flag away.
> >
>
> Not just a flag. The time has to be updated as well, since the ctx->time
> is shared among PMUs. Perf shouldn't stop it while other PMUs is still
> running.
Obviously the original changelog didn't mention any of that.... :/
> A timeguest will be introduced to track the start time of a guest.
> The event->tstamp of an exclude_guest event should always keep
> ctx->timeguest while a guest is running.
> When a guest is leaving, update the event->tstamp to now, so the guest
> time can be deducted.
>
> The below patch demonstrate how the timeguest works.
> (It's an incomplete patch. Just to show the idea.)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 22d3e56682c9..2134e6886e22 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -953,6 +953,7 @@ struct perf_event_context {
> u64 time;
> u64 timestamp;
> u64 timeoffset;
> + u64 timeguest;
>
> /*
> * These fields let us detect when two contexts have both
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 14fd881e3e1d..2aed56671a24 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -690,12 +690,31 @@ __perf_update_times(struct perf_event *event, u64
> now, u64 *enabled, u64 *runnin
> *running += delta;
> }
>
> +static void perf_event_update_time_guest(struct perf_event *event)
> +{
> + /*
> + * If a guest is running, use the timestamp while entering the guest.
> + * If the guest is leaving, reset the event timestamp.
> + */
> + if (!__this_cpu_read(perf_in_guest))
> + event->tstamp = event->ctx->time;
> + else
> + event->tstamp = event->ctx->timeguest;
> +}
This conditional seems inverted, without a good reason. Also, in another
thread you talk about some PMUs stopping time in a guest, while other
PMUs would keep ticking. I don't think the above captures that.
> static void perf_event_update_time(struct perf_event *event)
> {
> - u64 now = perf_event_time(event);
> + u64 now;
> +
> + /* Never count the time of an active guest into an exclude_guest event. */
> + if (event->ctx->timeguest &&
> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU)
> + return perf_event_update_time_guest(event);
Urgh, weird split. The PMU check is here. Please just inline the above
here, this seems to be the sole caller anyway.
>
> + now = perf_event_time(event);
> __perf_update_times(event, now, &event->total_time_enabled,
> &event->total_time_running);
> +
> event->tstamp = now;
> }
>
> @@ -3398,7 +3417,14 @@ ctx_sched_out(struct perf_event_context *ctx,
> enum event_type_t event_type)
> cpuctx->task_ctx = NULL;
> }
>
> - is_active ^= ctx->is_active; /* changed bits */
> + if (event_type & EVENT_GUEST) {
Patch doesn't introduce EVENT_GUEST, lost a hunk somewhere?
> + /*
> + * Schedule out all !exclude_guest events of PMU
> + * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> + */
> + is_active = EVENT_ALL;
> + } else
> + is_active ^= ctx->is_active; /* changed bits */
>
> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> @@ -5894,14 +5933,18 @@ void perf_guest_enter(u32 guest_lvtpc)
> }
>
> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> - ctx_sched_out(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
> + ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
> + /* Set the guest start time */
> + cpuctx->ctx.timeguest = cpuctx->ctx.time;
> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> if (cpuctx->task_ctx) {
> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> - task_ctx_sched_out(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
> + task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
> + cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> }
>
> __this_cpu_write(perf_in_guest, true);
> @@ -5925,14 +5968,17 @@ void perf_guest_exit(void)
>
> __this_cpu_write(perf_in_guest, false);
>
> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> - ctx_sched_in(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
> + ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
> + cpuctx->ctx.timeguest = 0;
> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> if (cpuctx->task_ctx) {
> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> - ctx_sched_in(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
> + ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
> + cpuctx->task_ctx->timeguest = 0;
> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> }
I'm thinking EVENT_GUEST should cause the ->timeguest updates, no point
in having them explicitly duplicated here, hmm?
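(For concreteness, a rough sketch of what this suggestion could look like; this is hypothetical, not code from any posted patch, and the field and flag names simply follow the demonstration patch above.)

	/* e.g. at the tail of ctx_sched_out(), after the pmu_ctx loop: */
	if (event_type & EVENT_GUEST)
		ctx->timeguest = ctx->time;

	/* ...and the mirror image at the tail of ctx_sched_in(): */
	if (event_type & EVENT_GUEST)
		ctx->timeguest = 0;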
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-11 12:06 ` Peter Zijlstra
@ 2024-06-11 13:27 ` Liang, Kan
2024-06-12 11:17 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Liang, Kan @ 2024-06-11 13:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 2024-06-11 8:06 a.m., Peter Zijlstra wrote:
> On Mon, Jun 10, 2024 at 01:23:04PM -0400, Liang, Kan wrote:
>> On 2024-05-07 4:58 a.m., Peter Zijlstra wrote:
>
>>> Bah, this is a ton of copy-paste from the normal scheduling code with
>>> random changes. Why ?
>>>
>>> Why can't this use ctx_sched_{in,out}() ? Surely the whole
>>> CAP_PASSTHROUGHT thing is but a flag away.
>>>
>>
>> Not just a flag. The time has to be updated as well, since the ctx->time
>> is shared among PMUs. Perf shouldn't stop it while other PMUs is still
>> running.
>
> Obviously the original changelog didn't mention any of that.... :/
Yes, the time issue was a newly found bug when we tested the uncore PMUs.
>
>> A timeguest will be introduced to track the start time of a guest.
>> The event->tstamp of an exclude_guest event should always keep
>> ctx->timeguest while a guest is running.
>> When a guest is leaving, update the event->tstamp to now, so the guest
>> time can be deducted.
>>
>> The below patch demonstrate how the timeguest works.
>> (It's an incomplete patch. Just to show the idea.)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 22d3e56682c9..2134e6886e22 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -953,6 +953,7 @@ struct perf_event_context {
>> u64 time;
>> u64 timestamp;
>> u64 timeoffset;
>> + u64 timeguest;
>>
>> /*
>> * These fields let us detect when two contexts have both
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 14fd881e3e1d..2aed56671a24 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -690,12 +690,31 @@ __perf_update_times(struct perf_event *event, u64
>> now, u64 *enabled, u64 *runnin
>> *running += delta;
>> }
>>
>> +static void perf_event_update_time_guest(struct perf_event *event)
>> +{
>> + /*
>> + * If a guest is running, use the timestamp while entering the guest.
>> + * If the guest is leaving, reset the event timestamp.
>> + */
>> + if (!__this_cpu_read(perf_in_guest))
>> + event->tstamp = event->ctx->time;
>> + else
>> + event->tstamp = event->ctx->timeguest;
>> +}
>
> This conditional seems inverted, without a good reason. Also, in another
> thread you talk about some PMUs stopping time in a guest, while other
> PMUs would keep ticking. I don't think the above captures that.
>
>> static void perf_event_update_time(struct perf_event *event)
>> {
>> - u64 now = perf_event_time(event);
>> + u64 now;
>> +
>> + /* Never count the time of an active guest into an exclude_guest event. */
>> + if (event->ctx->timeguest &&
>> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU)
>> + return perf_event_update_time_guest(event);
>
> Urgh, weird split. The PMU check is here. Please just inline the above
> here, this seems to be the sole caller anyway.
>
Sure
>>
>> + now = perf_event_time(event);
>> __perf_update_times(event, now, &event->total_time_enabled,
>> &event->total_time_running);
>> +
>> event->tstamp = now;
>> }
>>
>> @@ -3398,7 +3417,14 @@ ctx_sched_out(struct perf_event_context *ctx,
>> enum event_type_t event_type)
>> cpuctx->task_ctx = NULL;
>> }
>>
>> - is_active ^= ctx->is_active; /* changed bits */
>> + if (event_type & EVENT_GUEST) {
>
> Patch doesn't introduce EVENT_GUEST, lost a hunk somewhere?
Sorry, there is a prerequisite patch that factors out EVENT_CGROUP.
I thought it would be complex and confusing to paste both, so some
details were lost.
I will post both patches at the end.
>
>> + /*
>> + * Schedule out all !exclude_guest events of PMU
>> + * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> + */
>> + is_active = EVENT_ALL;
>> + } else
>> + is_active ^= ctx->is_active; /* changed bits */
>>
>> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>
>> @@ -5894,14 +5933,18 @@ void perf_guest_enter(u32 guest_lvtpc)
>> }
>>
>> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> - ctx_sched_out(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
>> + ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
>> + /* Set the guest start time */
>> + cpuctx->ctx.timeguest = cpuctx->ctx.time;
>> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> if (cpuctx->task_ctx) {
>> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> - task_ctx_sched_out(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
>> + task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
>> + cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
>> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>> }
>>
>> __this_cpu_write(perf_in_guest, true);
>> @@ -5925,14 +5968,17 @@ void perf_guest_exit(void)
>>
>> __this_cpu_write(perf_in_guest, false);
>>
>> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> - ctx_sched_in(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
>> + ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>> + cpuctx->ctx.timeguest = 0;
>> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> if (cpuctx->task_ctx) {
>> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> - ctx_sched_in(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
>> + ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
>> + cpuctx->task_ctx->timeguest = 0;
>> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>> }
>
> I'm thinking EVENT_GUEST should cause the ->timeguest updates, no point
> in having them explicitly duplicated here, hmm?
We would have to add an EVENT_GUEST check and update ->timeguest at the
end of ctx_sched_out/in(), after pmu_ctx_sched_out/in(), because
->timeguest is also used in perf_event_update_time() to check whether we
are leaving the guest.
Since EVENT_GUEST is only used by perf_guest_enter/exit(), I thought it
would be better to move the update to where it is used rather than into
the generic ctx_sched_out/in(). That minimizes the impact on
non-virtualization users.
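As a quick illustration of the accounting being discussed (a standalone toy model, not kernel code; every name in it is made up for the example): a guest span is excluded simply by moving the event's timestamp past it.

#include <stdio.h>
#include <stdint.h>

/* Toy model of the exclude_guest accounting described above. */
struct toy_event {
	uint64_t tstamp;	/* last accounting point */
	uint64_t running;	/* time accumulated outside the guest */
};

static void toy_update(struct toy_event *e, uint64_t ctx_time)
{
	e->running += ctx_time - e->tstamp;
	e->tstamp = ctx_time;
}

int main(void)
{
	struct toy_event ev = { .tstamp = 0, .running = 0 };

	toy_update(&ev, 100);	/* host runs for 100: counted */
	/* guest entry at ctx time 100; guest runs for 30 */
	ev.tstamp = 130;	/* guest exit: jump the timestamp past the guest span */
	toy_update(&ev, 150);	/* host runs for 20 more: counted */

	/* prints 120: the 30 spent in the guest is deducted from the 150 total */
	printf("running = %llu\n", (unsigned long long)ev.running);
	return 0;
}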
Here are the two complete patches.
The first one is to factor out the EVENT_CGROUP.
From c508f2b0e11a2eea71fe3ff16d1d848359ede535 Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang@linux.intel.com>
Date: Mon, 27 May 2024 06:58:29 -0700
Subject: [PATCH 1/2] perf: Skip pmu_ctx based on event_type
To optimize the cgroup context switch, the perf_event_pmu_context
iteration skips the PMUs without cgroup events. A bool cgroup was
introduced to indicate the case. That works, but it is hard to extend
to other cases, e.g. skipping non-passthrough PMUs, and it doesn't make
sense to keep adding bool variables.
Pass the event_type instead of a specific bool variable, and check both
the event_type and the related pmu_ctx fields to decide whether to skip
a PMU.
Event flags, e.g. EVENT_CGROUP, should be cleared from ctx->is_active.
Add EVENT_FLAGS to mark such flags.
No functional change.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
kernel/events/core.c | 70 +++++++++++++++++++++++---------------------
1 file changed, 37 insertions(+), 33 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abd4027e3859..95d1d5a5addc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -376,6 +376,7 @@ enum event_type_t {
/* see ctx_resched() for details */
EVENT_CPU = 0x8,
EVENT_CGROUP = 0x10,
+ EVENT_FLAGS = EVENT_CGROUP,
EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
};
@@ -699,23 +700,32 @@ do { \
___p; \
})
-static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup)
+static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
+ enum event_type_t event_type)
+{
+ if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
+ return true;
+
+ return false;
+}
+
+static void perf_ctx_disable(struct perf_event_context *ctx, enum event_type_t event_type)
{
struct perf_event_pmu_context *pmu_ctx;
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
+ if (perf_skip_pmu_ctx(pmu_ctx, event_type))
continue;
perf_pmu_disable(pmu_ctx->pmu);
}
}
-static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup)
+static void perf_ctx_enable(struct perf_event_context *ctx, enum event_type_t event_type)
{
struct perf_event_pmu_context *pmu_ctx;
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
+ if (perf_skip_pmu_ctx(pmu_ctx, event_type))
continue;
perf_pmu_enable(pmu_ctx->pmu);
}
@@ -877,7 +887,7 @@ static void perf_cgroup_switch(struct task_struct *task)
return;
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- perf_ctx_disable(&cpuctx->ctx, true);
+ perf_ctx_disable(&cpuctx->ctx, EVENT_CGROUP);
ctx_sched_out(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP);
/*
@@ -893,7 +903,7 @@ static void perf_cgroup_switch(struct task_struct *task)
*/
ctx_sched_in(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP);
- perf_ctx_enable(&cpuctx->ctx, true);
+ perf_ctx_enable(&cpuctx->ctx, EVENT_CGROUP);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
@@ -2729,9 +2739,9 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
event_type &= EVENT_ALL;
- perf_ctx_disable(&cpuctx->ctx, false);
+ perf_ctx_disable(&cpuctx->ctx, 0);
if (task_ctx) {
- perf_ctx_disable(task_ctx, false);
+ perf_ctx_disable(task_ctx, 0);
task_ctx_sched_out(task_ctx, event_type);
}
@@ -2749,9 +2759,9 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
perf_event_sched_in(cpuctx, task_ctx);
- perf_ctx_enable(&cpuctx->ctx, false);
+ perf_ctx_enable(&cpuctx->ctx, 0);
if (task_ctx)
- perf_ctx_enable(task_ctx, false);
+ perf_ctx_enable(task_ctx, 0);
}
void perf_pmu_resched(struct pmu *pmu)
@@ -3296,9 +3306,6 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
struct perf_event_pmu_context *pmu_ctx;
int is_active = ctx->is_active;
- bool cgroup = event_type & EVENT_CGROUP;
-
- event_type &= ~EVENT_CGROUP;
lockdep_assert_held(&ctx->lock);
@@ -3333,7 +3340,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
barrier();
}
- ctx->is_active &= ~event_type;
+ ctx->is_active &= ~(event_type & ~EVENT_FLAGS);
if (!(ctx->is_active & EVENT_ALL))
ctx->is_active = 0;
@@ -3346,7 +3353,7 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
is_active ^= ctx->is_active; /* changed bits */
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
+ if (perf_skip_pmu_ctx(pmu_ctx, event_type))
continue;
__pmu_ctx_sched_out(pmu_ctx, is_active);
}
@@ -3540,7 +3547,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
if (context_equiv(ctx, next_ctx)) {
- perf_ctx_disable(ctx, false);
+ perf_ctx_disable(ctx, 0);
/* PMIs are disabled; ctx->nr_pending is stable. */
if (local_read(&ctx->nr_pending) ||
@@ -3560,7 +3567,7 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
perf_ctx_sched_task_cb(ctx, false);
perf_event_swap_task_ctx_data(ctx, next_ctx);
- perf_ctx_enable(ctx, false);
+ perf_ctx_enable(ctx, 0);
/*
* RCU_INIT_POINTER here is safe because we've not
@@ -3584,13 +3591,13 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
if (do_switch) {
raw_spin_lock(&ctx->lock);
- perf_ctx_disable(ctx, false);
+ perf_ctx_disable(ctx, 0);
inside_switch:
perf_ctx_sched_task_cb(ctx, false);
task_ctx_sched_out(ctx, EVENT_ALL);
- perf_ctx_enable(ctx, false);
+ perf_ctx_enable(ctx, 0);
raw_spin_unlock(&ctx->lock);
}
}
@@ -3887,12 +3894,12 @@ static void pmu_groups_sched_in(struct perf_event_context *ctx,
static void ctx_groups_sched_in(struct perf_event_context *ctx,
struct perf_event_groups *groups,
- bool cgroup)
+ enum event_type_t event_type)
{
struct perf_event_pmu_context *pmu_ctx;
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
- if (cgroup && !pmu_ctx->nr_cgroups)
+ if (perf_skip_pmu_ctx(pmu_ctx, event_type))
continue;
pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
}
@@ -3909,9 +3916,6 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
{
struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
int is_active = ctx->is_active;
- bool cgroup = event_type & EVENT_CGROUP;
-
- event_type &= ~EVENT_CGROUP;
lockdep_assert_held(&ctx->lock);
@@ -3929,7 +3933,7 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
barrier();
}
- ctx->is_active |= (event_type | EVENT_TIME);
+ ctx->is_active |= ((event_type & ~EVENT_FLAGS) | EVENT_TIME);
if (ctx->task) {
if (!is_active)
cpuctx->task_ctx = ctx;
@@ -3944,11 +3948,11 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
* in order to give them the best chance of going on.
*/
if (is_active & EVENT_PINNED)
- ctx_groups_sched_in(ctx, &ctx->pinned_groups, cgroup);
+ ctx_groups_sched_in(ctx, &ctx->pinned_groups, event_type);
/* Then walk through the lower prio flexible groups */
if (is_active & EVENT_FLEXIBLE)
- ctx_groups_sched_in(ctx, &ctx->flexible_groups, cgroup);
+ ctx_groups_sched_in(ctx, &ctx->flexible_groups, event_type);
}
static void perf_event_context_sched_in(struct task_struct *task)
@@ -3963,11 +3967,11 @@ static void perf_event_context_sched_in(struct task_struct *task)
if (cpuctx->task_ctx == ctx) {
perf_ctx_lock(cpuctx, ctx);
- perf_ctx_disable(ctx, false);
+ perf_ctx_disable(ctx, 0);
perf_ctx_sched_task_cb(ctx, true);
- perf_ctx_enable(ctx, false);
+ perf_ctx_enable(ctx, 0);
perf_ctx_unlock(cpuctx, ctx);
goto rcu_unlock;
}
@@ -3980,7 +3984,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
if (!ctx->nr_events)
goto unlock;
- perf_ctx_disable(ctx, false);
+ perf_ctx_disable(ctx, 0);
/*
* We want to keep the following priority order:
* cpu pinned (that don't need to move), task pinned,
@@ -3990,7 +3994,7 @@ static void perf_event_context_sched_in(struct task_struct *task)
* events, no need to flip the cpuctx's events around.
*/
if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
- perf_ctx_disable(&cpuctx->ctx, false);
+ perf_ctx_disable(&cpuctx->ctx, 0);
ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
}
@@ -3999,9 +4003,9 @@ static void perf_event_context_sched_in(struct task_struct *task)
perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
- perf_ctx_enable(&cpuctx->ctx, false);
+ perf_ctx_enable(&cpuctx->ctx, 0);
- perf_ctx_enable(ctx, false);
+ perf_ctx_enable(ctx, 0);
unlock:
perf_ctx_unlock(cpuctx, ctx);
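A tiny standalone illustration of why the EVENT_FLAGS mask in the patch above matters (toy code, not kernel code; the enum values just mirror the patch): routing flags such as EVENT_CGROUP must not leak into ctx->is_active, which only tracks real scheduling state.

#include <stdio.h>

/* Toy reproduction of the EVENT_FLAGS masking in the patch above. */
enum {
	EVENT_FLEXIBLE	= 0x1,
	EVENT_PINNED	= 0x2,
	EVENT_CGROUP	= 0x10,
	EVENT_FLAGS	= EVENT_CGROUP,
	EVENT_ALL	= EVENT_FLEXIBLE | EVENT_PINNED,
};

int main(void)
{
	int event_type = EVENT_ALL | EVENT_CGROUP;
	int is_active;

	/* Without the mask, the routing flag would leak into is_active. */
	is_active = event_type;
	printf("unmasked: %#x\n", is_active);		/* 0x13 */

	/* With the mask, only real scheduling state is recorded. */
	is_active = event_type & ~EVENT_FLAGS;
	printf("masked:   %#x\n", is_active);		/* 0x3 */
	return 0;
}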
Here is the second patch, which introduces EVENT_GUEST and
perf_guest_exit/enter().
From c052e673dc3e0e7ebacdce23f2b0d50ec98401b3 Mon Sep 17 00:00:00 2001
From: Kan Liang <kan.liang@linux.intel.com>
Date: Tue, 11 Jun 2024 06:04:20 -0700
Subject: [PATCH 2/2] perf: Add generic exclude_guest support
Currently, perf doesn't explicitly schedule out all exclude_guest events
while a guest is running. That is not a problem with the existing
emulated vPMU, because perf owns all the PMU counters. It can mask a
counter which is assigned to an exclude_guest event when a guest is
running (the Intel way), or set the corresponding HOSTONLY bit in the
evntsel (the AMD way), so the counter doesn't count while a guest is
running.
However, neither way works with the introduced passthrough vPMU. A guest
owns all the PMU counters when it's running. The host must not mask any
counters: a counter may be in use by the guest, and the evntsel may be
overwritten.
Perf should explicitly schedule out all exclude_guest events to release
the PMU resources when entering a guest, and resume counting when
exiting the guest.
Expose two interfaces to KVM, which should notify perf when
entering/exiting a guest.
Introduce a new event type, EVENT_GUEST, to indicate that perf should
check and skip the PMUs which don't support the passthrough mode.
It's possible that an exclude_guest event is created while a guest is
running. Such a new event should not be scheduled in either.
ctx->time is used to calculate the running/enabled time of an event and
is shared among PMUs. ctx_sched_in/out() with EVENT_GUEST doesn't stop
ctx->time. A timeguest is introduced to track the start time of a
guest. For an exclude_guest event, the time spent in guest mode is
deducted.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
include/linux/perf_event.h | 5 ++
kernel/events/core.c | 119 +++++++++++++++++++++++++++++++++++--
2 files changed, 120 insertions(+), 4 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dd4920bf3d1b..68c8b93c4e5c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -945,6 +945,7 @@ struct perf_event_context {
u64 time;
u64 timestamp;
u64 timeoffset;
+ u64 timeguest;
/*
* These fields let us detect when two contexts have both
@@ -1734,6 +1735,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
extern u64 perf_event_pause(struct perf_event *event, bool reset);
extern int perf_get_mediated_pmu(void);
extern void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1826,6 +1829,8 @@ static inline int perf_get_mediated_pmu(void)
}
static inline void perf_put_mediated_pmu(void) { }
+static inline void perf_guest_enter(void) { }
+static inline void perf_guest_exit(void) { }
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 95d1d5a5addc..cd3a89672b14 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -376,7 +376,8 @@ enum event_type_t {
/* see ctx_resched() for details */
EVENT_CPU = 0x8,
EVENT_CGROUP = 0x10,
- EVENT_FLAGS = EVENT_CGROUP,
+ EVENT_GUEST = 0x20,
+ EVENT_FLAGS = EVENT_CGROUP | EVENT_GUEST,
EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
};
@@ -407,6 +408,7 @@ static atomic_t nr_include_guest_events __read_mostly;
static atomic_t nr_mediated_pmu_vms;
static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);
/* !exclude_guest event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
static inline bool is_include_guest_event(struct perf_event *event)
@@ -651,10 +653,26 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *runnin
static void perf_event_update_time(struct perf_event *event)
{
- u64 now = perf_event_time(event);
+ u64 now;
+
+ /* Never count the time of an active guest into an exclude_guest event. */
+ if (event->ctx->timeguest &&
+ event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
+ /*
+ * If a guest is running, use the timestamp while entering the guest.
+ * If the guest is leaving, reset the event timestamp.
+ */
+ if (__this_cpu_read(perf_in_guest))
+ event->tstamp = event->ctx->timeguest;
+ else
+ event->tstamp = event->ctx->time;
+ return;
+ }
+ now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled,
&event->total_time_running);
+
event->tstamp = now;
}
@@ -706,6 +724,10 @@ static bool perf_skip_pmu_ctx(struct perf_event_pmu_context *pmu_ctx,
if ((event_type & EVENT_CGROUP) && !pmu_ctx->nr_cgroups)
return true;
+ if ((event_type & EVENT_GUEST) &&
+ !(pmu_ctx->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+ return true;
+
return false;
}
@@ -3350,7 +3372,14 @@ ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
cpuctx->task_ctx = NULL;
}
- is_active ^= ctx->is_active; /* changed bits */
+ if (event_type & EVENT_GUEST) {
+ /*
+ * Schedule out all !exclude_guest events of PMU
+ * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+ */
+ is_active = EVENT_ALL;
+ } else
+ is_active ^= ctx->is_active; /* changed bits */
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
if (perf_skip_pmu_ctx(pmu_ctx, event_type))
@@ -3860,6 +3889,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
if (!event_filter_match(event))
return 0;
+ /*
+ * Don't schedule in any exclude_guest events of PMU with
+ * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
+ */
+ if (__this_cpu_read(perf_in_guest) &&
+ event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
+ event->attr.exclude_guest)
+ return 0;
+
if (group_can_go_on(event, *can_add_hw)) {
if (!group_sched_in(event, ctx))
list_add_tail(&event->active_list, get_event_list(event));
@@ -3941,7 +3979,20 @@ ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
WARN_ON_ONCE(cpuctx->task_ctx != ctx);
}
- is_active ^= ctx->is_active; /* changed bits */
+ if (event_type & EVENT_GUEST) {
+ /*
+ * Schedule in all !exclude_guest events of PMU
+ * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
+ */
+ is_active = EVENT_ALL;
+
+ /*
+ * Update ctx time to set the new start time for
+ * the exclude_guest events.
+ */
+ update_context_time(ctx);
+ } else
+ is_active ^= ctx->is_active; /* changed bits */
/*
* First go through the list and put on any pinned groups
@@ -5788,6 +5839,66 @@ void perf_put_mediated_pmu(void)
}
EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+ lockdep_assert_irqs_disabled();
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ return;
+ }
+
+ perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+ ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
+ /* Set the guest start time */
+ cpuctx->ctx.timeguest = cpuctx->ctx.time;
+ perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+ if (cpuctx->task_ctx) {
+ perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+ task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
+ cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
+ perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+ }
+
+ __this_cpu_write(perf_in_guest, true);
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
+void perf_guest_exit(void)
+{
+ struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+ lockdep_assert_irqs_disabled();
+
+ perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+ if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+ return;
+ }
+
+ __this_cpu_write(perf_in_guest, false);
+
+ perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
+ ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
+ cpuctx->ctx.timeguest = 0;
+ perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
+ if (cpuctx->task_ctx) {
+ perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
+ ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
+ cpuctx->task_ctx->timeguest = 0;
+ perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
+ }
+
+ perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
/*
* Holding the top-level event's child_mutex means that any
* descendant process that has inherited this event will block
Thanks,
Kan
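For context, a minimal sketch of how the hypervisor side might drive these two hooks (illustrative only; the actual KVM call sites live in other patches of this series, and the wrapper function below is invented for the example). Both hooks assert that interrupts are disabled, so they must be called from the irq-off VM-entry/exit path.

/* Illustrative only: wrap a guest run with the new hooks. */
static void vcpu_run_guest_once(void)
{
	local_irq_disable();

	/* Hand the PMU over: schedule out host exclude_guest events. */
	perf_guest_enter();

	/* ... load guest PMU state and run until the next VM-exit ... */

	/* Take the PMU back: schedule host exclude_guest events back in. */
	perf_guest_exit();

	local_irq_enable();
}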
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-11 13:27 ` Liang, Kan
@ 2024-06-12 11:17 ` Peter Zijlstra
2024-06-12 13:38 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-06-12 11:17 UTC (permalink / raw)
To: Liang, Kan
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Tue, Jun 11, 2024 at 09:27:46AM -0400, Liang, Kan wrote:
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index dd4920bf3d1b..68c8b93c4e5c 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -945,6 +945,7 @@ struct perf_event_context {
> u64 time;
> u64 timestamp;
> u64 timeoffset;
> + u64 timeguest;
>
> /*
> * These fields let us detect when two contexts have both
> @@ -651,10 +653,26 @@ __perf_update_times(struct perf_event *event, u64
> now, u64 *enabled, u64 *runnin
>
> static void perf_event_update_time(struct perf_event *event)
> {
> - u64 now = perf_event_time(event);
> + u64 now;
> +
> + /* Never count the time of an active guest into an exclude_guest event. */
> + if (event->ctx->timeguest &&
> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> + /*
> + * If a guest is running, use the timestamp while entering the guest.
> + * If the guest is leaving, reset the event timestamp.
> + */
> + if (__this_cpu_read(perf_in_guest))
> + event->tstamp = event->ctx->timeguest;
> + else
> + event->tstamp = event->ctx->time;
> + return;
> + }
>
> + now = perf_event_time(event);
> __perf_update_times(event, now, &event->total_time_enabled,
> &event->total_time_running);
> +
> event->tstamp = now;
> }
So I really don't like this much, and AFAICT this is broken. At the very
least this doesn't work right for cgroup events, because they have their
own timeline.
Let me have a poke...
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-12 11:17 ` Peter Zijlstra
@ 2024-06-12 13:38 ` Liang, Kan
2024-06-13 9:15 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Liang, Kan @ 2024-06-12 13:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 2024-06-12 7:17 a.m., Peter Zijlstra wrote:
> On Tue, Jun 11, 2024 at 09:27:46AM -0400, Liang, Kan wrote:
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index dd4920bf3d1b..68c8b93c4e5c 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -945,6 +945,7 @@ struct perf_event_context {
>> u64 time;
>> u64 timestamp;
>> u64 timeoffset;
>> + u64 timeguest;
>>
>> /*
>> * These fields let us detect when two contexts have both
>
>> @@ -651,10 +653,26 @@ __perf_update_times(struct perf_event *event, u64
>> now, u64 *enabled, u64 *runnin
>>
>> static void perf_event_update_time(struct perf_event *event)
>> {
>> - u64 now = perf_event_time(event);
>> + u64 now;
>> +
>> + /* Never count the time of an active guest into an exclude_guest event. */
>> + if (event->ctx->timeguest &&
>> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>> + /*
>> + * If a guest is running, use the timestamp while entering the guest.
>> + * If the guest is leaving, reset the event timestamp.
>> + */
>> + if (__this_cpu_read(perf_in_guest))
>> + event->tstamp = event->ctx->timeguest;
>> + else
>> + event->tstamp = event->ctx->time;
>> + return;
>> + }
>>
>> + now = perf_event_time(event);
>> __perf_update_times(event, now, &event->total_time_enabled,
>> &event->total_time_running);
>> +
>> event->tstamp = now;
>> }
>
> So I really don't like this much,
An alternative I can imagine is to maintain a dedicated timeline for the
PASSTHROUGH PMUs. For that, we would probably need two new timelines,
one for the normal events and one for the cgroup events. That sounds too
complex.
> and AFAICT this is broken. At the very
> least this doesn't work right for cgroup events, because they have their
> own timeline.
I think we just need a new start time for an event, so we can use
perf_event_time() in place of ctx->time for the cgroup events.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 019c237dd456..6c46699c6752 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -665,7 +665,7 @@ static void perf_event_update_time(struct perf_event
*event)
if (__this_cpu_read(perf_in_guest))
event->tstamp = event->ctx->timeguest;
else
- event->tstamp = event->ctx->time;
+ event->tstamp = perf_event_time(event);
return;
}
>
> Let me have a poke...
Sure.
Thanks,
Kan
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-12 13:38 ` Liang, Kan
@ 2024-06-13 9:15 ` Peter Zijlstra
2024-06-13 13:37 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-06-13 9:15 UTC (permalink / raw)
To: Liang, Kan
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Wed, Jun 12, 2024 at 09:38:06AM -0400, Liang, Kan wrote:
> On 2024-06-12 7:17 a.m., Peter Zijlstra wrote:
> > On Tue, Jun 11, 2024 at 09:27:46AM -0400, Liang, Kan wrote:
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index dd4920bf3d1b..68c8b93c4e5c 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -945,6 +945,7 @@ struct perf_event_context {
> >> u64 time;
> >> u64 timestamp;
> >> u64 timeoffset;
> >> + u64 timeguest;
> >>
> >> /*
> >> * These fields let us detect when two contexts have both
> >
> >> @@ -651,10 +653,26 @@ __perf_update_times(struct perf_event *event, u64
> >> now, u64 *enabled, u64 *runnin
> >>
> >> static void perf_event_update_time(struct perf_event *event)
> >> {
> >> - u64 now = perf_event_time(event);
> >> + u64 now;
> >> +
> >> + /* Never count the time of an active guest into an exclude_guest event. */
> >> + if (event->ctx->timeguest &&
> >> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> >> + /*
> >> + * If a guest is running, use the timestamp while entering the guest.
> >> + * If the guest is leaving, reset the event timestamp.
> >> + */
> >> + if (__this_cpu_read(perf_in_guest))
> >> + event->tstamp = event->ctx->timeguest;
> >> + else
> >> + event->tstamp = event->ctx->time;
> >> + return;
> >> + }
> >>
> >> + now = perf_event_time(event);
> >> __perf_update_times(event, now, &event->total_time_enabled,
> >> &event->total_time_running);
> >> +
> >> event->tstamp = now;
> >> }
> >
> > So I really don't like this much,
>
> An alternative way I can imagine may maintain a dedicated timeline for
> the PASSTHROUGH PMUs. For that, we probably need two new timelines for
> the normal events and the cgroup events. That sounds too complex.
I'm afraid we might have to. Specifically, the below:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 019c237dd456..6c46699c6752 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -665,7 +665,7 @@ static void perf_event_update_time(struct perf_event
> *event)
> if (__this_cpu_read(perf_in_guest))
> event->tstamp = event->ctx->timeguest;
> else
> - event->tstamp = event->ctx->time;
> + event->tstamp = perf_event_time(event);
> return;
> }
is still broken in that it (ab)uses event state to track time, and this
goes sideways in case of event overcommit, because then
ctx_sched_{out,in}() will not visit all events.
We've run into that before. Time-keeping really should be per context or
we'll get a ton of pain.
I've ended up with the (uncompiled) below. Yes, it is unfortunate, but
aside from a few cleanups (we could introduce a struct time_ctx { u64
time, stamp, offset }; and fold a bunch of code), this is more or less
the best we can do, I'm afraid.
---
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -947,7 +947,9 @@ struct perf_event_context {
u64 time;
u64 timestamp;
u64 timeoffset;
- u64 timeguest;
+ u64 guest_time;
+ u64 guest_timestamp;
+ u64 guest_timeoffset;
/*
* These fields let us detect when two contexts have both
@@ -1043,6 +1045,9 @@ struct perf_cgroup_info {
u64 time;
u64 timestamp;
u64 timeoffset;
+ u64 guest_time;
+ u64 guest_timestamp;
+ u64 guest_timeoffset;
int active;
};
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -638,26 +638,9 @@ __perf_update_times(struct perf_event *e
static void perf_event_update_time(struct perf_event *event)
{
- u64 now;
-
- /* Never count the time of an active guest into an exclude_guest event. */
- if (event->ctx->timeguest &&
- event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
- /*
- * If a guest is running, use the timestamp while entering the guest.
- * If the guest is leaving, reset the event timestamp.
- */
- if (__this_cpu_read(perf_in_guest))
- event->tstamp = event->ctx->timeguest;
- else
- event->tstamp = event->ctx->time;
- return;
- }
-
- now = perf_event_time(event);
+ u64 now = perf_event_time(event);
__perf_update_times(event, now, &event->total_time_enabled,
&event->total_time_running);
-
event->tstamp = now;
}
@@ -780,19 +763,33 @@ static inline int is_cgroup_event(struct
static inline u64 perf_cgroup_event_time(struct perf_event *event)
{
struct perf_cgroup_info *t;
+ u64 time;
t = per_cpu_ptr(event->cgrp->info, event->cpu);
- return t->time;
+ time = t->time;
+ if (event->attr.exclude_guest)
+ time -= t->guest_time;
+ return time;
}
static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
{
struct perf_cgroup_info *t;
+ u64 time, guest_time;
t = per_cpu_ptr(event->cgrp->info, event->cpu);
- if (!__load_acquire(&t->active))
- return t->time;
- now += READ_ONCE(t->timeoffset);
+ if (!__load_acquire(&t->active)) {
+ time = t->time;
+ if (event->attr.exclude_guest)
+ time -= t->guest_time;
+ return time;
+ }
+
+ time = now + READ_ONCE(t->timeoffset);
+ if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
+ guest_time = now + READ_ONCE(t->guest_timeoffset);
+ time -= guest_time;
+ }
- return now;
+ return time;
}
@@ -807,6 +804,17 @@ static inline void __update_cgrp_time(st
WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
}
+static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
+{
+ if (adv)
+ info->guest_time += now - info->guest_timestamp;
+ info->guest_timestamp = now;
+ /*
+ * see update_context_time()
+ */
+ WRITE_ONCE(info->guest_timeoffset, info->guest_time - info->guest_timestamp);
+}
+
static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
{
struct perf_cgroup *cgrp = cpuctx->cgrp;
@@ -821,6 +829,8 @@ static inline void update_cgrp_time_from
info = this_cpu_ptr(cgrp->info);
__update_cgrp_time(info, now, true);
+ if (__this_cpu_read(perf_in_guest))
+ __update_cgrp_guest_time(info, now, true);
if (final)
__store_release(&info->active, 0);
}
@@ -1501,14 +1511,39 @@ static void __update_context_time(struct
WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
}
+static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
+{
+ u64 now = ctx->timestamp; /* must be called after __update_context_time(); */
+
+ lockdep_assert_held(&ctx->lock);
+
+ if (adv)
+ ctx->guest_time += now - ctx->guest_timestamp;
+ ctx->guest_timestamp = now;
+
+ /*
+ * The above: time' = time + (now - timestamp), can be re-arranged
+ * into: time` = now + (time - timestamp), which gives a single value
+ * offset to compute future time without locks on.
+ *
+ * See perf_event_time_now(), which can be used from NMI context where
+ * it's (obviously) not possible to acquire ctx->lock in order to read
+ * both the above values in a consistent manner.
+ */
+ WRITE_ONCE(ctx->guest_timeoffset, ctx->guest_time - ctx->guest_timestamp);
+}
+
static void update_context_time(struct perf_event_context *ctx)
{
__update_context_time(ctx, true);
+ if (__this_cpu_read(perf_in_guest))
+ __update_context_guest_time(ctx, true);
}
static u64 perf_event_time(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
+ u64 time;
if (unlikely(!ctx))
return 0;
@@ -1516,12 +1551,17 @@ static u64 perf_event_time(struct perf_e
if (is_cgroup_event(event))
return perf_cgroup_event_time(event);
- return ctx->time;
+ time = ctx->time;
+ if (event->attr.exclude_guest)
+ time -= ctx->guest_time;
+
+ return time;
}
static u64 perf_event_time_now(struct perf_event *event, u64 now)
{
struct perf_event_context *ctx = event->ctx;
+ u64 time, guest_time;
if (unlikely(!ctx))
return 0;
@@ -1529,11 +1569,19 @@ static u64 perf_event_time_now(struct pe
if (is_cgroup_event(event))
return perf_cgroup_event_time_now(event, now);
- if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
- return ctx->time;
+ if (!(__load_acquire(&ctx->is_active) & EVENT_TIME)) {
+ time = ctx->time;
+ if (event->attr.exclude_guest)
+ time -= ctx->guest_time;
+ return time;
+ }
- now += READ_ONCE(ctx->timeoffset);
- return now;
+ time = now + READ_ONCE(ctx->timeoffset);
+ if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
+ guest_time = now + READ_ONCE(ctx->guest_timeoffset);
+ time -= guest_time;
+ }
+ return time;
}
static enum event_type_t get_event_type(struct perf_event *event)
@@ -3340,9 +3388,14 @@ ctx_sched_out(struct perf_event_context
* would only update time for the pinned events.
*/
if (is_active & EVENT_TIME) {
+ bool stop;
+
+ stop = !((ctx->is_active & event_type) & EVENT_ALL) &&
+ ctx == &cpuctx->ctx;
+
/* update (and stop) ctx time */
update_context_time(ctx);
- update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
+ update_cgrp_time_from_cpuctx(cpuctx, stop);
/*
* CPU-release for the below ->is_active store,
* see __load_acquire() in perf_event_time_now()
@@ -3366,8 +3419,12 @@ ctx_sched_out(struct perf_event_context
* with PERF_PMU_CAP_PASSTHROUGH_VPMU.
*/
is_active = EVENT_ALL;
- } else
+ __update_context_guest_time(ctx, false);
+ perf_cgroup_set_guest_timestamp(cpuctx);
+ barrier();
+ } else {
is_active ^= ctx->is_active; /* changed bits */
+ }
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
if (perf_skip_pmu_ctx(pmu_ctx, event_type))
@@ -3866,10 +3923,15 @@ static inline void group_update_userpage
event_update_userpage(event);
}
+struct merge_sched_data {
+ int can_add_hw;
+ enum event_type_t event_type;
+};
+
static int merge_sched_in(struct perf_event *event, void *data)
{
struct perf_event_context *ctx = event->ctx;
- int *can_add_hw = data;
+ struct merge_sched_data *msd = data;
if (event->state <= PERF_EVENT_STATE_OFF)
return 0;
@@ -3881,18 +3943,18 @@ static int merge_sched_in(struct perf_ev
* Don't schedule in any exclude_guest events of PMU with
* PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
*/
- if (__this_cpu_read(perf_in_guest) &&
- event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
- event->attr.exclude_guest)
+ if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest) &&
+ (event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
+ !(msd->event_type & EVENT_GUEST))
return 0;
- if (group_can_go_on(event, *can_add_hw)) {
+ if (group_can_go_on(event, msd->can_add_hw)) {
if (!group_sched_in(event, ctx))
list_add_tail(&event->active_list, get_event_list(event));
}
if (event->state == PERF_EVENT_STATE_INACTIVE) {
- *can_add_hw = 0;
+ msd->can_add_hw = 0;
if (event->attr.pinned) {
perf_cgroup_event_disable(event, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
@@ -3911,11 +3973,15 @@ static int merge_sched_in(struct perf_ev
static void pmu_groups_sched_in(struct perf_event_context *ctx,
struct perf_event_groups *groups,
- struct pmu *pmu)
+ struct pmu *pmu,
+ enum event_type_t event_type)
{
- int can_add_hw = 1;
+ struct merge_sched_data msd = {
+ .can_add_hw = 1,
+ .event_type = event_type,
+ };
visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
- merge_sched_in, &can_add_hw);
+ merge_sched_in, &msd);
}
static void ctx_groups_sched_in(struct perf_event_context *ctx,
@@ -3927,14 +3993,14 @@ static void ctx_groups_sched_in(struct p
list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
if (perf_skip_pmu_ctx(pmu_ctx, event_type))
continue;
- pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
+ pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
}
}
static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
struct pmu *pmu)
{
- pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
+ pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
}
static void
@@ -3949,6 +4015,8 @@ ctx_sched_in(struct perf_event_context *
return;
if (!(is_active & EVENT_TIME)) {
+ /* EVENT_TIME should be active while the guest runs */
+ WARN_ON_ONCE(event_type & EVENT_GUEST);
/* start ctx time */
__update_context_time(ctx, false);
perf_cgroup_set_timestamp(cpuctx);
@@ -3979,8 +4047,11 @@ ctx_sched_in(struct perf_event_context *
* the exclude_guest events.
*/
update_context_time(ctx);
- } else
+ update_cgrp_time_from_cpuctx(cpuctx, false);
+ barrier();
+ } else {
is_active ^= ctx->is_active; /* changed bits */
+ }
/*
* First go through the list and put on any pinned groups
@@ -5832,25 +5903,20 @@ void perf_guest_enter(void)
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- return;
- }
+ if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
+ goto unlock;
perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
- /* Set the guest start time */
- cpuctx->ctx.timeguest = cpuctx->ctx.time;
perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
if (cpuctx->task_ctx) {
perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
- cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
}
__this_cpu_write(perf_in_guest, true);
-
+unlock:
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
@@ -5862,24 +5928,21 @@ void perf_guest_exit(void)
perf_ctx_lock(cpuctx, cpuctx->task_ctx);
- if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
- perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
- return;
- }
-
- __this_cpu_write(perf_in_guest, false);
+ if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
+ goto unlock;
perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
- cpuctx->ctx.timeguest = 0;
perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
if (cpuctx->task_ctx) {
perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
- cpuctx->task_ctx->timeguest = 0;
perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
}
+ __this_cpu_write(perf_in_guest, false);
+
+unlock:
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
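A standalone toy of the offset trick the __update_context_guest_time() comment above relies on (all names here are invented for the example): since time' = time + (now - timestamp) = now + (time - timestamp), publishing the single value offset = time - timestamp under the lock is enough for a lockless reader to reconstruct the current time from its own 'now'.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Updated together under the context lock. */
	uint64_t time = 100, timestamp = 40;
	/* Published once; unsigned wraparound keeps this correct. */
	uint64_t offset = time - timestamp;

	/* A later, lockless reader only knows its own 'now'. */
	uint64_t now = 55;

	/* Both print 115. */
	printf("locked   : %llu\n", (unsigned long long)(time + (now - timestamp)));
	printf("lockless : %llu\n", (unsigned long long)(now + offset));
	return 0;
}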
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-13 9:15 ` Peter Zijlstra
@ 2024-06-13 13:37 ` Liang, Kan
2024-06-13 18:04 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Liang, Kan @ 2024-06-13 13:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 2024-06-13 5:15 a.m., Peter Zijlstra wrote:
> On Wed, Jun 12, 2024 at 09:38:06AM -0400, Liang, Kan wrote:
>> On 2024-06-12 7:17 a.m., Peter Zijlstra wrote:
>>> On Tue, Jun 11, 2024 at 09:27:46AM -0400, Liang, Kan wrote:
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index dd4920bf3d1b..68c8b93c4e5c 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -945,6 +945,7 @@ struct perf_event_context {
>>>> u64 time;
>>>> u64 timestamp;
>>>> u64 timeoffset;
>>>> + u64 timeguest;
>>>>
>>>> /*
>>>> * These fields let us detect when two contexts have both
>>>
>>>> @@ -651,10 +653,26 @@ __perf_update_times(struct perf_event *event, u64
>>>> now, u64 *enabled, u64 *runnin
>>>>
>>>> static void perf_event_update_time(struct perf_event *event)
>>>> {
>>>> - u64 now = perf_event_time(event);
>>>> + u64 now;
>>>> +
>>>> + /* Never count the time of an active guest into an exclude_guest event. */
>>>> + if (event->ctx->timeguest &&
>>>> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>>>> + /*
>>>> + * If a guest is running, use the timestamp while entering the guest.
>>>> + * If the guest is leaving, reset the event timestamp.
>>>> + */
>>>> + if (__this_cpu_read(perf_in_guest))
>>>> + event->tstamp = event->ctx->timeguest;
>>>> + else
>>>> + event->tstamp = event->ctx->time;
>>>> + return;
>>>> + }
>>>>
>>>> + now = perf_event_time(event);
>>>> __perf_update_times(event, now, &event->total_time_enabled,
>>>> &event->total_time_running);
>>>> +
>>>> event->tstamp = now;
>>>> }
>>>
>>> So I really don't like this much,
>>
>> An alternative way I can imagine may maintain a dedicated timeline for
>> the PASSTHROUGH PMUs. For that, we probably need two new timelines for
>> the normal events and the cgroup events. That sounds too complex.
>
> I'm afraid we might have to. Specifically, the below:
>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 019c237dd456..6c46699c6752 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -665,7 +665,7 @@ static void perf_event_update_time(struct perf_event
>> *event)
>> if (__this_cpu_read(perf_in_guest))
>> event->tstamp = event->ctx->timeguest;
>> else
>> - event->tstamp = event->ctx->time;
>> + event->tstamp = perf_event_time(event);
>> return;
>> }
>
> is still broken in that it (ab)uses event state to track time, and this
> goes sideways in case of event overcommit, because then
> ctx_sched_{out,in}() will not visit all events.
>
> We've ran into that before. Time-keeping really should be per context or
> we'll get a ton of pain.
>
> I've ended up with the (uncompiled) below. Yes, it is unfortunate, but
> aside from a few cleanups (we could introduce a struct time_ctx { u64
> time, stamp, offset }; and fold a bunch of code, this is more or less
> the best we can do I'm afraid.
Sure. I will try the code below and implement the cleanup patch as well.
Thanks,
Kan
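(For reference, the "struct time_ctx" cleanup mentioned above might look roughly like the kernel-style sketch below; the field layout follows Peter's suggestion, while the helper name is an assumption, not from any posted patch.)

struct time_ctx {
	u64 time;	/* accumulated time */
	u64 stamp;	/* last update point */
	u64 offset;	/* published for lockless readers */
};

/*
 * One helper that the __update_context_time()/__update_context_guest_time()/
 * __update_cgrp_time() style updates could be folded into.
 */
static void advance_time_ctx(struct time_ctx *t, u64 now, bool adv)
{
	if (adv)
		t->time += now - t->stamp;
	t->stamp = now;
	WRITE_ONCE(t->offset, t->time - t->stamp);
}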
>
> ---
>
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -947,7 +947,9 @@ struct perf_event_context {
> u64 time;
> u64 timestamp;
> u64 timeoffset;
> - u64 timeguest;
> + u64 guest_time;
> + u64 guest_timestamp;
> + u64 guest_timeoffset;
>
> /*
> * These fields let us detect when two contexts have both
> @@ -1043,6 +1045,9 @@ struct perf_cgroup_info {
> u64 time;
> u64 timestamp;
> u64 timeoffset;
> + u64 guest_time;
> + u64 guest_timestamp;
> + u64 guest_timeoffset;
> int active;
> };
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -638,26 +638,9 @@ __perf_update_times(struct perf_event *e
>
> static void perf_event_update_time(struct perf_event *event)
> {
> - u64 now;
> -
> - /* Never count the time of an active guest into an exclude_guest event. */
> - if (event->ctx->timeguest &&
> - event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
> - /*
> - * If a guest is running, use the timestamp while entering the guest.
> - * If the guest is leaving, reset the event timestamp.
> - */
> - if (__this_cpu_read(perf_in_guest))
> - event->tstamp = event->ctx->timeguest;
> - else
> - event->tstamp = event->ctx->time;
> - return;
> - }
> -
> - now = perf_event_time(event);
> + u64 now = perf_event_time(event);
> __perf_update_times(event, now, &event->total_time_enabled,
> &event->total_time_running);
> -
> event->tstamp = now;
> }
>
> @@ -780,19 +763,33 @@ static inline int is_cgroup_event(struct
> static inline u64 perf_cgroup_event_time(struct perf_event *event)
> {
> struct perf_cgroup_info *t;
> + u64 time;
>
> t = per_cpu_ptr(event->cgrp->info, event->cpu);
> - return t->time;
> + time = t->time;
> + if (event->attr.exclude_guest)
> + time -= t->guest_time;
> + return time;
> }
>
> static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> {
> struct perf_cgroup_info *t;
> + u64 time, guest_time;
>
> t = per_cpu_ptr(event->cgrp->info, event->cpu);
> - if (!__load_acquire(&t->active))
> - return t->time;
> - now += READ_ONCE(t->timeoffset);
> + if (!__load_acquire(&t->active)) {
> + time = t->time;
> + if (event->attr.exclude_guest)
> + time -= t->guest_time;
> + return time;
> + }
> +
> + time = now + READ_ONCE(t->timeoffset);
> + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
> + guest_time = now + READ_ONCE(t->guest_offset);
> + time -= guest_time;
> + }
> return now;
> }
>
> @@ -807,6 +804,17 @@ static inline void __update_cgrp_time(st
> WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
> }
>
> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
> +{
> + if (adv)
> + info->guest_time += now - info->guest_timestamp;
> + info->guest_timestamp = now;
> + /*
> + * see update_context_time()
> + */
> + WRITE_ONCE(info->guest_timeoffset, info->guest_time - info->guest_timestamp);
> +}
> +
> static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
> {
> struct perf_cgroup *cgrp = cpuctx->cgrp;
> @@ -821,6 +829,8 @@ static inline void update_cgrp_time_from
> info = this_cpu_ptr(cgrp->info);
>
> __update_cgrp_time(info, now, true);
> + if (__this_cpu_read(perf_in_guest))
> + __update_cgrp_guest_time(info, now, true);
> if (final)
> __store_release(&info->active, 0);
> }
> @@ -1501,14 +1511,39 @@ static void __update_context_time(struct
> WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
> }
>
> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
> +{
> + u64 now = ctx->timestamp; /* must be called after __update_context_time(); */
> +
> + lockdep_assert_held(&ctx->lock);
> +
> + if (adv)
> + ctx->guest_time += now - ctx->guest_timestamp;
> + ctx->guest_timestamp = now;
> +
> + /*
> + * The above: time' = time + (now - timestamp), can be re-arranged
> + * into: time` = now + (time - timestamp), which gives a single value
> + * offset to compute future time without locks on.
> + *
> + * See perf_event_time_now(), which can be used from NMI context where
> + * it's (obviously) not possible to acquire ctx->lock in order to read
> + * both the above values in a consistent manner.
> + */
> + WRITE_ONCE(ctx->guest_timeoffset, ctx->guest_time - ctx->guest_timestamp);
> +}
> +
> static void update_context_time(struct perf_event_context *ctx)
> {
> __update_context_time(ctx, true);
> + if (__this_cpu_read(perf_in_guest))
> + __update_context_guest_time(ctx, true);
> }
>
> static u64 perf_event_time(struct perf_event *event)
> {
> struct perf_event_context *ctx = event->ctx;
> + u64 time;
>
> if (unlikely(!ctx))
> return 0;
> @@ -1516,12 +1551,17 @@ static u64 perf_event_time(struct perf_e
> if (is_cgroup_event(event))
> return perf_cgroup_event_time(event);
>
> - return ctx->time;
> + time = ctx->time;
> + if (event->attr.exclude_guest)
> + time -= ctx->guest_time;
> +
> + return time;
> }
>
> static u64 perf_event_time_now(struct perf_event *event, u64 now)
> {
> struct perf_event_context *ctx = event->ctx;
> + u64 time, guest_time;
>
> if (unlikely(!ctx))
> return 0;
> @@ -1529,11 +1569,19 @@ static u64 perf_event_time_now(struct pe
> if (is_cgroup_event(event))
> return perf_cgroup_event_time_now(event, now);
>
> - if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
> - return ctx->time;
> + if (!(__load_acquire(&ctx->is_active) & EVENT_TIME)) {
> + time = ctx->time;
> + if (event->attr.exclude_guest)
> + time -= ctx->guest_time;
> + return time;
> + }
>
> - now += READ_ONCE(ctx->timeoffset);
> - return now;
> + time = now + READ_ONCE(ctx->timeoffset);
> + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
> + guest_time = now + READ_ONCE(ctx->guest_timeoffset);
> + time -= guest_time;
> + }
> + return time;
> }
>
> static enum event_type_t get_event_type(struct perf_event *event)
> @@ -3340,9 +3388,14 @@ ctx_sched_out(struct perf_event_context
> * would only update time for the pinned events.
> */
> if (is_active & EVENT_TIME) {
> + bool stop;
> +
> + stop = !((ctx->is_active & event_type) & EVENT_ALL) &&
> + ctx == &cpuctx->ctx;
> +
> /* update (and stop) ctx time */
> update_context_time(ctx);
> - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> + update_cgrp_time_from_cpuctx(cpuctx, stop);
> /*
> * CPU-release for the below ->is_active store,
> * see __load_acquire() in perf_event_time_now()
> @@ -3366,8 +3419,12 @@ ctx_sched_out(struct perf_event_context
> * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
> */
> is_active = EVENT_ALL;
> - } else
> + __update_context_guest_time(ctx, false);
> + perf_cgroup_set_guest_timestamp(cpuctx);
> + barrier();
> + } else {
> is_active ^= ctx->is_active; /* changed bits */
> + }
>
> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> @@ -3866,10 +3923,15 @@ static inline void group_update_userpage
> event_update_userpage(event);
> }
>
> +struct merge_sched_data {
> + int can_add_hw;
> + enum event_type_t event_type;
> +};
> +
> static int merge_sched_in(struct perf_event *event, void *data)
> {
> struct perf_event_context *ctx = event->ctx;
> - int *can_add_hw = data;
> + struct merge_sched_data *msd = data;
>
> if (event->state <= PERF_EVENT_STATE_OFF)
> return 0;
> @@ -3881,18 +3943,18 @@ static int merge_sched_in(struct perf_ev
> * Don't schedule in any exclude_guest events of PMU with
> * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
> */
> - if (__this_cpu_read(perf_in_guest) &&
> - event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
> - event->attr.exclude_guest)
> + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest) &&
> + (event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
> + !(msd->event_type & EVENT_GUEST))
> return 0;
>
> - if (group_can_go_on(event, *can_add_hw)) {
> + if (group_can_go_on(event, msd->can_add_hw)) {
> if (!group_sched_in(event, ctx))
> list_add_tail(&event->active_list, get_event_list(event));
> }
>
> if (event->state == PERF_EVENT_STATE_INACTIVE) {
> - *can_add_hw = 0;
> + msd->can_add_hw = 0;
> if (event->attr.pinned) {
> perf_cgroup_event_disable(event, ctx);
> perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> @@ -3911,11 +3973,15 @@ static int merge_sched_in(struct perf_ev
>
> static void pmu_groups_sched_in(struct perf_event_context *ctx,
> struct perf_event_groups *groups,
> - struct pmu *pmu)
> + struct pmu *pmu,
> + enum event_type_t event_type)
> {
> - int can_add_hw = 1;
> + struct merge_sched_data msd = {
> + .can_add_hw = 1,
> + .event_type = event_type,
> + };
> visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
> - merge_sched_in, &can_add_hw);
> + merge_sched_in, &msd);
> }
>
> static void ctx_groups_sched_in(struct perf_event_context *ctx,
> @@ -3927,14 +3993,14 @@ static void ctx_groups_sched_in(struct p
> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> continue;
> - pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
> + pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
> }
> }
>
> static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
> struct pmu *pmu)
> {
> - pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
> + pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
> }
>
> static void
> @@ -3949,6 +4015,8 @@ ctx_sched_in(struct perf_event_context *
> return;
>
> if (!(is_active & EVENT_TIME)) {
> + /* EVENT_TIME should be active while the guest runs */
> + WARN_ON_ONCE(event_type & EVENT_GUEST);
> /* start ctx time */
> __update_context_time(ctx, false);
> perf_cgroup_set_timestamp(cpuctx);
> @@ -3979,8 +4047,11 @@ ctx_sched_in(struct perf_event_context *
> * the exclude_guest events.
> */
> update_context_time(ctx);
> - } else
> + update_cgrp_time_from_cpuctx(cpuctx, false);
> + barrier();
> + } else {
> is_active ^= ctx->is_active; /* changed bits */
> + }
>
> /*
> * First go through the list and put on any pinned groups
> @@ -5832,25 +5903,20 @@ void perf_guest_enter(void)
>
> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>
> - if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
> - perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> - return;
> - }
> + if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
> + goto unlock;
>
> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
> - /* Set the guest start time */
> - cpuctx->ctx.timeguest = cpuctx->ctx.time;
> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> if (cpuctx->task_ctx) {
> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
> - cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> }
>
> __this_cpu_write(perf_in_guest, true);
> -
> +unlock:
> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
>
> @@ -5862,24 +5928,21 @@ void perf_guest_exit(void)
>
> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>
> - if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
> - perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> - return;
> - }
> -
> - __this_cpu_write(perf_in_guest, false);
> + if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
> + goto unlock;
>
> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
> ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
> - cpuctx->ctx.timeguest = 0;
> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
> if (cpuctx->task_ctx) {
> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
> ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
> - cpuctx->task_ctx->timeguest = 0;
> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
> }
>
> + __this_cpu_write(perf_in_guest, false);
> +
> +unlock:
> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-13 13:37 ` Liang, Kan
@ 2024-06-13 18:04 ` Liang, Kan
2024-06-17 7:51 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Liang, Kan @ 2024-06-13 18:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 2024-06-13 9:37 a.m., Liang, Kan wrote:
>> ---
>>
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -947,7 +947,9 @@ struct perf_event_context {
>> u64 time;
>> u64 timestamp;
>> u64 timeoffset;
>> - u64 timeguest;
>> + u64 guest_time;
>> + u64 guest_timestamp;
>> + u64 guest_timeoffset;
>>
>> /*
>> * These fields let us detect when two contexts have both
>> @@ -1043,6 +1045,9 @@ struct perf_cgroup_info {
>> u64 time;
>> u64 timestamp;
>> u64 timeoffset;
>> + u64 guest_time;
>> + u64 guest_timestamp;
>> + u64 guest_timeoffset;
>> int active;
>> };
>>
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -638,26 +638,9 @@ __perf_update_times(struct perf_event *e
>>
>> static void perf_event_update_time(struct perf_event *event)
>> {
>> - u64 now;
>> -
>> - /* Never count the time of an active guest into an exclude_guest event. */
>> - if (event->ctx->timeguest &&
>> - event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) {
>> - /*
>> - * If a guest is running, use the timestamp while entering the guest.
>> - * If the guest is leaving, reset the event timestamp.
>> - */
>> - if (__this_cpu_read(perf_in_guest))
>> - event->tstamp = event->ctx->timeguest;
>> - else
>> - event->tstamp = event->ctx->time;
>> - return;
>> - }
>> -
>> - now = perf_event_time(event);
>> + u64 now = perf_event_time(event);
>> __perf_update_times(event, now, &event->total_time_enabled,
>> &event->total_time_running);
>> -
>> event->tstamp = now;
>> }
>>
>> @@ -780,19 +763,33 @@ static inline int is_cgroup_event(struct
>> static inline u64 perf_cgroup_event_time(struct perf_event *event)
>> {
>> struct perf_cgroup_info *t;
>> + u64 time;
>>
>> t = per_cpu_ptr(event->cgrp->info, event->cpu);
>> - return t->time;
>> + time = t->time;
>> + if (event->attr.exclude_guest)
>> + time -= t->guest_time;
>> + return time;
>> }
>>
>> static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
>> {
>> struct perf_cgroup_info *t;
>> + u64 time, guest_time;
>>
>> t = per_cpu_ptr(event->cgrp->info, event->cpu);
>> - if (!__load_acquire(&t->active))
>> - return t->time;
>> - now += READ_ONCE(t->timeoffset);
>> + if (!__load_acquire(&t->active)) {
>> + time = t->time;
>> + if (event->attr.exclude_guest)
>> + time -= t->guest_time;
>> + return time;
>> + }
>> +
>> + time = now + READ_ONCE(t->timeoffset);
>> + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
>> + guest_time = now + READ_ONCE(t->guest_timeoffset);
>> + time -= guest_time;
>> + }
>> - return now;
>> + return time;
>> }
>>
>> @@ -807,6 +804,17 @@ static inline void __update_cgrp_time(st
>> WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
>> }
>>
>> +static inline void __update_cgrp_guest_time(struct perf_cgroup_info *info, u64 now, bool adv)
>> +{
>> + if (adv)
>> + info->guest_time += now - info->guest_timestamp;
>> + info->guest_timestamp = now;
>> + /*
>> + * see update_context_time()
>> + */
>> + WRITE_ONCE(info->guest_timeoffset, info->guest_time - info->guest_timestamp);
>> +}
>> +
>> static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
>> {
>> struct perf_cgroup *cgrp = cpuctx->cgrp;
>> @@ -821,6 +829,8 @@ static inline void update_cgrp_time_from
>> info = this_cpu_ptr(cgrp->info);
>>
>> __update_cgrp_time(info, now, true);
>> + if (__this_cpu_read(perf_in_guest))
>> + __update_cgrp_guest_time(info, now, true);
>> if (final)
>> __store_release(&info->active, 0);
>> }
>> @@ -1501,14 +1511,39 @@ static void __update_context_time(struct
>> WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
>> }
>>
>> +static void __update_context_guest_time(struct perf_event_context *ctx, bool adv)
>> +{
>> + u64 now = ctx->timestamp; /* must be called after __update_context_time(); */
>> +
>> + lockdep_assert_held(&ctx->lock);
>> +
>> + if (adv)
>> + ctx->guest_time += now - ctx->guest_timestamp;
>> + ctx->guest_timestamp = now;
>> +
>> + /*
>> + * The above: time' = time + (now - timestamp), can be re-arranged
>> + * into: time` = now + (time - timestamp), which gives a single value
>> + * offset to compute future time without locks on.
>> + *
>> + * See perf_event_time_now(), which can be used from NMI context where
>> + * it's (obviously) not possible to acquire ctx->lock in order to read
>> + * both the above values in a consistent manner.
>> + */
>> + WRITE_ONCE(ctx->guest_timeoffset, ctx->guest_time - ctx->guest_timestamp);
>> +}
>> +
>> static void update_context_time(struct perf_event_context *ctx)
>> {
>> __update_context_time(ctx, true);
>> + if (__this_cpu_read(perf_in_guest))
>> + __update_context_guest_time(ctx, true);
>> }
>>
>> static u64 perf_event_time(struct perf_event *event)
>> {
>> struct perf_event_context *ctx = event->ctx;
>> + u64 time;
>>
>> if (unlikely(!ctx))
>> return 0;
>> @@ -1516,12 +1551,17 @@ static u64 perf_event_time(struct perf_e
>> if (is_cgroup_event(event))
>> return perf_cgroup_event_time(event);
>>
>> - return ctx->time;
>> + time = ctx->time;
>> + if (event->attr.exclude_guest)
>> + time -= ctx->guest_time;
>> +
>> + return time;
>> }
>>
>> static u64 perf_event_time_now(struct perf_event *event, u64 now)
>> {
>> struct perf_event_context *ctx = event->ctx;
>> + u64 time, guest_time;
>>
>> if (unlikely(!ctx))
>> return 0;
>> @@ -1529,11 +1569,19 @@ static u64 perf_event_time_now(struct pe
>> if (is_cgroup_event(event))
>> return perf_cgroup_event_time_now(event, now);
>>
>> - if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
>> - return ctx->time;
>> + if (!(__load_acquire(&ctx->is_active) & EVENT_TIME)) {
>> + time = ctx->time;
>> + if (event->attr.exclude_guest)
>> + time -= ctx->guest_time;
>> + return time;
>> + }
>>
>> - now += READ_ONCE(ctx->timeoffset);
>> - return now;
>> + time = now + READ_ONCE(ctx->timeoffset);
>> + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest)) {
>> + guest_time = now + READ_ONCE(ctx->guest_timeoffset);
>> + time -= guest_time;
>> + }
>> + return time;
>> }
>>
>> static enum event_type_t get_event_type(struct perf_event *event)
>> @@ -3340,9 +3388,14 @@ ctx_sched_out(struct perf_event_context
>> * would only update time for the pinned events.
>> */
>> if (is_active & EVENT_TIME) {
>> + bool stop;
>> +
>> + stop = !((ctx->is_active & event_type) & EVENT_ALL) &&
>> + ctx == &cpuctx->ctx;
>> +
>> /* update (and stop) ctx time */
>> update_context_time(ctx);
>> - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
>> + update_cgrp_time_from_cpuctx(cpuctx, stop);
For event_type == EVENT_GUEST, "stop" should always be the same as
"ctx == &cpuctx->ctx", because ctx->is_active never has the EVENT_GUEST
bit set.
Why is "stop" introduced?
>> /*
>> * CPU-release for the below ->is_active store,
>> * see __load_acquire() in perf_event_time_now()
>> @@ -3366,8 +3419,12 @@ ctx_sched_out(struct perf_event_context
>> * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> */
>> is_active = EVENT_ALL;
>> - } else
>> + __update_context_guest_time(ctx, false);
>> + perf_cgroup_set_guest_timestamp(cpuctx);
>> + barrier();
>> + } else {
>> is_active ^= ctx->is_active; /* changed bits */
>> + }
>>
>> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>> @@ -3866,10 +3923,15 @@ static inline void group_update_userpage
>> event_update_userpage(event);
>> }
>>
>> +struct merge_sched_data {
>> + int can_add_hw;
>> + enum event_type_t event_type;
>> +};
>> +
>> static int merge_sched_in(struct perf_event *event, void *data)
>> {
>> struct perf_event_context *ctx = event->ctx;
>> - int *can_add_hw = data;
>> + struct merge_sched_data *msd = data;
>>
>> if (event->state <= PERF_EVENT_STATE_OFF)
>> return 0;
>> @@ -3881,18 +3943,18 @@ static int merge_sched_in(struct perf_ev
>> * Don't schedule in any exclude_guest events of PMU with
>> * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
>> */
>> - if (__this_cpu_read(perf_in_guest) &&
>> - event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
>> - event->attr.exclude_guest)
>> + if (event->attr.exclude_guest && __this_cpu_read(perf_in_guest) &&
>> + (event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) &&
>> + !(msd->event_type & EVENT_GUEST))
>> return 0;
>>
>> - if (group_can_go_on(event, *can_add_hw)) {
>> + if (group_can_go_on(event, msd->can_add_hw)) {
>> if (!group_sched_in(event, ctx))
>> list_add_tail(&event->active_list, get_event_list(event));
>> }
>>
>> if (event->state == PERF_EVENT_STATE_INACTIVE) {
>> - *can_add_hw = 0;
>> + msd->can_add_hw = 0;
>> if (event->attr.pinned) {
>> perf_cgroup_event_disable(event, ctx);
>> perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>> @@ -3911,11 +3973,15 @@ static int merge_sched_in(struct perf_ev
>>
>> static void pmu_groups_sched_in(struct perf_event_context *ctx,
>> struct perf_event_groups *groups,
>> - struct pmu *pmu)
>> + struct pmu *pmu,
>> + enum event_type_t event_type)
>> {
>> - int can_add_hw = 1;
>> + struct merge_sched_data msd = {
>> + .can_add_hw = 1,
>> + .event_type = event_type,
>> + };
>> visit_groups_merge(ctx, groups, smp_processor_id(), pmu,
>> - merge_sched_in, &can_add_hw);
>> + merge_sched_in, &msd);
>> }
>>
>> static void ctx_groups_sched_in(struct perf_event_context *ctx,
>> @@ -3927,14 +3993,14 @@ static void ctx_groups_sched_in(struct p
>> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> if (perf_skip_pmu_ctx(pmu_ctx, event_type))
>> continue;
>> - pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu);
>> + pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu, event_type);
>> }
>> }
>>
>> static void __pmu_ctx_sched_in(struct perf_event_context *ctx,
>> struct pmu *pmu)
>> {
>> - pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
>> + pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu, 0);
>> }
>>
>> static void
>> @@ -3949,6 +4015,8 @@ ctx_sched_in(struct perf_event_context *
>> return;
>>
>> if (!(is_active & EVENT_TIME)) {
>> + /* EVENT_TIME should be active while the guest runs */
>> + WARN_ON_ONCE(event_type & EVENT_GUEST);
>> /* start ctx time */
>> __update_context_time(ctx, false);
>> perf_cgroup_set_timestamp(cpuctx);
>> @@ -3979,8 +4047,11 @@ ctx_sched_in(struct perf_event_context *
>> * the exclude_guest events.
>> */
>> update_context_time(ctx);
>> - } else
>> + update_cgrp_time_from_cpuctx(cpuctx, false);
In the above ctx_sched_out(), the cgrp_time is stopped and the cgrp has
been set to inactive.
I think we need a perf_cgroup_set_timestamp(cpuctx) here to restart the
cgrp_time, right?
Also, I think the cgrp_time is different from the normal ctx->time. When
a guest is running, there should be no cgroup events running, so it's OK
to disable the cgrp_time. If so, I don't think we need to track the
guest_time for the cgrp.
Thanks,
Kan
>> + barrier();
>> + } else {
>> is_active ^= ctx->is_active; /* changed bits */
>> + }
>>
>> /*
>> * First go through the list and put on any pinned groups
>> @@ -5832,25 +5903,20 @@ void perf_guest_enter(void)
>>
>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>
>> - if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
>> - perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> - return;
>> - }
>> + if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest)))
>> + goto unlock;
>>
>> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
>> - /* Set the guest start time */
>> - cpuctx->ctx.timeguest = cpuctx->ctx.time;
>> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> if (cpuctx->task_ctx) {
>> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
>> - cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
>> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>> }
>>
>> __this_cpu_write(perf_in_guest, true);
>> -
>> +unlock:
>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> }
>>
>> @@ -5862,24 +5928,21 @@ void perf_guest_exit(void)
>>
>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>
>> - if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
>> - perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> - return;
>> - }
>> -
>> - __this_cpu_write(perf_in_guest, false);
>> + if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest)))
>> + goto unlock;
>>
>> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>> - cpuctx->ctx.timeguest = 0;
>> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>> if (cpuctx->task_ctx) {
>> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
>> - cpuctx->task_ctx->timeguest = 0;
>> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>> }
>>
>> + __this_cpu_write(perf_in_guest, false);
>> +
>> +unlock:
>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> }
>>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-13 18:04 ` Liang, Kan
@ 2024-06-17 7:51 ` Peter Zijlstra
2024-06-17 13:34 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-06-17 7:51 UTC (permalink / raw)
To: Liang, Kan
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Thu, Jun 13, 2024 at 02:04:36PM -0400, Liang, Kan wrote:
> >> static enum event_type_t get_event_type(struct perf_event *event)
> >> @@ -3340,9 +3388,14 @@ ctx_sched_out(struct perf_event_context
> >> * would only update time for the pinned events.
> >> */
> >> if (is_active & EVENT_TIME) {
> >> + bool stop;
> >> +
> >> + stop = !((ctx->is_active & event_type) & EVENT_ALL) &&
> >> + ctx == &cpuctx->ctx;
> >> +
> >> /* update (and stop) ctx time */
> >> update_context_time(ctx);
> >> - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> >> + update_cgrp_time_from_cpuctx(cpuctx, stop);
>
> For the event_type == EVENT_GUEST, the "stop" should always be the same
> as "ctx == &cpuctx->ctx". Because the ctx->is_active never set the
> EVENT_GUEST bit.
> Why the stop is introduced?
Because the ctx_sched_out() for vPMU should not stop time, only the
'normal' sched-out should stop time.
> >> @@ -3949,6 +4015,8 @@ ctx_sched_in(struct perf_event_context *
> >> return;
> >>
> >> if (!(is_active & EVENT_TIME)) {
> >> + /* EVENT_TIME should be active while the guest runs */
> >> + WARN_ON_ONCE(event_type & EVENT_GUEST);
> >> /* start ctx time */
> >> __update_context_time(ctx, false);
> >> perf_cgroup_set_timestamp(cpuctx);
> >> @@ -3979,8 +4047,11 @@ ctx_sched_in(struct perf_event_context *
> >> * the exclude_guest events.
> >> */
> >> update_context_time(ctx);
> >> - } else
> >> + update_cgrp_time_from_cpuctx(cpuctx, false);
>
>
> In the above ctx_sched_out(), the cgrp_time is stopped and the cgrp has
> been set to inactive.
> I think we need a perf_cgroup_set_timestamp(cpuctx) here to restart the
> cgrp_time, Right?
So the idea was to not stop time when we schedule out for the vPMU, as
per the above.
> Also, I think the cgrp_time is different from the normal ctx->time. When
> a guest is running, there must be no cgroup. It's OK to disable the
> cgrp_time. If so, I don't think we need to track the guest_time for the
> cgrp.
Uh, the vCPU thread is (or can be) part of a cgroup, and different guests
can be part of different cgroups. The CPU-wide 'guest' time is all time
spent in guests, but the cgroup view of things might differ, depending on
how the guests are arranged in cgroups, no?
As such, we need per cgroup guest tracking.
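To make that concrete with made-up numbers: if the vCPU threads of guest A
live in cgroup A and those of guest B in cgroup B, and on a given CPU guest
A ran for 3ms while guest B ran for 5ms, then the CPU-wide guest time is
8ms, but an exclude_guest cgroup event attached to cgroup A should only
have 3ms subtracted, and one attached to cgroup B only 5ms. That is what
the per-cgroup guest_time/guest_timestamp/guest_timeoffset fields above
provide.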
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-17 7:51 ` Peter Zijlstra
@ 2024-06-17 13:34 ` Liang, Kan
2024-06-17 15:00 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Liang, Kan @ 2024-06-17 13:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 2024-06-17 3:51 a.m., Peter Zijlstra wrote:
> On Thu, Jun 13, 2024 at 02:04:36PM -0400, Liang, Kan wrote:
>>>> static enum event_type_t get_event_type(struct perf_event *event)
>>>> @@ -3340,9 +3388,14 @@ ctx_sched_out(struct perf_event_context
>>>> * would only update time for the pinned events.
>>>> */
>>>> if (is_active & EVENT_TIME) {
>>>> + bool stop;
>>>> +
>>>> + stop = !((ctx->is_active & event_type) & EVENT_ALL) &&
>>>> + ctx == &cpuctx->ctx;
>>>> +
>>>> /* update (and stop) ctx time */
>>>> update_context_time(ctx);
>>>> - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
>>>> + update_cgrp_time_from_cpuctx(cpuctx, stop);
>>
>> For the event_type == EVENT_GUEST, the "stop" should always be the same
>> as "ctx == &cpuctx->ctx". Because the ctx->is_active never set the
>> EVENT_GUEST bit.
>> Why the stop is introduced?
>
> Because the ctx_sched_out() for vPMU should not stop time,
But the implementation seems to stop the time.
The ctx->is_active should be (EVENT_ALL | EVENT_TIME) in most cases.
When a vPMU is scheduled in (which invokes ctx_sched_out()), the
event_type should only be EVENT_GUEST.
!((ctx->is_active & event_type) & EVENT_ALL) is then TRUE.
For a CPU context, ctx == &cpuctx->ctx is TRUE as well.
So update_cgrp_time_from_cpuctx(cpuctx, TRUE) stops the time by
deactivating the cgroup, __store_release(&info->active, 0).
If a user tries to read the cgroup events while a guest is running,
update_cgrp_time_from_event() doesn't update the cgrp time, so both the
time and the counter appear stopped.
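Spelling out the evaluation being described, using the hunk quoted above
(a step-by-step restatement, not new code):

	ctx->is_active              = EVENT_ALL | EVENT_TIME
	event_type                  = EVENT_GUEST
	ctx->is_active & event_type = 0      /* is_active never has EVENT_GUEST */
	!((ctx->is_active & event_type) & EVENT_ALL)  -> true
	ctx == &cpuctx->ctx                           -> true   /* CPU context */
	stop                                          -> true

so update_cgrp_time_from_cpuctx(cpuctx, true) is called and
__store_release(&info->active, 0) stops the cgroup clock, even though the
intent was to keep time running across guest entry.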
> only the
> 'normal' sched-out should stop time.
If the guest is the only case for which we want to keep the time running,
I think we can use a more straightforward check, as below.
stop = !(event_type & EVENT_GUEST) && ctx == &cpuctx->ctx;
>
>
>>>> @@ -3949,6 +4015,8 @@ ctx_sched_in(struct perf_event_context *
>>>> return;
>>>>
>>>> if (!(is_active & EVENT_TIME)) {
>>>> + /* EVENT_TIME should be active while the guest runs */
>>>> + WARN_ON_ONCE(event_type & EVENT_GUEST);
>>>> /* start ctx time */
>>>> __update_context_time(ctx, false);
>>>> perf_cgroup_set_timestamp(cpuctx);
>>>> @@ -3979,8 +4047,11 @@ ctx_sched_in(struct perf_event_context *
>>>> * the exclude_guest events.
>>>> */
>>>> update_context_time(ctx);
>>>> - } else
>>>> + update_cgrp_time_from_cpuctx(cpuctx, false);
>>
>>
>> In the above ctx_sched_out(), the cgrp_time is stopped and the cgrp has
>> been set to inactive.
>> I think we need a perf_cgroup_set_timestamp(cpuctx) here to restart the
>> cgrp_time, Right?
>
> So the idea was to not stop time when we schedule out for the vPMU, as
> per the above.
>
>> Also, I think the cgrp_time is different from the normal ctx->time. When
>> a guest is running, there must be no cgroup. It's OK to disable the
>> cgrp_time. If so, I don't think we need to track the guest_time for the
>> cgrp.
>
> Uh, the vCPU thread is/can-be part of a cgroup, and different guests
> part of different cgroups. The CPU wide 'guest' time is all time spend
> in guets, but the cgroup view of things might differ, depending on how
> the guets are arranged in cgroups, no?
>
> As such, we need per cgroup guest tracking.
Got it.
Thanks,
Kan
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-17 13:34 ` Liang, Kan
@ 2024-06-17 15:00 ` Peter Zijlstra
2024-06-17 15:45 ` Liang, Kan
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-06-17 15:00 UTC (permalink / raw)
To: Liang, Kan
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Mon, Jun 17, 2024 at 09:34:15AM -0400, Liang, Kan wrote:
>
>
> On 2024-06-17 3:51 a.m., Peter Zijlstra wrote:
> > On Thu, Jun 13, 2024 at 02:04:36PM -0400, Liang, Kan wrote:
> >>>> static enum event_type_t get_event_type(struct perf_event *event)
> >>>> @@ -3340,9 +3388,14 @@ ctx_sched_out(struct perf_event_context
> >>>> * would only update time for the pinned events.
> >>>> */
> >>>> if (is_active & EVENT_TIME) {
> >>>> + bool stop;
> >>>> +
> >>>> + stop = !((ctx->is_active & event_type) & EVENT_ALL) &&
> >>>> + ctx == &cpuctx->ctx;
> >>>> +
> >>>> /* update (and stop) ctx time */
> >>>> update_context_time(ctx);
> >>>> - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> >>>> + update_cgrp_time_from_cpuctx(cpuctx, stop);
> >>
> >> For the event_type == EVENT_GUEST, the "stop" should always be the same
> >> as "ctx == &cpuctx->ctx". Because the ctx->is_active never set the
> >> EVENT_GUEST bit.
> >> Why the stop is introduced?
> >
> > Because the ctx_sched_out() for vPMU should not stop time,
>
> But the implementation seems stop the time.
>
> The ctx->is_active should be (EVENT_ALL | EVENT_TIME) for most of cases.
>
> When a vPMU is scheduling in (invoke ctx_sched_out()), the event_type
> should only be EVENT_GUEST.
>
> !((ctx->is_active & event_type) & EVENT_ALL) should be TRUE.
Hmm.. yeah, I think I might've gotten that wrong.
> > only the
> > 'normal' sched-out should stop time.
>
> If the guest is the only case which we want to keep the time for, I
> think we may use a straightforward check as below.
>
> stop = !(event_type & EVENT_GUEST) && ctx == &cpuctx->ctx;
So I think I was trying to get stop to be true when there weren't in fact
any events on, that is, when we got here without any EVENT_ALL bits set,
but perhaps that case isn't relevant.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 07/54] perf: Add generic exclude_guest support
2024-06-17 15:00 ` Peter Zijlstra
@ 2024-06-17 15:45 ` Liang, Kan
0 siblings, 0 replies; 116+ messages in thread
From: Liang, Kan @ 2024-06-17 15:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 2024-06-17 11:00 a.m., Peter Zijlstra wrote:
>>> only the
>>> 'normal' sched-out should stop time.
>> If the guest is the only case which we want to keep the time for, I
>> think we may use a straightforward check as below.
>>
>> stop = !(event_type & EVENT_GUEST) && ctx == &cpuctx->ctx;
> So I think I was trying to get stop true when there weren't in fact
> events on, that is when we got in without EVENT_ALL bits, but perhaps
> that case isn't relevant.
It should be irrelevant. Here I think we just need to make sure that the
guest is not the reason for stopping the time if the time is active. For
the other cases, nothing changes.
The __this_cpu_read(perf_in_guest) check guarantees that the guest time
and the host time stay in sync whenever any update occurs in guest mode.
Thanks,
Kan
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 08/54] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (6 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 07/54] perf: Add generic exclude_guest support Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
` (46 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Kan Liang <kan.liang@linux.intel.com>
Apply PERF_PMU_CAP_PASSTHROUGH_VPMU to the Intel core PMU. It only
indicates that the perf side of the core PMU is ready to support the
passthrough vPMU. Besides this capability, the hypervisor still needs to
check the PMU version and other capabilities to decide whether to enable
the passthrough vPMU.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
arch/x86/events/intel/core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 768d1414897f..4d8f907a9416 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4743,6 +4743,8 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
else
pmu->pmu.capabilities &= ~PERF_PMU_CAP_AUX_OUTPUT;
+ pmu->pmu.capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+
intel_pmu_check_event_constraints(pmu->event_constraints,
pmu->num_counters,
pmu->num_counters_fixed,
@@ -6242,6 +6244,9 @@ __init int intel_pmu_init(void)
pr_cont(" AnyThread deprecated, ");
}
+ /* The perf side of core PMU is ready to support the passthrough vPMU. */
+ x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+
/*
* Install the hw-cache-events table:
*/
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (7 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 08/54] perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-07 9:12 ` Peter Zijlstra
2024-05-06 5:29 ` [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function Mingwei Zhang
` (45 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Create a new vector in the host IDT for KVM guest PMI handling within the
passthrough PMU. In addition, add a function to allow KVM to register the
new vector's handler.
This is preparation work to let the passthrough PMU handle KVM guest PMIs
without interference from the host PMU's PMI handler.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/hardirq.h | 1 +
arch/x86/include/asm/idtentry.h | 1 +
arch/x86/include/asm/irq.h | 1 +
arch/x86/include/asm/irq_vectors.h | 5 +++-
arch/x86/kernel/idt.c | 1 +
arch/x86/kernel/irq.c | 29 ++++++++++++++++++++++++
tools/arch/x86/include/asm/irq_vectors.h | 3 ++-
7 files changed, 39 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index fbc7722b87d1..250e6db1cb5f 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -19,6 +19,7 @@ typedef struct {
unsigned int kvm_posted_intr_ipis;
unsigned int kvm_posted_intr_wakeup_ipis;
unsigned int kvm_posted_intr_nested_ipis;
+ unsigned int kvm_guest_pmis;
#endif
unsigned int x86_platform_ipis; /* arch dependent */
unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 749c7411d2f1..4090aea47b76 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -745,6 +745,7 @@ DECLARE_IDTENTRY_SYSVEC(IRQ_WORK_VECTOR, sysvec_irq_work);
DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_VECTOR, sysvec_kvm_posted_intr_ipi);
DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR, sysvec_kvm_posted_intr_wakeup_ipi);
DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR, sysvec_kvm_posted_intr_nested_ipi);
+DECLARE_IDTENTRY_SYSVEC(KVM_GUEST_PMI_VECTOR, sysvec_kvm_guest_pmi_handler);
#else
# define fred_sysvec_kvm_posted_intr_ipi NULL
# define fred_sysvec_kvm_posted_intr_wakeup_ipi NULL
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 194dfff84cb1..2483f6ef5d4e 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -31,6 +31,7 @@ extern void fixup_irqs(void);
#if IS_ENABLED(CONFIG_KVM)
extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
+void kvm_set_guest_pmi_handler(void (*handler)(void));
#endif
extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index d18bfb238f66..e5f741bb1557 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -77,7 +77,10 @@
*/
#define IRQ_WORK_VECTOR 0xf6
-/* 0xf5 - unused, was UV_BAU_MESSAGE */
+#if IS_ENABLED(CONFIG_KVM)
+#define KVM_GUEST_PMI_VECTOR 0xf5
+#endif
+
#define DEFERRED_ERROR_VECTOR 0xf4
/* Vector on which hypervisor callbacks will be delivered */
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index fc37c8d83daf..c62368a3ba04 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -157,6 +157,7 @@ static const __initconst struct idt_data apic_idts[] = {
INTG(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
+ INTG(KVM_GUEST_PMI_VECTOR, asm_sysvec_kvm_guest_pmi_handler),
# endif
# ifdef CONFIG_IRQ_WORK
INTG(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 35fde0107901..22c10e5c50af 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
seq_printf(p, "%10u ",
irq_stats(j)->kvm_posted_intr_wakeup_ipis);
seq_puts(p, " Posted-interrupt wakeup event\n");
+
+ seq_printf(p, "%*s: ", prec, "VPMU");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->kvm_guest_pmis);
+ seq_puts(p, " KVM GUEST PMI\n");
+
#endif
return 0;
}
@@ -293,6 +300,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
#if IS_ENABLED(CONFIG_KVM)
static void dummy_handler(void) {}
static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
+static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
{
@@ -305,6 +313,17 @@ void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
}
EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
+void kvm_set_guest_pmi_handler(void (*handler)(void))
+{
+ if (handler) {
+ kvm_guest_pmi_handler = handler;
+ } else {
+ kvm_guest_pmi_handler = dummy_handler;
+ synchronize_rcu();
+ }
+}
+EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);
+
/*
* Handler for POSTED_INTERRUPT_VECTOR.
*/
@@ -332,6 +351,16 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
apic_eoi();
inc_irq_stat(kvm_posted_intr_nested_ipis);
}
+
+/*
+ * Handler for KVM_GUEST_PMI_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_guest_pmi_handler)
+{
+ apic_eoi();
+ inc_irq_stat(kvm_guest_pmis);
+ kvm_guest_pmi_handler();
+}
#endif
diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
index 3f73ac3ed3a0..6df2a986805d 100644
--- a/tools/arch/x86/include/asm/irq_vectors.h
+++ b/tools/arch/x86/include/asm/irq_vectors.h
@@ -83,8 +83,9 @@
/* Vector on which hypervisor callbacks will be delivered */
#define HYPERVISOR_CALLBACK_VECTOR 0xf3
-/* Vector for KVM to deliver posted interrupt IPI */
#if IS_ENABLED(CONFIG_KVM)
+#define KVM_GUEST_PMI_VECTOR 0xf5
+/* Vector for KVM to deliver posted interrupt IPI */
#define POSTED_INTR_VECTOR 0xf2
#define POSTED_INTR_WAKEUP_VECTOR 0xf1
#define POSTED_INTR_NESTED_VECTOR 0xf0
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI
2024-05-06 5:29 ` [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
@ 2024-05-07 9:12 ` Peter Zijlstra
2024-05-08 10:06 ` Yanfei Xu
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 9:12 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:34AM +0000, Mingwei Zhang wrote:
> +void kvm_set_guest_pmi_handler(void (*handler)(void))
> +{
> + if (handler) {
> + kvm_guest_pmi_handler = handler;
> + } else {
> + kvm_guest_pmi_handler = dummy_handler;
> + synchronize_rcu();
> + }
> +}
> +EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);
Just for my edification, after synchronize_rcu() nobody should observe
the old handler, but what guarantees there's not still one running?
I'm thinking the fact that these handlers run with IRQs disabled, and
synchronize_rcu() also very much ensures all prior non-preempt sections
are complete?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI
2024-05-07 9:12 ` Peter Zijlstra
@ 2024-05-08 10:06 ` Yanfei Xu
0 siblings, 0 replies; 116+ messages in thread
From: Yanfei Xu @ 2024-05-08 10:06 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, maobibo, Like Xu, kvm, linux-perf-users
On 5/7/2024 5:12 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:34AM +0000, Mingwei Zhang wrote:
>
>> +void kvm_set_guest_pmi_handler(void (*handler)(void))
>> +{
>> + if (handler) {
>> + kvm_guest_pmi_handler = handler;
>> + } else {
>> + kvm_guest_pmi_handler = dummy_handler;
>> + synchronize_rcu();
>> + }
>> +}
>> +EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);
>
> Just for my edification, after synchronize_rcu() nobody should observe
> the old handler, but what guarantees there's not still one running?
An interrupt handler can be regarded as an RCU read-side critical
section; once synchronize_rcu() returns, no one is left accessing the old
handler.
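For illustration, a minimal sketch of the pattern in plain kernel C (names
simplified, not the exact code from the patch):

	static void dummy_handler(void) { }
	static void (*guest_pmi_handler)(void) = dummy_handler;

	/*
	 * IRQ context: runs with IRQs disabled, which synchronize_rcu()
	 * treats as a read-side critical section.
	 */
	static void handle_guest_pmi_vector(void)
	{
		void (*handler)(void) = READ_ONCE(guest_pmi_handler);

		handler();
	}

	/* Unregister path. */
	static void clear_guest_pmi_handler(void)
	{
		WRITE_ONCE(guest_pmi_handler, dummy_handler);
		/* Waits for any in-flight handler invocation to finish. */
		synchronize_rcu();
		/* From here on, the old handler can no longer be running. */
	}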
>
> I'm thinking the fact that these handlers run with IRQs disabled, and
> synchronize_rcu() also very much ensures all prior non-preempt sections
> are complete?
Yes :)
Thanks,
Yanfei
>
>
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (8 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 09/54] perf: core/x86: Register a new vector for KVM GUEST PMI Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-07 9:18 ` Peter Zijlstra
2024-05-06 5:29 ` [PATCH v2 11/54] KVM: x86/pmu: Register guest pmi handler for emulated PMU Mingwei Zhang
` (44 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
KVM needs to register an irq handler for both POSTED_INTR_WAKEUP_VECTOR
and KVM_GUEST_PMI_VECTOR, so extract a common function,
x86_set_kvm_irq_handler(), to reduce the number of exported functions and
the amount of duplicated code.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
---
arch/x86/include/asm/irq.h | 3 +--
arch/x86/kernel/irq.c | 27 +++++++++++----------------
arch/x86/kvm/vmx/vmx.c | 4 ++--
3 files changed, 14 insertions(+), 20 deletions(-)
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 2483f6ef5d4e..050a247b69b4 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -30,8 +30,7 @@ struct irq_desc;
extern void fixup_irqs(void);
#if IS_ENABLED(CONFIG_KVM)
-extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
-void kvm_set_guest_pmi_handler(void (*handler)(void));
+void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void));
#endif
extern void (*x86_platform_ipi_callback)(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 22c10e5c50af..3ada69c50951 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -302,27 +302,22 @@ static void dummy_handler(void) {}
static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
-void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
+void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
{
- if (handler)
+ if (!handler)
+ handler = dummy_handler;
+
+ if (vector == POSTED_INTR_WAKEUP_VECTOR)
kvm_posted_intr_wakeup_handler = handler;
- else {
- kvm_posted_intr_wakeup_handler = dummy_handler;
- synchronize_rcu();
- }
-}
-EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
-
-void kvm_set_guest_pmi_handler(void (*handler)(void))
-{
- if (handler) {
+ else if (vector == KVM_GUEST_PMI_VECTOR)
kvm_guest_pmi_handler = handler;
- } else {
- kvm_guest_pmi_handler = dummy_handler;
+ else
+ WARN_ON_ONCE(1);
+
+ if (handler == dummy_handler)
synchronize_rcu();
- }
}
-EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);
+EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
/*
* Handler for POSTED_INTERRUPT_VECTOR.
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c37a89eda90f..c2dc68a25a53 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8214,7 +8214,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
static void vmx_hardware_unsetup(void)
{
- kvm_set_posted_intr_wakeup_handler(NULL);
+ x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, NULL);
if (nested)
nested_vmx_hardware_unsetup();
@@ -8679,7 +8679,7 @@ static __init int hardware_setup(void)
if (r && nested)
nested_vmx_hardware_unsetup();
- kvm_set_posted_intr_wakeup_handler(pi_wakeup_handler);
+ x86_set_kvm_irq_handler(POSTED_INTR_WAKEUP_VECTOR, pi_wakeup_handler);
return r;
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function
2024-05-06 5:29 ` [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function Mingwei Zhang
@ 2024-05-07 9:18 ` Peter Zijlstra
2024-05-08 8:57 ` Zhang, Xiong Y
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 9:18 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:35AM +0000, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> KVM needs to register irq handler for POSTED_INTR_WAKEUP_VECTOR and
> KVM_GUEST_PMI_VECTOR, a common function x86_set_kvm_irq_handler() is
> extracted to reduce exports function and duplicated code.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> ---
> arch/x86/include/asm/irq.h | 3 +--
> arch/x86/kernel/irq.c | 27 +++++++++++----------------
> arch/x86/kvm/vmx/vmx.c | 4 ++--
> 3 files changed, 14 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
> index 2483f6ef5d4e..050a247b69b4 100644
> --- a/arch/x86/include/asm/irq.h
> +++ b/arch/x86/include/asm/irq.h
> @@ -30,8 +30,7 @@ struct irq_desc;
> extern void fixup_irqs(void);
>
> #if IS_ENABLED(CONFIG_KVM)
> -extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
> -void kvm_set_guest_pmi_handler(void (*handler)(void));
> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void));
> #endif
>
> extern void (*x86_platform_ipi_callback)(void);
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 22c10e5c50af..3ada69c50951 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -302,27 +302,22 @@ static void dummy_handler(void) {}
> static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
> static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
>
> -void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
> {
> - if (handler)
> + if (!handler)
> + handler = dummy_handler;
> +
> + if (vector == POSTED_INTR_WAKEUP_VECTOR)
> kvm_posted_intr_wakeup_handler = handler;
> - else {
> - kvm_posted_intr_wakeup_handler = dummy_handler;
> - synchronize_rcu();
> - }
> -}
> -EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
> -
> -void kvm_set_guest_pmi_handler(void (*handler)(void))
> -{
> - if (handler) {
> + else if (vector == KVM_GUEST_PMI_VECTOR)
> kvm_guest_pmi_handler = handler;
> - } else {
> - kvm_guest_pmi_handler = dummy_handler;
> + else
> + WARN_ON_ONCE(1);
> +
> + if (handler == dummy_handler)
> synchronize_rcu();
> - }
> }
> -EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);
> +EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
Can't you just squash this into the previous patch? I mean, what's the
point of this back and forth?
> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
> {
> + if (!handler)
> + handler = dummy_handler;
> +
> + if (vector == POSTED_INTR_WAKEUP_VECTOR)
> kvm_posted_intr_wakeup_handler = handler;
> + else if (vector == KVM_GUEST_PMI_VECTOR)
> kvm_guest_pmi_handler = handler;
> + else
> + WARN_ON_ONCE(1);
> +
> + if (handler == dummy_handler)
> synchronize_rcu();
> }
> +EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
So what about:
x86_set_kvm_irq_handler(foo, handler1);
x86_set_kvm_irq_handler(foo, handler2);
?
I'm fairly sure you either want to enforce a NULL<->handler transition,
or add some additional synchronize stuff.
Hmm?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function
2024-05-07 9:18 ` Peter Zijlstra
@ 2024-05-08 8:57 ` Zhang, Xiong Y
0 siblings, 0 replies; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-05-08 8:57 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 5/7/2024 5:18 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:35AM +0000, Mingwei Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>
>> KVM needs to register irq handler for POSTED_INTR_WAKEUP_VECTOR and
>> KVM_GUEST_PMI_VECTOR, a common function x86_set_kvm_irq_handler() is
>> extracted to reduce exports function and duplicated code.
>>
>> Suggested-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> ---
>> arch/x86/include/asm/irq.h | 3 +--
>> arch/x86/kernel/irq.c | 27 +++++++++++----------------
>> arch/x86/kvm/vmx/vmx.c | 4 ++--
>> 3 files changed, 14 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
>> index 2483f6ef5d4e..050a247b69b4 100644
>> --- a/arch/x86/include/asm/irq.h
>> +++ b/arch/x86/include/asm/irq.h
>> @@ -30,8 +30,7 @@ struct irq_desc;
>> extern void fixup_irqs(void);
>>
>> #if IS_ENABLED(CONFIG_KVM)
>> -extern void kvm_set_posted_intr_wakeup_handler(void (*handler)(void));
>> -void kvm_set_guest_pmi_handler(void (*handler)(void));
>> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void));
>> #endif
>>
>> extern void (*x86_platform_ipi_callback)(void);
>> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
>> index 22c10e5c50af..3ada69c50951 100644
>> --- a/arch/x86/kernel/irq.c
>> +++ b/arch/x86/kernel/irq.c
>> @@ -302,27 +302,22 @@ static void dummy_handler(void) {}
>> static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
>> static void (*kvm_guest_pmi_handler)(void) = dummy_handler;
>>
>> -void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
>> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>> {
>> - if (handler)
>> + if (!handler)
>> + handler = dummy_handler;
>> +
>> + if (vector == POSTED_INTR_WAKEUP_VECTOR)
>> kvm_posted_intr_wakeup_handler = handler;
>> - else {
>> - kvm_posted_intr_wakeup_handler = dummy_handler;
>> - synchronize_rcu();
>> - }
>> -}
>> -EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
>> -
>> -void kvm_set_guest_pmi_handler(void (*handler)(void))
>> -{
>> - if (handler) {
>> + else if (vector == KVM_GUEST_PMI_VECTOR)
>> kvm_guest_pmi_handler = handler;
>> - } else {
>> - kvm_guest_pmi_handler = dummy_handler;
>> + else
>> + WARN_ON_ONCE(1);
>> +
>> + if (handler == dummy_handler)
>> synchronize_rcu();
>> - }
>> }
>> -EXPORT_SYMBOL_GPL(kvm_set_guest_pmi_handler);
>> +EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
>
> Can't you just squash this into the previous patch? I mean, what's the
> point of this back and forth?
OK, I will put this before the previous patch and let the previous patch
use it directly.
>
>> +void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
>> {
>> + if (!handler)
>> + handler = dummy_handler;
>> +
>> + if (vector == POSTED_INTR_WAKEUP_VECTOR)
>> kvm_posted_intr_wakeup_handler = handler;
>> + else if (vector == KVM_GUEST_PMI_VECTOR)
>> kvm_guest_pmi_handler = handler;
>> + else
>> + WARN_ON_ONCE(1);
>> +
>> + if (handler == dummy_handler)
>> synchronize_rcu();
>> }
>> +EXPORT_SYMBOL_GPL(x86_set_kvm_irq_handler);
>
> So what about:
>
> x86_set_kvm_irq_handler(foo, handler1);
> x86_set_kvm_irq_handler(foo, handler2);
>
> ?
>
> I'm fairly sure you either want to enforce a NULL<->handler transition,
> or add some additional synchronize stuff.
>
> Hmm?
Yes, x86_set_kvm_irq_handler() is called once for each vector at
kvm/kvm_intel module_init() and module_exit(), so we should enforce a
NULL<->handler transition.
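For illustration, one possible shape of that enforcement on top of the
function above (only a sketch, under the assumption that a handler may
only replace the dummy handler, not another live handler; not the actual
v3 change):

	void x86_set_kvm_irq_handler(u8 vector, void (*handler)(void))
	{
		if (!handler)
			handler = dummy_handler;

		if (vector == POSTED_INTR_WAKEUP_VECTOR &&
		    (handler == dummy_handler ||
		     kvm_posted_intr_wakeup_handler == dummy_handler))
			kvm_posted_intr_wakeup_handler = handler;
		else if (vector == KVM_GUEST_PMI_VECTOR &&
			 (handler == dummy_handler ||
			  kvm_guest_pmi_handler == dummy_handler))
			kvm_guest_pmi_handler = handler;
		else
			/* Unknown vector or a handler -> handler transition. */
			WARN_ON_ONCE(1);

		if (handler == dummy_handler)
			synchronize_rcu();
	}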
thanks
>
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 11/54] KVM: x86/pmu: Register guest pmi handler for emulated PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (9 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 10/54] KVM: x86: Extract x86_set_kvm_irq_handler() function Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler Mingwei Zhang
` (43 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Add functions to register/unregister the PMI handler at KVM module
initialization and exit. This allows the host PMU with the passthrough
capability enabled to switch the PMI handler at PMU context switch.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
---
arch/x86/kvm/x86.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ebcc12d1e1de..51b5a88222ef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13887,6 +13887,16 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
}
EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
+static void kvm_handle_guest_pmi(void)
+{
+ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+ if (WARN_ON_ONCE(!vcpu))
+ return;
+
+ kvm_make_request(KVM_REQ_PMI, vcpu);
+}
+
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
@@ -13921,12 +13931,14 @@ static int __init kvm_x86_init(void)
{
kvm_mmu_x86_module_init();
mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
+ x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, kvm_handle_guest_pmi);
return 0;
}
module_init(kvm_x86_init);
static void __exit kvm_x86_exit(void)
{
+ x86_set_kvm_irq_handler(KVM_GUEST_PMI_VECTOR, NULL);
WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
}
module_exit(kvm_x86_exit);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (10 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 11/54] KVM: x86/pmu: Register guest pmi handler for emulated PMU Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-07 9:22 ` Peter Zijlstra
2024-05-07 21:40 ` Chen, Zide
2024-05-06 5:29 ` [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
` (42 subsequent siblings)
54 siblings, 2 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Add x86-specific functions to switch the PMI handler, since the
passthrough PMU and the host PMU use different interrupt vectors.
x86_perf_guest_enter() switches the PMU vector from NMI to
KVM_GUEST_PMI_VECTOR. The guest's LVTPC_MASK value should be reflected
onto the HW to indicate whether the guest has cleared LVTPC_MASK or not,
so the guest LVTPC value is passed as a parameter.
x86_perf_guest_exit() switches the PMU vector from KVM_GUEST_PMI_VECTOR
back to NMI.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 17 +++++++++++++++++
arch/x86/include/asm/perf_event.h | 3 +++
2 files changed, 20 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 09050641ce5d..8167f2230d3a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -701,6 +701,23 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
}
EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
+void x86_perf_guest_enter(u32 guest_lvtpc)
+{
+ lockdep_assert_irqs_disabled();
+
+ apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
+ (guest_lvtpc & APIC_LVT_MASKED));
+}
+EXPORT_SYMBOL_GPL(x86_perf_guest_enter);
+
+void x86_perf_guest_exit(void)
+{
+ lockdep_assert_irqs_disabled();
+
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
+
/*
* There may be PMI landing after enabled=0. The PMI hitting could be before or
* after disable_all.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 3736b8a46c04..807ea9c98567 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -577,6 +577,9 @@ static inline void perf_events_lapic_init(void) { }
static inline void perf_check_microcode(void) { }
#endif
+void x86_perf_guest_enter(u32 guest_lvtpc);
+void x86_perf_guest_exit(void);
+
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
extern void x86_perf_get_lbr(struct x86_pmu_lbr *lbr);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
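For context, a rough sketch of how the hypervisor side would be expected
to use this pair around a guest run (the real KVM call sites come later in
the series; this wrapper function is hypothetical):

	static void run_guest_with_passthrough_pmu(u32 guest_lvtpc)
	{
		lockdep_assert_irqs_disabled();

		/* PMIs raised while the guest runs go to KVM_GUEST_PMI_VECTOR. */
		x86_perf_guest_enter(guest_lvtpc);

		/* ... VM-entry, guest runs, VM-exit ... */

		/* Restore normal NMI-based PMI delivery for the host. */
		x86_perf_guest_exit();
	}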
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-06 5:29 ` [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler Mingwei Zhang
@ 2024-05-07 9:22 ` Peter Zijlstra
2024-05-08 6:58 ` Zhang, Xiong Y
2024-05-07 21:40 ` Chen, Zide
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 9:22 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:37AM +0000, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> Add x86 specific function to switch PMI handler since passthrough PMU and host
> PMU use different interrupt vectors.
>
> x86_perf_guest_enter() switch PMU vector from NMI to KVM_GUEST_PMI_VECTOR,
> and guest LVTPC_MASK value should be reflected onto HW to indicate whether
> guest has cleared LVTPC_MASK or not, so guest lvt_pc is passed as parameter.
>
> x86_perf_guest_exit() switch PMU vector from KVM_GUEST_PMI_VECTOR to NMI.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/events/core.c | 17 +++++++++++++++++
> arch/x86/include/asm/perf_event.h | 3 +++
> 2 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 09050641ce5d..8167f2230d3a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -701,6 +701,23 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
> }
> EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>
> +void x86_perf_guest_enter(u32 guest_lvtpc)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> + (guest_lvtpc & APIC_LVT_MASKED));
> +}
> +EXPORT_SYMBOL_GPL(x86_perf_guest_enter);
> +
> +void x86_perf_guest_exit(void)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + apic_write(APIC_LVTPC, APIC_DM_NMI);
> +}
> +EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
Urgghh... because it makes sense for this bare APIC write to be exported
?!?
Can't this at the very least be hard tied to perf_guest_{enter,exit}() ?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-07 9:22 ` Peter Zijlstra
@ 2024-05-08 6:58 ` Zhang, Xiong Y
2024-05-08 8:37 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-05-08 6:58 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 5/7/2024 5:22 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:37AM +0000, Mingwei Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>
>> Add x86 specific function to switch PMI handler since passthrough PMU and host
>> PMU use different interrupt vectors.
>>
>> x86_perf_guest_enter() switch PMU vector from NMI to KVM_GUEST_PMI_VECTOR,
>> and guest LVTPC_MASK value should be reflected onto HW to indicate whether
>> guest has cleared LVTPC_MASK or not, so guest lvt_pc is passed as parameter.
>>
>> x86_perf_guest_exit() switch PMU vector from KVM_GUEST_PMI_VECTOR to NMI.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/x86/events/core.c | 17 +++++++++++++++++
>> arch/x86/include/asm/perf_event.h | 3 +++
>> 2 files changed, 20 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 09050641ce5d..8167f2230d3a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -701,6 +701,23 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
>> }
>> EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>>
>> +void x86_perf_guest_enter(u32 guest_lvtpc)
>> +{
>> + lockdep_assert_irqs_disabled();
>> +
>> + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>> + (guest_lvtpc & APIC_LVT_MASKED));
>> +}
>> +EXPORT_SYMBOL_GPL(x86_perf_guest_enter);
>> +
>> +void x86_perf_guest_exit(void)
>> +{
>> + lockdep_assert_irqs_disabled();
>> +
>> + apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +}
>> +EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
>
> Urgghh... because it makes sense for this bare APIC write to be exported
> ?!?
Usually KVM doesn't directly access HW other than VMX, and instead requests other
components to access HW to avoid conflicts. APIC_LVTPC is managed by the x86
perf driver, so I added two functions here and exported them.
>
> Can't this at the very least be hard tied to perf_guest_{enter,exit}() ?
perf_guest_{enter,exit}() is called from this function in another commit; I
should merge that commit into this one, per your suggestion in the other
email.
thanks
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-08 6:58 ` Zhang, Xiong Y
@ 2024-05-08 8:37 ` Peter Zijlstra
2024-05-09 7:30 ` Zhang, Xiong Y
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-08 8:37 UTC (permalink / raw)
To: Zhang, Xiong Y
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On Wed, May 08, 2024 at 02:58:30PM +0800, Zhang, Xiong Y wrote:
> On 5/7/2024 5:22 PM, Peter Zijlstra wrote:
> > On Mon, May 06, 2024 at 05:29:37AM +0000, Mingwei Zhang wrote:
> >> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >> +void x86_perf_guest_enter(u32 guest_lvtpc)
> >> +{
> >> + lockdep_assert_irqs_disabled();
> >> +
> >> + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> >> + (guest_lvtpc & APIC_LVT_MASKED));
> >> +}
> >> +EXPORT_SYMBOL_GPL(x86_perf_guest_enter);
> >> +
> >> +void x86_perf_guest_exit(void)
> >> +{
> >> + lockdep_assert_irqs_disabled();
> >> +
> >> + apic_write(APIC_LVTPC, APIC_DM_NMI);
> >> +}
> >> +EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
> >
> > Urgghh... because it makes sense for this bare APIC write to be exported
> > ?!?
> Usually KVM doesn't access HW except vmx directly and requests other
> components to access HW to avoid confliction, APIC_LVTPC is managed by x86
> perf driver, so I added two functions here and exported them.
Yes, I understand how you got here. But as with everything you export,
you should ask yourself, should I export this. The above
x86_perf_guest_enter() function allows any module to write random LVTPC
entries. That's not a good thing to export.
I utterly detest how KVM is a module and ends up exporting a ton of
stuff that *really* should not be exported.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-08 8:37 ` Peter Zijlstra
@ 2024-05-09 7:30 ` Zhang, Xiong Y
0 siblings, 0 replies; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-05-09 7:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das,
Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, kvm, linux-perf-users
On 5/8/2024 4:37 PM, Peter Zijlstra wrote:
> On Wed, May 08, 2024 at 02:58:30PM +0800, Zhang, Xiong Y wrote:
>> On 5/7/2024 5:22 PM, Peter Zijlstra wrote:
>>> On Mon, May 06, 2024 at 05:29:37AM +0000, Mingwei Zhang wrote:
>>>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
>>>> +void x86_perf_guest_enter(u32 guest_lvtpc)
>>>> +{
>>>> + lockdep_assert_irqs_disabled();
>>>> +
>>>> + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>>>> + (guest_lvtpc & APIC_LVT_MASKED));
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(x86_perf_guest_enter);
>>>> +
>>>> +void x86_perf_guest_exit(void)
>>>> +{
>>>> + lockdep_assert_irqs_disabled();
>>>> +
>>>> + apic_write(APIC_LVTPC, APIC_DM_NMI);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
>>>
>>> Urgghh... because it makes sense for this bare APIC write to be exported
>>> ?!?
>> Usually KVM doesn't access HW except vmx directly and requests other
>> components to access HW to avoid confliction, APIC_LVTPC is managed by x86
>> perf driver, so I added two functions here and exported them.
>
> Yes, I understand how you got here. But as with everything you export,
> you should ask yourself, should I export this. The above
> x86_perf_guest_enter() function allows any module to write random LVTPC
> entries. That's not a good thing to export.
Totally agree with your concern. Here KVM needs to switch the PMI vector at PMU
context switch, and the export isn't good. Could you kindly give a guideline on
how to design or improve such an interface?
I thought of the following two methods, but they are worse than this commit:
1. Perf registers a notification to KVM, but this makes perf depend on KVM.
2. KVM writes APIC_LVTPC directly, but this needs x86 to export apic_write().
thanks
>
> I utterly detest how KVM is a module and ends up exporting a ton of
> stuff that *really* should not be exported.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-06 5:29 ` [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler Mingwei Zhang
2024-05-07 9:22 ` Peter Zijlstra
@ 2024-05-07 21:40 ` Chen, Zide
2024-05-08 3:44 ` Mi, Dapeng
2024-05-30 5:12 ` Mingwei Zhang
1 sibling, 2 replies; 116+ messages in thread
From: Chen, Zide @ 2024-05-07 21:40 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> Add x86 specific function to switch PMI handler since passthrough PMU and host
> PMU use different interrupt vectors.
>
> x86_perf_guest_enter() switch PMU vector from NMI to KVM_GUEST_PMI_VECTOR,
> and guest LVTPC_MASK value should be reflected onto HW to indicate whether
> guest has cleared LVTPC_MASK or not, so guest lvt_pc is passed as parameter.
>
> x86_perf_guest_exit() switch PMU vector from KVM_GUEST_PMI_VECTOR to NMI.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/events/core.c | 17 +++++++++++++++++
> arch/x86/include/asm/perf_event.h | 3 +++
> 2 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 09050641ce5d..8167f2230d3a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -701,6 +701,23 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
> }
> EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>
> +void x86_perf_guest_enter(u32 guest_lvtpc)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> + (guest_lvtpc & APIC_LVT_MASKED));
If CONFIG_KVM is not defined, KVM_GUEST_PMI_VECTOR is not available and
it causes a compile error.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-07 21:40 ` Chen, Zide
@ 2024-05-08 3:44 ` Mi, Dapeng
2024-05-30 5:12 ` Mingwei Zhang
1 sibling, 0 replies; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-08 3:44 UTC (permalink / raw)
To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/8/2024 5:40 AM, Chen, Zide wrote:
>
> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>
>> Add x86 specific function to switch PMI handler since passthrough PMU and host
>> PMU use different interrupt vectors.
>>
>> x86_perf_guest_enter() switch PMU vector from NMI to KVM_GUEST_PMI_VECTOR,
>> and guest LVTPC_MASK value should be reflected onto HW to indicate whether
>> guest has cleared LVTPC_MASK or not, so guest lvt_pc is passed as parameter.
>>
>> x86_perf_guest_exit() switch PMU vector from KVM_GUEST_PMI_VECTOR to NMI.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/x86/events/core.c | 17 +++++++++++++++++
>> arch/x86/include/asm/perf_event.h | 3 +++
>> 2 files changed, 20 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 09050641ce5d..8167f2230d3a 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -701,6 +701,23 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
>> }
>> EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
>>
>> +void x86_perf_guest_enter(u32 guest_lvtpc)
>> +{
>> + lockdep_assert_irqs_disabled();
>> +
>> + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>> + (guest_lvtpc & APIC_LVT_MASKED));
> If CONFIG_KVM is not defined, KVM_GUEST_PMI_VECTOR is not available and
> it causes compiling error.
Zide, thanks. If CONFIG_KVM is not defined, these two helpers would not
need to be called; they can be defined as empty stubs.
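A minimal sketch of what such stubs could look like next to the existing
declarations in arch/x86/include/asm/perf_event.h; the guard symbol is an
assumption (whichever config option makes KVM_GUEST_PMI_VECTOR available):

#if IS_ENABLED(CONFIG_KVM)
void x86_perf_guest_enter(u32 guest_lvtpc);
void x86_perf_guest_exit(void);
#else
/* Without KVM there is no guest PMI vector to switch to, so do nothing. */
static inline void x86_perf_guest_enter(u32 guest_lvtpc) { }
static inline void x86_perf_guest_exit(void) { }
#endif

The definitions in arch/x86/events/core.c would need the same guard so that the
KVM_GUEST_PMI_VECTOR reference compiles out as well.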
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler
2024-05-07 21:40 ` Chen, Zide
2024-05-08 3:44 ` Mi, Dapeng
@ 2024-05-30 5:12 ` Mingwei Zhang
1 sibling, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 5:12 UTC (permalink / raw)
To: Chen, Zide
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Tue, May 07, 2024, Chen, Zide wrote:
>
>
> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> > From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >
> > Add x86 specific function to switch PMI handler since passthrough PMU and host
> > PMU use different interrupt vectors.
> >
> > x86_perf_guest_enter() switch PMU vector from NMI to KVM_GUEST_PMI_VECTOR,
> > and guest LVTPC_MASK value should be reflected onto HW to indicate whether
> > guest has cleared LVTPC_MASK or not, so guest lvt_pc is passed as parameter.
> >
> > x86_perf_guest_exit() switch PMU vector from KVM_GUEST_PMI_VECTOR to NMI.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> > arch/x86/events/core.c | 17 +++++++++++++++++
> > arch/x86/include/asm/perf_event.h | 3 +++
> > 2 files changed, 20 insertions(+)
> >
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index 09050641ce5d..8167f2230d3a 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -701,6 +701,23 @@ struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
> > }
> > EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
> >
> > +void x86_perf_guest_enter(u32 guest_lvtpc)
> > +{
> > + lockdep_assert_irqs_disabled();
> > +
> > + apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> > + (guest_lvtpc & APIC_LVT_MASKED));
>
> If CONFIG_KVM is not defined, KVM_GUEST_PMI_VECTOR is not available and
> it causes compiling error.
That is a good discovery, thanks. hmm, we could put the whole function
under IS_ENABLED(CONFIG_KVM) to avoid that.
Thanks.
-Mingwei
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (11 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 12/54] perf: x86: Add x86 function to switch PMI handler Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-07 9:33 ` Peter Zijlstra
2024-05-06 5:29 ` [PATCH v2 14/54] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
` (41 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
If a guest PMI is delivered after VM-exit, the KVM maskable interrupt will
be held pending until EFLAGS.IF is set. In the meantime, if the logical
processor receives an NMI for any reason at all, perf_event_nmi_handler()
will be invoked. If there is any active perf event anywhere on the system,
x86_pmu_handle_irq() will be invoked, and it will clear
IA32_PERF_GLOBAL_STATUS. By the time KVM's PMI handler is invoked, it will
be a mystery which counter(s) overflowed.
When the LVTPC is using the KVM PMI vector, the PMU is owned by the guest. A
host NMI would still let x86_pmu_handle_irq() run, which restores the PMU
vector to NMI and clears IA32_PERF_GLOBAL_STATUS; this breaks the guest vPMU
passthrough environment.
So modify perf_event_nmi_handler() to check perf_is_guest_context_loaded(),
and if so, simply return without calling x86_pmu_handle_irq().
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 19 ++++++++++++++++++-
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 5 +++++
3 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8167f2230d3a..c0f6e294fcad 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -726,7 +726,7 @@ EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
* It will not be re-enabled in the NMI handler again, because enabled=0. After
* handling the NMI, disable_all will be called, which will not change the
* state either. If PMI hits after disable_all, the PMU is already disabled
- * before entering NMI handler. The NMI handler will not change the state
+ * before entering NMI handler. The NMI handler will no change the state
* either.
*
* So either situation is harmless.
@@ -1749,6 +1749,23 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
u64 finish_clock;
int ret;
+ /*
+ * When guest pmu context is loaded this handler should be forbidden from
+ * running, the reasons are:
+ * 1. After x86_perf_guest_enter() is called, and before cpu enter into
+ * non-root mode, NMI could happen, but x86_pmu_handle_irq() restore PMU
+ * to use NMI vector, which destroy KVM PMI vector setting.
+ * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
+ * call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
+ * guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
+ * clear global_status MSR which has guest status now, then this destroy
+ * guest PMU status.
+ * 3. After VM exit, but before KVM save guest PMU context, host NMI other
+ * than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
+ * which has guest status now, then this destroy guest PMU status.
+ */
+ if (perf_is_guest_context_loaded())
+ return 0;
/*
* All PMUs/events that share this PMI handler should make sure to
* increment active_events for their events.
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index acf16676401a..5da7de42954e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1736,6 +1736,7 @@ extern int perf_get_mediated_pmu(void);
extern void perf_put_mediated_pmu(void);
void perf_guest_enter(void);
void perf_guest_exit(void);
+bool perf_is_guest_context_loaded(void);
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1830,6 +1831,10 @@ static inline int perf_get_mediated_pmu(void)
static inline void perf_put_mediated_pmu(void) { }
static inline void perf_guest_enter(void) { }
static inline void perf_guest_exit(void) { }
+static inline bool perf_is_guest_context_loaded(void)
+{
+ return false;
+}
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c6daf5cc923..184d06c23391 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5895,6 +5895,11 @@ void perf_guest_exit(void)
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
}
+bool perf_is_guest_context_loaded(void)
+{
+ return __this_cpu_read(perf_in_guest);
+}
+
/*
* Holding the top-level event's child_mutex means that any
* descendant process that has inherited this event will block
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU
2024-05-06 5:29 ` [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
@ 2024-05-07 9:33 ` Peter Zijlstra
2024-05-09 7:39 ` Zhang, Xiong Y
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 9:33 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:29:38AM +0000, Mingwei Zhang wrote:
> @@ -1749,6 +1749,23 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
> u64 finish_clock;
> int ret;
>
> + /*
> + * When guest pmu context is loaded this handler should be forbidden from
> + * running, the reasons are:
> + * 1. After x86_perf_guest_enter() is called, and before cpu enter into
> + * non-root mode, NMI could happen, but x86_pmu_handle_irq() restore PMU
> + * to use NMI vector, which destroy KVM PMI vector setting.
> + * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
> + * call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
> + * guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
> + * clear global_status MSR which has guest status now, then this destroy
> + * guest PMU status.
> + * 3. After VM exit, but before KVM save guest PMU context, host NMI other
> + * than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
> + * which has guest status now, then this destroy guest PMU status.
> + */
> + if (perf_is_guest_context_loaded())
> + return 0;
A function call makes sense because? Also, isn't this naming at least a
very little misleading? Specifically this is about passthrough, not
guest context per se.
> /*
> * All PMUs/events that share this PMI handler should make sure to
> * increment active_events for their events.
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index acf16676401a..5da7de42954e 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1736,6 +1736,7 @@ extern int perf_get_mediated_pmu(void);
> extern void perf_put_mediated_pmu(void);
> void perf_guest_enter(void);
> void perf_guest_exit(void);
> +bool perf_is_guest_context_loaded(void);
> #else /* !CONFIG_PERF_EVENTS: */
> static inline void *
> perf_aux_output_begin(struct perf_output_handle *handle,
> @@ -1830,6 +1831,10 @@ static inline int perf_get_mediated_pmu(void)
> static inline void perf_put_mediated_pmu(void) { }
> static inline void perf_guest_enter(void) { }
> static inline void perf_guest_exit(void) { }
> +static inline bool perf_is_guest_context_loaded(void)
> +{
> + return false;
> +}
> #endif
>
> #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4c6daf5cc923..184d06c23391 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5895,6 +5895,11 @@ void perf_guest_exit(void)
> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
> }
>
> +bool perf_is_guest_context_loaded(void)
> +{
> + return __this_cpu_read(perf_in_guest);
> +}
> +
> /*
> * Holding the top-level event's child_mutex means that any
> * descendant process that has inherited this event will block
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU
2024-05-07 9:33 ` Peter Zijlstra
@ 2024-05-09 7:39 ` Zhang, Xiong Y
0 siblings, 0 replies; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-05-09 7:39 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 5/7/2024 5:33 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:29:38AM +0000, Mingwei Zhang wrote:
>
>> @@ -1749,6 +1749,23 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>> u64 finish_clock;
>> int ret;
>>
>> + /*
>> + * When guest pmu context is loaded this handler should be forbidden from
>> + * running, the reasons are:
>> + * 1. After x86_perf_guest_enter() is called, and before cpu enter into
>> + * non-root mode, NMI could happen, but x86_pmu_handle_irq() restore PMU
>> + * to use NMI vector, which destroy KVM PMI vector setting.
>> + * 2. When VM is running, host NMI other than PMI causes VM exit, KVM will
>> + * call host NMI handler (vmx_vcpu_enter_exit()) first before KVM save
>> + * guest PMU context (kvm_pmu_save_pmu_context()), as x86_pmu_handle_irq()
>> + * clear global_status MSR which has guest status now, then this destroy
>> + * guest PMU status.
>> + * 3. After VM exit, but before KVM save guest PMU context, host NMI other
>> + * than PMI could happen, x86_pmu_handle_irq() clear global_status MSR
>> + * which has guest status now, then this destroy guest PMU status.
>> + */
>> + if (perf_is_guest_context_loaded())
>> + return 0;
>
> A function call makes sense because? Also, isn't this naming at least
The purpose of the function call is to re-use the per-cpu variable defined in
the perf core; otherwise another per-cpu variable would have to be defined in
arch/x86/events/core.c. Whether to use a function call or a per-cpu variable
depends on the interface between perf and KVM.
> very little misleading? Specifically this is about passthrough, not
> guest context per se.
>
>> /*
>> * All PMUs/events that share this PMI handler should make sure to
>> * increment active_events for their events.
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index acf16676401a..5da7de42954e 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1736,6 +1736,7 @@ extern int perf_get_mediated_pmu(void);
>> extern void perf_put_mediated_pmu(void);
>> void perf_guest_enter(void);
>> void perf_guest_exit(void);
>> +bool perf_is_guest_context_loaded(void);
>> #else /* !CONFIG_PERF_EVENTS: */
>> static inline void *
>> perf_aux_output_begin(struct perf_output_handle *handle,
>> @@ -1830,6 +1831,10 @@ static inline int perf_get_mediated_pmu(void)
>> static inline void perf_put_mediated_pmu(void) { }
>> static inline void perf_guest_enter(void) { }
>> static inline void perf_guest_exit(void) { }
>> +static inline bool perf_is_guest_context_loaded(void)
>> +{
>> + return false;
>> +}
>> #endif
>>
>> #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 4c6daf5cc923..184d06c23391 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -5895,6 +5895,11 @@ void perf_guest_exit(void)
>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> }
>>
>> +bool perf_is_guest_context_loaded(void)
>> +{
>> + return __this_cpu_read(perf_in_guest);
>> +}
>> +
>> /*
>> * Holding the top-level event's child_mutex means that any
>> * descendant process that has inherited this event will block
>> --
>> 2.45.0.rc1.225.g2a3ae87e7f-goog
>>
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 14/54] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (12 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 13/54] perf: core/x86: Forbid PMI handler when guest own PMU Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 15/54] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
` (40 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Plumb the passthrough PMU capability through to x86_pmu_cap in order to let any
kernel entity such as KVM know that the host PMU supports passthrough PMU mode
and has the implementation.
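For illustration, a minimal sketch of how a consumer such as KVM could read the
new bit; the helper name is hypothetical, and only perf_get_x86_pmu_capability()
and the new 'passthrough' field come from this patch:

static bool host_pmu_has_passthrough(void)
{
	struct x86_pmu_capability cap;

	/* Ask the perf core for the host PMU capabilities. */
	perf_get_x86_pmu_capability(&cap);

	return cap.passthrough;
}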
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 1 +
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/perf_event.h | 1 +
4 files changed, 4 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index c0f6e294fcad..f5a043410614 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3023,6 +3023,7 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
cap->events_mask = (unsigned int)x86_pmu.events_maskl;
cap->events_mask_len = x86_pmu.events_mask_len;
cap->pebs_ept = x86_pmu.pebs_ept;
+ cap->passthrough = !!(x86_pmu.flags & PMU_FL_PASSTHROUGH);
}
EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 4d8f907a9416..62d327cb5424 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6246,6 +6246,7 @@ __init int intel_pmu_init(void)
/* The perf side of core PMU is ready to support the passthrough vPMU. */
x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+ x86_pmu.flags |= PMU_FL_PASSTHROUGH;
/*
* Install the hw-cache-events table:
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index fb56518356ec..bdf6d114d05a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1020,6 +1020,7 @@ do { \
#define PMU_FL_MEM_LOADS_AUX 0x100 /* Require an auxiliary event for the complete memory info */
#define PMU_FL_RETIRE_LATENCY 0x200 /* Support Retire Latency in PEBS */
#define PMU_FL_BR_CNTR 0x400 /* Support branch counter logging */
+#define PMU_FL_PASSTHROUGH 0x800 /* Support passthrough mode */
#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 807ea9c98567..39a6379162bc 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -258,6 +258,7 @@ struct x86_pmu_capability {
unsigned int events_mask;
int events_mask_len;
unsigned int pebs_ept :1;
+ unsigned int passthrough :1;
};
/*
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 15/54] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (13 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 14/54] perf: core/x86: Plumb passthrough PMU capability from x86_pmu to x86_pmu_cap Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 16/54] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
` (39 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Introduce enable_passthrough_pmu as a read-only KVM kernel module parameter.
The variable is true only when all of the following conditions are satisfied:
- it is set to true when the module is loaded.
- enable_pmu is true.
- the host is running on an Intel CPU.
- the host supports PerfMon v4.
- the host PMU supports passthrough mode.
The value is always read-only because passthrough PMU currently does not
support features like LBR and PEBS, while the emulated PMU does. This will end
up with two different values for kvm_cap.supported_perf_cap, which is
initialized at module load time. Maintaining two different perf
capabilities will add complexity. Further, there is not enough motivation
to support running two types of PMU implementations at the same time,
although it is possible/feasible in reality.
Finally, always propagate enable_passthrough_pmu and perf_capabilities into
kvm->arch for each KVM instance.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.h | 14 ++++++++++++++
arch/x86/kvm/vmx/vmx.c | 7 +++++--
arch/x86/kvm/x86.c | 8 ++++++++
arch/x86/kvm/x86.h | 1 +
5 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6efd1497b026..9851f0c8e91b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1402,6 +1402,7 @@ struct kvm_arch {
bool bus_lock_detection_enabled;
bool enable_pmu;
+ bool enable_passthrough_pmu;
u32 notify_window;
u32 notify_vmexit_flags;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 4d52b0b539ba..cf93be5e7359 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -208,6 +208,20 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
enable_pmu = false;
}
+ /* Pass-through vPMU is only supported in Intel CPUs. */
+ if (!is_intel)
+ enable_passthrough_pmu = false;
+
+ /*
+ * Pass-through vPMU requires at least PerfMon version 4 because the
+ * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
+ * for counter emulation as well as PMU context switch. In addition, it
+ * requires host PMU support on passthrough mode. Disable pass-through
+ * vPMU if any condition fails.
+ */
+ if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
+ enable_passthrough_pmu = false;
+
if (!enable_pmu) {
memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
return;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c2dc68a25a53..af253cfa5c37 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -144,6 +144,8 @@ module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
extern bool __read_mostly allow_smaller_maxphyaddr;
module_param(allow_smaller_maxphyaddr, bool, S_IRUGO);
+module_param(enable_passthrough_pmu, bool, 0444);
+
#define KVM_VM_CR0_ALWAYS_OFF (X86_CR0_NW | X86_CR0_CD)
#define KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST X86_CR0_NE
#define KVM_VM_CR0_ALWAYS_ON \
@@ -7874,13 +7876,14 @@ static u64 vmx_get_perf_capabilities(void)
if (boot_cpu_has(X86_FEATURE_PDCM))
rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
- if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR)) {
+ if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
+ !enable_passthrough_pmu) {
x86_perf_get_lbr(&lbr);
if (lbr.nr)
perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
}
- if (vmx_pebs_supported()) {
+ if (vmx_pebs_supported() && !enable_passthrough_pmu) {
perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
if ((perf_cap & PERF_CAP_PEBS_FORMAT) < 4)
perf_cap &= ~PERF_CAP_PEBS_BASELINE;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51b5a88222ef..4c289fcb34fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -193,6 +193,10 @@ bool __read_mostly enable_pmu = true;
EXPORT_SYMBOL_GPL(enable_pmu);
module_param(enable_pmu, bool, 0444);
+/* Enable/disable mediated passthrough PMU virtualization */
+bool __read_mostly enable_passthrough_pmu;
+EXPORT_SYMBOL_GPL(enable_passthrough_pmu);
+
bool __read_mostly eager_page_split = true;
module_param(eager_page_split, bool, 0644);
@@ -6666,6 +6670,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
mutex_lock(&kvm->lock);
if (!kvm->created_vcpus) {
kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
+ /* Disable passthrough PMU if enable_pmu is false. */
+ if (!kvm->arch.enable_pmu)
+ kvm->arch.enable_passthrough_pmu = false;
r = 0;
}
mutex_unlock(&kvm->lock);
@@ -12564,6 +12571,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm->arch.default_tsc_khz = max_tsc_khz ? : tsc_khz;
kvm->arch.guest_can_read_msr_platform_info = true;
kvm->arch.enable_pmu = enable_pmu;
+ kvm->arch.enable_passthrough_pmu = enable_passthrough_pmu;
#if IS_ENABLED(CONFIG_HYPERV)
spin_lock_init(&kvm->arch.hv_root_tdp_lock);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a8b71803777b..d5cc008e18f5 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -330,6 +330,7 @@ extern u64 host_arch_capabilities;
extern struct kvm_caps kvm_caps;
extern bool enable_pmu;
+extern bool enable_passthrough_pmu;
/*
* Get a filtered version of KVM's supported XCR0 that strips out dynamic
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 16/54] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (14 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 15/54] KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
` (38 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Plumb the pass-through PMU setting from kvm->arch into kvm_pmu on each vcpu
created. Note that enabling the PMU is decided by the VMM when it sets the
CPUID bits exposed to the guest VM, so plumb the enabling through for each pmu
in intel_pmu_refresh().
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 12 +++++++++---
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9851f0c8e91b..19b924c3bd85 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -592,6 +592,8 @@ struct kvm_pmu {
* redundant check before cleanup if guest don't use vPMU at all.
*/
u8 event_count;
+
+ bool passthrough;
};
struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a593b03c9aed..5768ea2935e9 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -797,6 +797,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
memset(pmu, 0, sizeof(*pmu));
static_call(kvm_x86_pmu_init)(vcpu);
+ pmu->passthrough = false;
kvm_pmu_refresh(vcpu);
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 12ade343a17e..0ed71f825e6b 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -470,15 +470,21 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
return;
entry = kvm_find_cpuid_entry(vcpu, 0xa);
- if (!entry)
+ if (!entry || !vcpu->kvm->arch.enable_pmu) {
+ pmu->passthrough = false;
return;
-
+ }
eax.full = entry->eax;
edx.full = entry->edx;
pmu->version = eax.split.version_id;
- if (!pmu->version)
+ if (!pmu->version) {
+ pmu->passthrough = false;
return;
+ }
+
+ pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu &&
+ lapic_in_kernel(vcpu);
pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
kvm_pmu_cap.num_counters_gp);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (15 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 16/54] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-08 4:18 ` Mi, Dapeng
2024-05-06 5:29 ` [PATCH v2 18/54] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Mingwei Zhang
` (37 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Currently, the global control bits for a vcpu are restored to the reset
state only if the guest PMU version is less than 2. This works for
emulated PMU as the MSRs are intercepted and backing events are created
for and managed by the host PMU [1].
If such a guest is run with the passthrough PMU, the counters no longer work
because the global enable bits are cleared. Hence, set the global enable
bits to their reset state if passthrough PMU is used.
A passthrough-capable host may not necessarily support PMU version 2 and
it can choose to restore or save the global control state from struct
kvm_pmu in the PMU context save and restore helpers depending on the
availability of the global control register.
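For illustration, a minimal, Intel-flavored sketch of the kind of guard the
save/restore helpers could use; the helper name is hypothetical and the version
check merely stands in for "global control exists in hardware":

static void kvm_pmu_restore_global_ctrl(struct kvm_pmu *pmu)
{
	/*
	 * Only touch the hardware MSR when the host actually has it; the
	 * value to restore lives in pmu->global_ctrl regardless of whether
	 * the register is exposed to the guest.
	 */
	if (kvm_pmu_cap.version >= 2)
		wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, pmu->global_ctrl);
}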
[1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");
Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
[removed the fixes tag]
---
arch/x86/kvm/pmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5768ea2935e9..e656f72fdace 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
* in the global controls). Emulate that behavior when refreshing the
* PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
*/
- if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
+ if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-06 5:29 ` [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
@ 2024-05-08 4:18 ` Mi, Dapeng
2024-05-08 4:36 ` Mingwei Zhang
0 siblings, 1 reply; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-08 4:18 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/6/2024 1:29 PM, Mingwei Zhang wrote:
> From: Sandipan Das <sandipan.das@amd.com>
>
> Currently, the global control bits for a vcpu are restored to the reset
> state only if the guest PMU version is less than 2. This works for
> emulated PMU as the MSRs are intercepted and backing events are created
> for and managed by the host PMU [1].
>
> If such a guest in run with passthrough PMU, the counters no longer work
> because the global enable bits are cleared. Hence, set the global enable
> bits to their reset state if passthrough PMU is used.
>
> A passthrough-capable host may not necessarily support PMU version 2 and
> it can choose to restore or save the global control state from struct
> kvm_pmu in the PMU context save and restore helpers depending on the
> availability of the global control register.
>
> [1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");
> Reported-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> [removed the fixes tag]
> ---
> arch/x86/kvm/pmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 5768ea2935e9..e656f72fdace 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
> * in the global controls). Emulate that behavior when refreshing the
> * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> */
> - if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
> + if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
The logic seems not correct. We could support perfmon version 1 for
mediated vPMU (passthrough vPMU) as well in the future. pmu->passthrough
being true doesn't guarantee that the GLOBAL_CTRL MSR always exists.
> pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
> }
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-08 4:18 ` Mi, Dapeng
@ 2024-05-08 4:36 ` Mingwei Zhang
2024-05-08 6:27 ` Mi, Dapeng
0 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-08 4:36 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Kan Liang,
Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Tue, May 7, 2024 at 9:19 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>
>
> On 5/6/2024 1:29 PM, Mingwei Zhang wrote:
> > From: Sandipan Das <sandipan.das@amd.com>
> >
> > Currently, the global control bits for a vcpu are restored to the reset
> > state only if the guest PMU version is less than 2. This works for
> > emulated PMU as the MSRs are intercepted and backing events are created
> > for and managed by the host PMU [1].
> >
> > If such a guest in run with passthrough PMU, the counters no longer work
> > because the global enable bits are cleared. Hence, set the global enable
> > bits to their reset state if passthrough PMU is used.
> >
> > A passthrough-capable host may not necessarily support PMU version 2 and
> > it can choose to restore or save the global control state from struct
> > kvm_pmu in the PMU context save and restore helpers depending on the
> > availability of the global control register.
> >
> > [1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");
> > Reported-by: Mingwei Zhang <mizhang@google.com>
> > Signed-off-by: Sandipan Das <sandipan.das@amd.com>
> > [removed the fixes tag]
> > ---
> > arch/x86/kvm/pmu.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 5768ea2935e9..e656f72fdace 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
> > * in the global controls). Emulate that behavior when refreshing the
> > * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
> > */
> > - if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
> > + if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
>
> The logic seems not correct. we could support perfmon version 1 for
> meidated vPMU (passthrough vPMU) as well in the future. pmu->passthrough
> is ture doesn't guarantee GLOBAL_CTRL MSR always exists.
heh, the logic is correct here. However, I would say the code change
may not reflect that clearly.
The if condition combines the handling of global ctrl registers for
both the legacy vPMU and the mediated passthrough vPMU.
In legacy pmu, the logic should be this:
if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
Because, since KVM emulates the MSR, if the global ctrl register does
not exist, then there is no point resetting it to any value. However,
if it does exist and there is a non-zero number of GP counters, we should
reset it to some value (all enabling bits are set for GP counters)
according to SDM.
The logic for mediated passthrough PMU is different as follows:
if (pmu->passthrough && pmu->nr_arch_gp_counters)
Since mediated passthrough PMU requires PerfMon v4 in Intel (PerfMon
v2 in AMD), once it is enabled (pmu->passthrough = true), then global
ctrl _must_ exist physically. Regardless of whether we expose it to
the guest VM, at reset time, we need to ensure enabling bits for GP
counters are set (behind the screen). This is critical for AMD, since
most of the guests are usually in (AMD) PerfMon v1 in which global
ctrl MSR is inaccessible, but does exist and is operating in HW.
Yes, if we eliminate that requirement (pmu->passthrough -> Perfmon v4
Intel / Perfmon v2 AMD), then this code will have to change. However,
that is currently not in our RFCv2.
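For illustration, a sketch of the same condition written out so the assumption
is visible at the call site (equivalent to the one-liner in the patch, not a
proposed change):

	bool has_global_ctrl = pmu->passthrough ||
			       kvm_pmu_has_perf_global_ctrl(pmu);

	/*
	 * Passthrough: global ctrl exists in HW even if not exposed to the
	 * guest. Emulated: global ctrl must be architecturally visible.
	 */
	if (has_global_ctrl && pmu->nr_arch_gp_counters)
		pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);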
Thanks.
-Mingwei
>
>
> > pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
> > }
> >
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-08 4:36 ` Mingwei Zhang
@ 2024-05-08 6:27 ` Mi, Dapeng
2024-05-08 14:13 ` Sean Christopherson
0 siblings, 1 reply; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-08 6:27 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Kan Liang,
Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On 5/8/2024 12:36 PM, Mingwei Zhang wrote:
> On Tue, May 7, 2024 at 9:19 PM Mi, Dapeng <dapeng1.mi@linux.intel.com> wrote:
>>
>> On 5/6/2024 1:29 PM, Mingwei Zhang wrote:
>>> From: Sandipan Das <sandipan.das@amd.com>
>>>
>>> Currently, the global control bits for a vcpu are restored to the reset
>>> state only if the guest PMU version is less than 2. This works for
>>> emulated PMU as the MSRs are intercepted and backing events are created
>>> for and managed by the host PMU [1].
>>>
>>> If such a guest in run with passthrough PMU, the counters no longer work
>>> because the global enable bits are cleared. Hence, set the global enable
>>> bits to their reset state if passthrough PMU is used.
>>>
>>> A passthrough-capable host may not necessarily support PMU version 2 and
>>> it can choose to restore or save the global control state from struct
>>> kvm_pmu in the PMU context save and restore helpers depending on the
>>> availability of the global control register.
>>>
>>> [1] 7b46b733bdb4 ("KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET"");
>>> Reported-by: Mingwei Zhang <mizhang@google.com>
>>> Signed-off-by: Sandipan Das <sandipan.das@amd.com>
>>> [removed the fixes tag]
>>> ---
>>> arch/x86/kvm/pmu.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>> index 5768ea2935e9..e656f72fdace 100644
>>> --- a/arch/x86/kvm/pmu.c
>>> +++ b/arch/x86/kvm/pmu.c
>>> @@ -787,7 +787,7 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
>>> * in the global controls). Emulate that behavior when refreshing the
>>> * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
>>> */
>>> - if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
>>> + if ((pmu->passthrough || kvm_pmu_has_perf_global_ctrl(pmu)) && pmu->nr_arch_gp_counters)
>> The logic seems not correct. we could support perfmon version 1 for
>> meidated vPMU (passthrough vPMU) as well in the future. pmu->passthrough
>> is ture doesn't guarantee GLOBAL_CTRL MSR always exists.
> heh, the logic is correct here. However, I would say the code change
> may not reflect that clearly.
>
> The if condition combines the handling of global ctrl registers for
> both the legacy vPMU and the mediated passthrough vPMU.
>
> In legacy pmu, the logic should be this:
>
> if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
>
> Because, since KVM emulates the MSR, if the global ctrl register does
> not exist, then there is no point resetting it to any value. However,
> if it does exist, there are non-zero number of GP counters, we should
> reset it to some value (all enabling bits are set for GP counters)
> according to SDM.
>
> The logic for mediated passthrough PMU is different as follows:
>
> if (pmu->passthrough && pmu->nr_arch_gp_counters)
>
> Since mediated passthrough PMU requires PerfMon v4 in Intel (PerfMon
> v2 in AMD), once it is enabled (pmu->passthrough = true), then global
> ctrl _must_ exist phyiscally. Regardless of whether we expose it to
> the guest VM, at reset time, we need to ensure enabling bits for GP
> counters are set (behind the screen). This is critical for AMD, since
> most of the guests are usually in (AMD) PerfMon v1 in which global
> ctrl MSR is inaccessible, but does exist and is operating in HW.
>
> Yes, if we eliminate that requirement (pmu->passthrough -> Perfmon v4
> Intel / Perfmon v2 AMD), then this code will have to change. However,
Yeah, that's what I'm worrying about. We have discussed supporting mediated
vPMU on HW below perfmon v4. When someone implements this, they may not
notice that this place needs to be changed as well; this introduces a potential
bug and we should avoid it.
> that is currently not in our RFCv2.
>
> Thanks.
> -Mingwei
>
>
>
>
>
>
>
>>
>>> pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
>>> }
>>>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-08 6:27 ` Mi, Dapeng
@ 2024-05-08 14:13 ` Sean Christopherson
2024-05-09 0:13 ` Mingwei Zhang
2024-05-09 0:38 ` Mi, Dapeng
0 siblings, 2 replies; 116+ messages in thread
From: Sean Christopherson @ 2024-05-08 14:13 UTC (permalink / raw)
To: Dapeng Mi
Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
On Wed, May 08, 2024, Dapeng Mi wrote:
>
> On 5/8/2024 12:36 PM, Mingwei Zhang wrote:
> > if (pmu->passthrough && pmu->nr_arch_gp_counters)
> >
> > Since mediated passthrough PMU requires PerfMon v4 in Intel (PerfMon
> > v2 in AMD), once it is enabled (pmu->passthrough = true), then global
> > ctrl _must_ exist phyiscally. Regardless of whether we expose it to
> > the guest VM, at reset time, we need to ensure enabling bits for GP
> > counters are set (behind the screen). This is critical for AMD, since
> > most of the guests are usually in (AMD) PerfMon v1 in which global
> > ctrl MSR is inaccessible, but does exist and is operating in HW.
> >
> > Yes, if we eliminate that requirement (pmu->passthrough -> Perfmon v4
> > Intel / Perfmon v2 AMD), then this code will have to change. However,
> Yeah, that's what I'm worried about. We previously discussed supporting
> mediated vPMU on HW below PerfMon v4. When someone implements that, they may
> not notice that this place needs to be changed as well, which introduces a
> potential bug, and we should avoid this.
Just add a WARN on the PMU version. I haven't thought much about whether or not
KVM should support mediated PMU for earlier hardware, but having a sanity check
on the assumptions of this code is reasonable even if we don't _plan_ on supporting
earlier hardware.
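For example, such a check could look roughly like this (a sketch of the suggestion, not code from this series; the exact version threshold is an assumption):
	/*
	 * Mediated passthrough vPMU assumes hardware with PERF_GLOBAL_CTRL,
	 * i.e. PerfMon v4+ on Intel or PerfMon v2+ on AMD.  Flag any future
	 * change that silently breaks that assumption.
	 */
	WARN_ON_ONCE(pmu->passthrough && kvm_pmu_cap.version < 2);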
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-08 14:13 ` Sean Christopherson
@ 2024-05-09 0:13 ` Mingwei Zhang
2024-05-09 0:30 ` Mi, Dapeng
2024-05-09 0:38 ` Mi, Dapeng
1 sibling, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-09 0:13 UTC (permalink / raw)
To: Sean Christopherson
Cc: Dapeng Mi, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
On Wed, May 8, 2024 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, May 08, 2024, Dapeng Mi wrote:
> >
> > On 5/8/2024 12:36 PM, Mingwei Zhang wrote:
> > > if (pmu->passthrough && pmu->nr_arch_gp_counters)
> > >
> > > Since mediated passthrough PMU requires PerfMon v4 on Intel (PerfMon
> > > v2 on AMD), once it is enabled (pmu->passthrough = true), the global
> > > ctrl MSR _must_ exist physically. Regardless of whether we expose it to
> > > the guest VM, at reset time we need to ensure the enable bits for GP
> > > counters are set (behind the scenes). This is critical for AMD, since
> > > most guests are usually on (AMD) PerfMon v1, in which the global
> > > ctrl MSR is inaccessible but does exist and is operating in HW.
> > >
> > > Yes, if we eliminate that requirement (pmu->passthrough -> Perfmon v4
> > > Intel / Perfmon v2 AMD), then this code will have to change. However,
> > Yeah, that's what I'm worried about. We previously discussed supporting
> > mediated vPMU on HW below PerfMon v4. When someone implements that, they may
> > not notice that this place needs to be changed as well, which introduces a
> > potential bug, and we should avoid this.
I think you might have worried too much about future problems, but
yes, things are under the radar. For Intel, this version constraint
might be OK, as PerfMon v4 dates back to Skylake, which is already pretty
early. For AMD, things are slightly different: PerfMon v2 on AMD requires
Genoa, which is pretty new. So this problem probably could become
something for AMD if they want to extend the new vPMU design to Milan,
but we will see how people think. One potential (easy) extension
for AMD is host PerfMon v1 + guest PerfMon v1 support for mediated
passthrough vPMU.
>
> Just add a WARN on the PMU version. I haven't thought much about whether or not
> KVM should support mediated PMU for earlier hardware, but having a sanity check
> on the assumptions of this code is reasonable even if we don't _plan_ on supporting
> earlier hardware.
Sure. That sounds pretty reasonable.
Thanks.
-Mingwei
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-09 0:13 ` Mingwei Zhang
@ 2024-05-09 0:30 ` Mi, Dapeng
0 siblings, 0 replies; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-09 0:30 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson
Cc: Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla,
Sandipan Das, Jim Mattson, Stephane Eranian, Ian Rogers,
Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
On 5/9/2024 8:13 AM, Mingwei Zhang wrote:
> On Wed, May 8, 2024 at 7:13 AM Sean Christopherson <seanjc@google.com> wrote:
>> On Wed, May 08, 2024, Dapeng Mi wrote:
>>> On 5/8/2024 12:36 PM, Mingwei Zhang wrote:
>>>> if (pmu->passthrough && pmu->nr_arch_gp_counters)
>>>>
>>>> Since mediated passthrough PMU requires PerfMon v4 on Intel (PerfMon
>>>> v2 on AMD), once it is enabled (pmu->passthrough = true), the global
>>>> ctrl MSR _must_ exist physically. Regardless of whether we expose it to
>>>> the guest VM, at reset time we need to ensure the enable bits for GP
>>>> counters are set (behind the scenes). This is critical for AMD, since
>>>> most guests are usually on (AMD) PerfMon v1, in which the global
>>>> ctrl MSR is inaccessible but does exist and is operating in HW.
>>>>
>>>> Yes, if we eliminate that requirement (pmu->passthrough -> Perfmon v4
>>>> Intel / Perfmon v2 AMD), then this code will have to change. However,
>>> Yeah, that's what I'm worried about. We previously discussed supporting
>>> mediated vPMU on HW below PerfMon v4. When someone implements that, they may
>>> not notice that this place needs to be changed as well, which introduces a
>>> potential bug, and we should avoid this.
> I think you might have worried too much about future problems, but
> yes, things are under the radar. For Intel, this version constraint
> might be ok as Perfmon v4 is skylake, which is already pretty early.
No, I don't think this is a redundant worry, since we did discuss this
requirement before and it could need to be supported in the future. We need
to consider the code's extensibility and avoid introducing potential issues.
>
> For AMD, things are slightly different, PerfMon v2 in AMD requires
> Genoa, which is pretty new. So, this problem probably could be
> something for AMD if they want to extend the new vPMU design to Milan,
> but we will see how people think. So one potential (easy) extension
> for AMD is host PerfMon v1 + guest PerfMon v1 support for mediated
> passthrough vPMU.
>
>> Just add a WARN on the PMU version. I haven't thought much about whether or not
>> KVM should support mediated PMU for earlier hardware, but having a sanity check
>> on the assumptions of this code is reasonable even if we don't _plan_ on supporting
>> earlier hardware.
> Sure. That sounds pretty reasonable.
Good for me.
>
> Thanks.
> -Mingwei
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode
2024-05-08 14:13 ` Sean Christopherson
2024-05-09 0:13 ` Mingwei Zhang
@ 2024-05-09 0:38 ` Mi, Dapeng
1 sibling, 0 replies; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-09 0:38 UTC (permalink / raw)
To: Sean Christopherson
Cc: Mingwei Zhang, Paolo Bonzini, Xiong Zhang, Kan Liang, Zhenyu Wang,
Manali Shukla, Sandipan Das, Jim Mattson, Stephane Eranian,
Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev, Samantha Alt,
Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
On 5/8/2024 10:13 PM, Sean Christopherson wrote:
> On Wed, May 08, 2024, Dapeng Mi wrote:
>> On 5/8/2024 12:36 PM, Mingwei Zhang wrote:
>>> if (pmu->passthrough && pmu->nr_arch_gp_counters)
>>>
>>> Since mediated passthrough PMU requires PerfMon v4 on Intel (PerfMon
>>> v2 on AMD), once it is enabled (pmu->passthrough = true), the global
>>> ctrl MSR _must_ exist physically. Regardless of whether we expose it to
>>> the guest VM, at reset time we need to ensure the enable bits for GP
>>> counters are set (behind the scenes). This is critical for AMD, since
>>> most guests are usually on (AMD) PerfMon v1, in which the global
>>> ctrl MSR is inaccessible but does exist and is operating in HW.
>>>
>>> Yes, if we eliminate that requirement (pmu->passthrough -> Perfmon v4
>>> Intel / Perfmon v2 AMD), then this code will have to change. However,
>> Yeah, that's what I'm worried about. We previously discussed supporting
>> mediated vPMU on HW below PerfMon v4. When someone implements that, they may
>> not notice that this place needs to be changed as well, which introduces a
>> potential bug, and we should avoid this.
> Just add a WARN on the PMU version. I haven't thought much about whether or not
> KVM should support mediated PMU for earlier hardware, but having a sanity check
> on the assumptions of this code is reasonable even if we don't _plan_ on supporting
> earlier hardware.
I have no preference on whether to support the old hardware (especially
Intel CPUs below v4, which have no GLOBAL_STATUS_SET MSR), but I think the
key question is whether we want to totally drop the emulated vPMU after the
mediated passthrough vPMU becomes mature. If so, we may have to support the
old platforms.
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 18/54] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (16 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 17/54] KVM: x86/pmu: Always set global enable bits in passthrough mode Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 19/54] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
` (36 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Add a vendor-neutral helper to check whether the passthrough PMU is enabled,
for convenience.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index cf93be5e7359..56ba0772568c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -48,6 +48,11 @@ struct kvm_pmu_ops {
void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
+static inline bool is_passthrough_pmu_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu_to_pmu(vcpu)->passthrough;
+}
+
static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
{
/*
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 19/54] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init()
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (17 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 18/54] KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
` (35 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Initialize host_perf_cap early in kvm_x86_vendor_init(). This lets KVM
recognize the HW PMU capabilities available to its guest VMs. This awareness
directly decides the feasibility of passing through RDPMC and indirectly
affects the performance of the PMU context switch: having the host PMU
feature set cached in host_perf_cap saves a rdmsrl() of the
IA32_PERF_CAPABILITIES MSR on each PMU context switch.
In addition, opportunistically remove the host_perf_cap initialization in
vmx_get_perf_capabilities() so the value does not depend on the module
parameter "enable_pmu".
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/vmx.c | 4 ----
arch/x86/kvm/x86.c | 6 ++++++
3 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 56ba0772568c..e041c8a23e2f 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -295,4 +295,5 @@ bool is_vmware_backdoor_pmc(u32 pmc_idx);
extern struct kvm_pmu_ops intel_pmu_ops;
extern struct kvm_pmu_ops amd_pmu_ops;
+extern u64 __read_mostly host_perf_cap;
#endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index af253cfa5c37..a5024b7b0439 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7868,14 +7868,10 @@ static u64 vmx_get_perf_capabilities(void)
{
u64 perf_cap = PMU_CAP_FW_WRITES;
struct x86_pmu_lbr lbr;
- u64 host_perf_cap = 0;
if (!enable_pmu)
return 0;
- if (boot_cpu_has(X86_FEATURE_PDCM))
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
-
if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR) &&
!enable_passthrough_pmu) {
x86_perf_get_lbr(&lbr);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4c289fcb34fe..db395c00955f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -245,6 +245,9 @@ EXPORT_SYMBOL_GPL(host_xss);
u64 __read_mostly host_arch_capabilities;
EXPORT_SYMBOL_GPL(host_arch_capabilities);
+u64 __read_mostly host_perf_cap;
+EXPORT_SYMBOL_GPL(host_perf_cap);
+
const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
KVM_GENERIC_VM_STATS(),
STATS_DESC_COUNTER(VM, mmu_shadow_zapped),
@@ -9772,6 +9775,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
rdmsrl(MSR_IA32_ARCH_CAPABILITIES, host_arch_capabilities);
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
+
r = ops->hardware_setup();
if (r != 0)
goto out_mmu_exit;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
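As a rough illustration of the saving, later hot-path checks can consult the cached value instead of re-reading the MSR; a minimal sketch (the helper name is made up, and PMU_CAP_PERF_METRICS is only introduced later in this series):
	/* Illustrative only: no rdmsrl() on the PMU context-switch path. */
	static inline bool host_has_perf_metrics(void)
	{
		return !!(host_perf_cap & PMU_CAP_PERF_METRICS);
	}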
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (18 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 19/54] KVM: x86/pmu: Add host_perf_cap and initialize it in kvm_x86_vendor_init() Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-08 21:55 ` Chen, Zide
2024-05-06 5:29 ` [PATCH v2 21/54] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
` (34 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Clear RDPMC_EXITING in the VMCS when all counters on the host side are exposed
to the guest VM. This improves performance for the passthrough PMU. However,
when the guest does not get all counters, intercept RDPMC to prevent access to
unexposed counters. Make the decision in vmx_vcpu_after_set_cpuid() when the
guest enables the PMU and the passthrough PMU is enabled.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.c | 16 ++++++++++++++++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/vmx.c | 5 +++++
3 files changed, 22 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e656f72fdace..19104e16a986 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -96,6 +96,22 @@ void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
#undef __KVM_X86_PMU_OP
}
+bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (is_passthrough_pmu_enabled(vcpu) &&
+ !enable_vmware_backdoor &&
+ pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
+ pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
+ pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
+ pmu->counter_bitmask[KVM_PMC_FIXED] == (((u64)1 << kvm_pmu_cap.bit_width_fixed) - 1))
+ return true;
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(kvm_pmu_check_rdpmc_passthrough);
+
static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e041c8a23e2f..91941a0f6e47 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -290,6 +290,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
+bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a5024b7b0439..a18ba5ae5376 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7860,6 +7860,11 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vmx->msr_ia32_feature_control_valid_bits &=
~FEAT_CTL_SGX_LC_ENABLED;
+ if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))
+ exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
+ else
+ exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
+
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
2024-05-06 5:29 ` [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
@ 2024-05-08 21:55 ` Chen, Zide
2024-05-30 5:20 ` Mingwei Zhang
0 siblings, 1 reply; 116+ messages in thread
From: Chen, Zide @ 2024-05-08 21:55 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a5024b7b0439..a18ba5ae5376 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7860,6 +7860,11 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> vmx->msr_ia32_feature_control_valid_bits &=
> ~FEAT_CTL_SGX_LC_ENABLED;
>
> + if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))
> + exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
> + else
> + exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
> +
It seems cleaner to put the code inside vmx_set_perf_global_ctrl()
and change the name to reflect that it handles all the PMU-related VMCS
controls.
> /* Refresh #PF interception to account for MAXPHYADDR changes. */
> vmx_update_exception_bitmap(vcpu);
> }
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest
2024-05-08 21:55 ` Chen, Zide
@ 2024-05-30 5:20 ` Mingwei Zhang
0 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 5:20 UTC (permalink / raw)
To: Chen, Zide
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Wed, May 08, 2024, Chen, Zide wrote:
>
>
> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index a5024b7b0439..a18ba5ae5376 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7860,6 +7860,11 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > vmx->msr_ia32_feature_control_valid_bits &=
> > ~FEAT_CTL_SGX_LC_ENABLED;
> >
> > + if (kvm_pmu_check_rdpmc_passthrough(&vmx->vcpu))
> > + exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
> > + else
> > + exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
> > +
>
> Seems it's cleaner to put the code inside vmx_set_perf_global_ctrl(),
> and change the name to reflect that it handles all the PMU related VMCS
> controls.
I prefer putting them separately, but I think I could put the above code
into a different helper.
Thanks.
-Mingwei
>
> > /* Refresh #PF interception to account for MAXPHYADDR changes. */
> > vmx_update_exception_bitmap(vcpu);
> > }
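A possible shape for that separate helper (the name and exact placement are assumptions, not from the series):
	/* Hypothetical helper, called from vmx_vcpu_after_set_cpuid(). */
	static void vmx_set_rdpmc_exiting(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		if (kvm_pmu_check_rdpmc_passthrough(vcpu))
			exec_controls_clearbit(vmx, CPU_BASED_RDPMC_EXITING);
		else
			exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
	}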
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 21/54] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (19 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 20/54] KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 22/54] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
` (33 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Dapeng Mi <dapeng1.mi@linux.intel.com>
Define the macro PMU_CAP_PERF_METRICS to represent bit[15] of the
MSR_IA32_PERF_CAPABILITIES MSR. This bit is used to indicate whether the
perf metrics feature is enabled.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/vmx/capabilities.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..d8317552b634 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
#define PT_MODE_HOST_GUEST 1
#define PMU_CAP_FW_WRITES (1ULL << 13)
+#define PMU_CAP_PERF_METRICS BIT_ULL(15)
#define PMU_CAP_LBR_FMT 0x3f
struct nested_vmx_msrs {
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 22/54] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (20 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 21/54] KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 23/54] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
` (32 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Introduce a vendor-specific API to check whether RDPMC passthrough is allowed.
RDPMC passthrough requires the guest VM to have full ownership of all
counters. These include the general purpose counters, the fixed counters, and
some vendor-specific MSRs such as PERF_METRICS. Since the PERF_METRICS MSR is
Intel specific, put the check into vendor-specific code.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
arch/x86/kvm/pmu.c | 1 +
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/svm/pmu.c | 6 ++++++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
5 files changed, 25 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index f852b13aeefe..fd986d5146e4 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -20,6 +20,7 @@ KVM_X86_PMU_OP(get_msr)
KVM_X86_PMU_OP(set_msr)
KVM_X86_PMU_OP(refresh)
KVM_X86_PMU_OP(init)
+KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
KVM_X86_PMU_OP_OPTIONAL(reset)
KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
KVM_X86_PMU_OP_OPTIONAL(cleanup)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 19104e16a986..3afefe4cf6e2 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -102,6 +102,7 @@ bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu)
if (is_passthrough_pmu_enabled(vcpu) &&
!enable_vmware_backdoor &&
+ static_call(kvm_x86_pmu_is_rdpmc_passthru_allowed)(vcpu) &&
pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed &&
pmu->counter_bitmask[KVM_PMC_GP] == (((u64)1 << kvm_pmu_cap.bit_width_gp) - 1) &&
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 91941a0f6e47..e1af6d07b191 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -40,6 +40,7 @@ struct kvm_pmu_ops {
void (*reset)(struct kvm_vcpu *vcpu);
void (*deliver_pmi)(struct kvm_vcpu *vcpu);
void (*cleanup)(struct kvm_vcpu *vcpu);
+ bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
const u64 EVENTSEL_EVENT;
const int MAX_NR_GP_COUNTERS;
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index dfcc38bd97d3..6b471b1ec9b8 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -228,6 +228,11 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
}
}
+static bool amd_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
+{
+ return true;
+}
+
struct kvm_pmu_ops amd_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -237,6 +242,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
.set_msr = amd_pmu_set_msr,
.refresh = amd_pmu_refresh,
.init = amd_pmu_init,
+ .is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0ed71f825e6b..ed79cbba1edc 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -725,6 +725,21 @@ void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
}
}
+static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Per Intel SDM vol. 2 for RDPMC, MSR_PERF_METRICS is accessible by
+ * with type 0x2000 in ECX[31:16], while the index value in ECX[15:0] is
+ * implementation specific. Therefore, if the host has this MSR, but
+ * does not expose it to the guest, RDPMC has to be intercepted.
+ */
+ if ((host_perf_cap & PMU_CAP_PERF_METRICS) &&
+ !(vcpu_get_perf_capabilities(vcpu) & PMU_CAP_PERF_METRICS))
+ return false;
+
+ return true;
+}
+
struct kvm_pmu_ops intel_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -736,6 +751,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
.reset = intel_pmu_reset,
.deliver_pmi = intel_pmu_deliver_pmi,
.cleanup = intel_pmu_cleanup,
+ .is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = 1,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
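For context, this is roughly how PERF_METRICS is read through RDPMC per the encoding described in the comment above (a sketch; the index 0 in ECX[15:0] is an assumption, since the SDM calls the index implementation specific):
	static inline u64 rdpmc_perf_metrics(void)
	{
		u32 lo, hi;
		u32 ecx = 0x2000u << 16;	/* type 0x2000 in ECX[31:16], index 0 */

		asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (ecx));
		return ((u64)hi << 32) | lo;
	}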
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 23/54] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (21 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 22/54] KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough allowed Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
` (31 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
In PMU passthrough mode, there are three requirements for managing
IA32_PERF_GLOBAL_CTRL:
- the guest IA32_PERF_GLOBAL_CTRL MSR must be saved at VM-exit.
- the IA32_PERF_GLOBAL_CTRL MSR must be cleared at VM-exit to prevent any
counter from running within the KVM run loop.
- the guest IA32_PERF_GLOBAL_CTRL MSR must be restored at VM-entry.
Introduce a vmx_set_perf_global_ctrl() function to automatically switch
IA32_PERF_GLOBAL_CTRL and invoke it after the VMM finishes setting up the
CPUID bits.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/vmx.h | 1 +
arch/x86/kvm/vmx/vmx.c | 117 +++++++++++++++++++++++++++++++------
arch/x86/kvm/vmx/vmx.h | 3 +-
3 files changed, 103 insertions(+), 18 deletions(-)
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4dba17363008..9b363e30e40c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -105,6 +105,7 @@
#define VM_EXIT_CLEAR_BNDCFGS 0x00800000
#define VM_EXIT_PT_CONCEAL_PIP 0x01000000
#define VM_EXIT_CLEAR_IA32_RTIT_CTL 0x02000000
+#define VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL 0x40000000
#define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR 0x00036dff
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a18ba5ae5376..c9de7d2623b8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4385,6 +4385,97 @@ static u32 vmx_pin_based_exec_ctrl(struct vcpu_vmx *vmx)
return pin_based_exec_ctrl;
}
+static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
+{
+ u32 vmentry_ctrl = vm_entry_controls_get(vmx);
+ u32 vmexit_ctrl = vm_exit_controls_get(vmx);
+ struct vmx_msrs *m;
+ int i;
+
+ if (cpu_has_perf_global_ctrl_bug() ||
+ !is_passthrough_pmu_enabled(&vmx->vcpu)) {
+ vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+ vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+ vmexit_ctrl &= ~VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL;
+ }
+
+ if (is_passthrough_pmu_enabled(&vmx->vcpu)) {
+ /*
+ * Setup auto restore guest PERF_GLOBAL_CTRL MSR at vm entry.
+ */
+ if (vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
+ vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, 0);
+ } else {
+ m = &vmx->msr_autoload.guest;
+ i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+ if (i < 0) {
+ i = m->nr++;
+ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
+ }
+ m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+ m->val[i].value = 0;
+ }
+ /*
+ * Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
+ */
+ if (vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) {
+ vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, 0);
+ } else {
+ m = &vmx->msr_autoload.host;
+ i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+ if (i < 0) {
+ i = m->nr++;
+ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+ }
+ m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+ m->val[i].value = 0;
+ }
+ /*
+ * Setup auto save guest PERF_GLOBAL_CTRL msr at vm exit
+ */
+ if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
+ m = &vmx->msr_autostore.guest;
+ i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+ if (i < 0) {
+ i = m->nr++;
+ vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
+ }
+ m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+ }
+ } else {
+ if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
+ m = &vmx->msr_autoload.guest;
+ i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+ if (i >= 0) {
+ m->nr--;
+ m->val[i] = m->val[m->nr];
+ vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
+ }
+ }
+ if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
+ m = &vmx->msr_autoload.host;
+ i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+ if (i >= 0) {
+ m->nr--;
+ m->val[i] = m->val[m->nr];
+ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->nr);
+ }
+ }
+ if (!(vmexit_ctrl & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)) {
+ m = &vmx->msr_autostore.guest;
+ i = vmx_find_loadstore_msr_slot(m, MSR_CORE_PERF_GLOBAL_CTRL);
+ if (i >= 0) {
+ m->nr--;
+ m->val[i] = m->val[m->nr];
+ vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
+ }
+ }
+ }
+
+ vm_entry_controls_set(vmx, vmentry_ctrl);
+ vm_exit_controls_set(vmx, vmexit_ctrl);
+}
+
static u32 vmx_vmentry_ctrl(void)
{
u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;
@@ -4392,17 +4483,10 @@ static u32 vmx_vmentry_ctrl(void)
if (vmx_pt_mode_is_system())
vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
VM_ENTRY_LOAD_IA32_RTIT_CTL);
- /*
- * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
- */
- vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
- VM_ENTRY_LOAD_IA32_EFER |
- VM_ENTRY_IA32E_MODE);
-
- if (cpu_has_perf_global_ctrl_bug())
- vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
-
- return vmentry_ctrl;
+ /*
+ * IA32e mode, and loading of EFER is toggled dynamically.
+ */
+ return vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_IA32E_MODE);
}
static u32 vmx_vmexit_ctrl(void)
@@ -4420,12 +4504,8 @@ static u32 vmx_vmexit_ctrl(void)
vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
VM_EXIT_CLEAR_IA32_RTIT_CTL);
- if (cpu_has_perf_global_ctrl_bug())
- vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
-
- /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
- return vmexit_ctrl &
- ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+ /* Loading of EFER is toggled dynamically */
+ return vmexit_ctrl & ~VM_EXIT_LOAD_IA32_EFER;
}
static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -4763,6 +4843,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
vmcs_write64(VM_FUNCTION_CONTROL, 0);
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
+ vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autostore.guest.val));
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val));
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
@@ -7865,6 +7946,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
else
exec_controls_setbit(vmx, CPU_BASED_RDPMC_EXITING);
+ vmx_set_perf_global_ctrl(vmx);
+
/* Refresh #PF interception to account for MAXPHYADDR changes. */
vmx_update_exception_bitmap(vcpu);
}
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 65786dbe7d60..1b56fac35656 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -504,7 +504,8 @@ static inline u8 vmx_get_rvi(void)
VM_EXIT_LOAD_IA32_EFER | \
VM_EXIT_CLEAR_BNDCFGS | \
VM_EXIT_PT_CONCEAL_PIP | \
- VM_EXIT_CLEAR_IA32_RTIT_CTL)
+ VM_EXIT_CLEAR_IA32_RTIT_CTL | \
+ VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL)
#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
(PIN_BASED_EXT_INTR_MASK | \
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (22 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 23/54] KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-08 22:03 ` Chen, Zide
2024-05-06 5:29 ` [PATCH v2 25/54] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
` (30 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
interception.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
arch/x86/kvm/cpuid.c | 4 ++++
arch/x86/kvm/pmu.c | 5 +++++
arch/x86/kvm/pmu.h | 2 ++
4 files changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index fd986d5146e4..1b7876dcb3c3 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -24,6 +24,7 @@ KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
KVM_X86_PMU_OP_OPTIONAL(reset)
KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
KVM_X86_PMU_OP_OPTIONAL(cleanup)
+KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
#undef KVM_X86_PMU_OP
#undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 77352a4abd87..b577ba649feb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -381,6 +381,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
kvm_pmu_refresh(vcpu);
+
+ if (is_passthrough_pmu_enabled(vcpu))
+ kvm_pmu_passthrough_pmu_msrs(vcpu);
+
vcpu->arch.cr4_guest_rsvd_bits =
__cr4_reserved_bits(guest_cpuid_has, vcpu);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 3afefe4cf6e2..bd94f2d67f5c 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1059,3 +1059,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
kfree(filter);
return r;
}
+
+void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+ static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e1af6d07b191..63f876557716 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,7 @@ struct kvm_pmu_ops {
void (*deliver_pmi)(struct kvm_vcpu *vcpu);
void (*cleanup)(struct kvm_vcpu *vcpu);
bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
+ void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
const u64 EVENTSEL_EVENT;
const int MAX_NR_GP_COUNTERS;
@@ -292,6 +293,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
+void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception
2024-05-06 5:29 ` [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
@ 2024-05-08 22:03 ` Chen, Zide
2024-05-30 5:24 ` Mingwei Zhang
0 siblings, 1 reply; 116+ messages in thread
From: Chen, Zide @ 2024-05-08 22:03 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
> interception.
>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
> arch/x86/kvm/cpuid.c | 4 ++++
> arch/x86/kvm/pmu.c | 5 +++++
> arch/x86/kvm/pmu.h | 2 ++
> 4 files changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> index fd986d5146e4..1b7876dcb3c3 100644
> --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> @@ -24,6 +24,7 @@ KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
> KVM_X86_PMU_OP_OPTIONAL(reset)
> KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
> KVM_X86_PMU_OP_OPTIONAL(cleanup)
> +KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
>
> #undef KVM_X86_PMU_OP
> #undef KVM_X86_PMU_OP_OPTIONAL
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 77352a4abd87..b577ba649feb 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -381,6 +381,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>
> kvm_pmu_refresh(vcpu);
> +
> + if (is_passthrough_pmu_enabled(vcpu))
> + kvm_pmu_passthrough_pmu_msrs(vcpu);
> +
> vcpu->arch.cr4_guest_rsvd_bits =
> __cr4_reserved_bits(guest_cpuid_has, vcpu);
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 3afefe4cf6e2..bd94f2d67f5c 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -1059,3 +1059,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
> kfree(filter);
> return r;
> }
> +
> +void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> +{
> + static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
> +}
I don't quite understand why a separate callback is needed. It seems it
wouldn't be any messier to put this logic in the kvm_x86_vcpu_after_set_cpuid()
callback.
^ permalink raw reply [flat|nested] 116+ messages in thread* Re: [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception
2024-05-08 22:03 ` Chen, Zide
@ 2024-05-30 5:24 ` Mingwei Zhang
0 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 5:24 UTC (permalink / raw)
To: Chen, Zide
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Wed, May 08, 2024, Chen, Zide wrote:
>
>
> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> > Add one extra pmu function prototype in kvm_pmu_ops to disable PMU MSR
> > interception.
> >
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> > arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
> > arch/x86/kvm/cpuid.c | 4 ++++
> > arch/x86/kvm/pmu.c | 5 +++++
> > arch/x86/kvm/pmu.h | 2 ++
> > 4 files changed, 12 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> > index fd986d5146e4..1b7876dcb3c3 100644
> > --- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
> > @@ -24,6 +24,7 @@ KVM_X86_PMU_OP(is_rdpmc_passthru_allowed)
> > KVM_X86_PMU_OP_OPTIONAL(reset)
> > KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
> > KVM_X86_PMU_OP_OPTIONAL(cleanup)
> > +KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
> >
> > #undef KVM_X86_PMU_OP
> > #undef KVM_X86_PMU_OP_OPTIONAL
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 77352a4abd87..b577ba649feb 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -381,6 +381,10 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
> >
> > kvm_pmu_refresh(vcpu);
> > +
> > + if (is_passthrough_pmu_enabled(vcpu))
> > + kvm_pmu_passthrough_pmu_msrs(vcpu);
> > +
> > vcpu->arch.cr4_guest_rsvd_bits =
> > __cr4_reserved_bits(guest_cpuid_has, vcpu);
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 3afefe4cf6e2..bd94f2d67f5c 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -1059,3 +1059,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
> > kfree(filter);
> > return r;
> > }
> > +
> > +void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> > +{
> > + static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
> > +}
>
> I don't quite understand why a separate callback is needed. It seems it
> wouldn't be any messier to put this logic in the kvm_x86_vcpu_after_set_cpuid()
> callback.
One of the key points here is whether we _need_ to intercept RDPMC. We have
to intercept it if there are _any_ counters / MSRs accessible via RDPMC that
are not exposed to the guest. On Intel CPUs, the PERF_METRICS MSR is
accessible via RDPMC, and this MSR is a vendor specific one. So that's why
we added another vendor API.
Thanks.
-Mingwei
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 25/54] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (23 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 24/54] KVM: x86/pmu: Create a function prototype to disable MSR interception Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
` (29 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Dapeng Mi <dapeng1.mi@linux.intel.com>
Event selectors for the GP counters and the fixed counter control MSR are
intercepted for the purpose of security, i.e., preventing the guest from using
disallowed events to steal information or take advantage of any CPU errata.
Other than the event selectors, disable interception of the PMU counter MSRs
specified in the guest CPUID; counter MSRs with an index outside the exposed
range will still be intercepted.
Global registers like global_ctrl will be passed through only if the PMU
version is greater than 1.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
---
arch/x86/kvm/cpuid.c | 3 +--
arch/x86/kvm/vmx/pmu_intel.c | 47 ++++++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b577ba649feb..99e6cca67beb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -382,8 +382,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
kvm_pmu_refresh(vcpu);
- if (is_passthrough_pmu_enabled(vcpu))
- kvm_pmu_passthrough_pmu_msrs(vcpu);
+ kvm_pmu_passthrough_pmu_msrs(vcpu);
vcpu->arch.cr4_guest_rsvd_bits =
__cr4_reserved_bits(guest_cpuid_has, vcpu);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ed79cbba1edc..8e8d1f2aa5e5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -740,6 +740,52 @@ static bool intel_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
return true;
}
+/*
+ * Setup PMU MSR interception for both mediated passthrough vPMU and legacy
+ * emulated vPMU. Note that this function is called after each time userspace
+ * set CPUID.
+ */
+static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+ bool msr_intercept = !is_passthrough_pmu_enabled(vcpu);
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ int i;
+
+ /*
+ * Unexposed PMU MSRs are intercepted by default. However,
+ * KVM_SET_CPUID{,2} may be invoked multiple times. To ensure MSR
+ * interception is correct after each call of setting CPUID, explicitly
+ * touch msr bitmap for each PMU MSR.
+ */
+ for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
+ if (i >= pmu->nr_arch_gp_counters)
+ msr_intercept = true;
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PERFCTR0 + i, MSR_TYPE_RW, msr_intercept);
+ if (fw_writes_is_enabled(vcpu))
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, msr_intercept);
+ else
+ vmx_set_intercept_for_msr(vcpu, MSR_IA32_PMC0 + i, MSR_TYPE_RW, true);
+ }
+
+ msr_intercept = !is_passthrough_pmu_enabled(vcpu);
+ for (i = 0; i < kvm_pmu_cap.num_counters_fixed; i++) {
+ if (i >= pmu->nr_arch_fixed_counters)
+ msr_intercept = true;
+ vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_FIXED_CTR0 + i, MSR_TYPE_RW, msr_intercept);
+ }
+
+ if (pmu->version > 1 && is_passthrough_pmu_enabled(vcpu) &&
+ pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp &&
+ pmu->nr_arch_fixed_counters == kvm_pmu_cap.num_counters_fixed)
+ msr_intercept = false;
+ else
+ msr_intercept = true;
+
+ vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_STATUS, MSR_TYPE_RW, msr_intercept);
+ vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, MSR_TYPE_RW, msr_intercept);
+ vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
+}
+
struct kvm_pmu_ops intel_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -752,6 +798,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
.deliver_pmi = intel_pmu_deliver_pmi,
.cleanup = intel_pmu_cleanup,
.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
+ .passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = 1,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (24 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 25/54] KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU MSRs Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-08 21:48 ` Chen, Zide
2024-05-06 5:29 ` [PATCH v2 27/54] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
` (28 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
when passthrough vPMU is enabled. Note that even when passthrough vPMU is
enabled, global_ctrl may still be intercepted if guest VM only sees a
subset of the counters.
Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kvm/pmu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index bd94f2d67f5c..e9047051489e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (pmu->global_ctrl != data) {
diff = pmu->global_ctrl ^ data;
pmu->global_ctrl = data;
- reprogram_counters(pmu, diff);
+ if (!is_passthrough_pmu_enabled(vcpu))
+ reprogram_counters(pmu, diff);
}
break;
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
2024-05-06 5:29 ` [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
@ 2024-05-08 21:48 ` Chen, Zide
2024-05-09 0:43 ` Mi, Dapeng
0 siblings, 1 reply; 116+ messages in thread
From: Chen, Zide @ 2024-05-08 21:48 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
> when passthrough vPMU is enabled. Note that even when passthrough vPMU is
> enabled, global_ctrl may still be intercepted if guest VM only sees a
> subset of the counters.
>
> Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
> arch/x86/kvm/pmu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index bd94f2d67f5c..e9047051489e 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> if (pmu->global_ctrl != data) {
> diff = pmu->global_ctrl ^ data;
> pmu->global_ctrl = data;
> - reprogram_counters(pmu, diff);
> + if (!is_passthrough_pmu_enabled(vcpu))
> + reprogram_counters(pmu, diff);
Since in [PATCH 44/54], reprogram_counters() is effectively skipped in
the passthrough case, is this patch still needed?
> }
> break;
> case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
2024-05-08 21:48 ` Chen, Zide
@ 2024-05-09 0:43 ` Mi, Dapeng
2024-05-09 1:29 ` Chen, Zide
0 siblings, 1 reply; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-09 0:43 UTC (permalink / raw)
To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/9/2024 5:48 AM, Chen, Zide wrote:
>
> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
>> Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
>> when passthrough vPMU is enabled. Note that even when passthrough vPMU is
>> enabled, global_ctrl may still be intercepted if guest VM only sees a
>> subset of the counters.
>>
>> Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>> arch/x86/kvm/pmu.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index bd94f2d67f5c..e9047051489e 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> if (pmu->global_ctrl != data) {
>> diff = pmu->global_ctrl ^ data;
>> pmu->global_ctrl = data;
>> - reprogram_counters(pmu, diff);
>> + if (!is_passthrough_pmu_enabled(vcpu))
>> + reprogram_counters(pmu, diff);
> Since in [PATCH 44/54], reprogram_counters() is effectively skipped in
> the passthrough case, is this patch still needed?
Zide, reprogram_counters() and reprogram_counter() are two different
helpers. Both of them need to be skipped in passthrough mode.
>
>> }
>> break;
>> case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
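To make the distinction concrete: reprogram_counters() only fans out over the changed global_ctrl bits and defers the real work, while reprogram_counter() reprograms a single counter. Roughly (a paraphrased sketch, not the exact code in the tree):
	static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
	{
		int bit;

		if (!diff)
			return;

		/* Mark each counter whose global enable bit changed ... */
		for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX)
			set_bit(bit, pmu->reprogram_pmi);

		/* ... and let KVM reprogram them on the next KVM_REQ_PMU. */
		kvm_make_request(KVM_REQ_PMU, pmu_to_vcpu(pmu));
	}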
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
2024-05-09 0:43 ` Mi, Dapeng
@ 2024-05-09 1:29 ` Chen, Zide
2024-05-09 2:58 ` Mi, Dapeng
0 siblings, 1 reply; 116+ messages in thread
From: Chen, Zide @ 2024-05-09 1:29 UTC (permalink / raw)
To: Mi, Dapeng, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/8/2024 5:43 PM, Mi, Dapeng wrote:
>
> On 5/9/2024 5:48 AM, Chen, Zide wrote:
>>
>> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
>>> Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
>>> when passthrough vPMU is enabled. Note that even when passthrough vPMU is
>>> enabled, global_ctrl may still be intercepted if guest VM only sees a
>>> subset of the counters.
>>>
>>> Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>> ---
>>> arch/x86/kvm/pmu.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>> index bd94f2d67f5c..e9047051489e 100644
>>> --- a/arch/x86/kvm/pmu.c
>>> +++ b/arch/x86/kvm/pmu.c
>>> @@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>> if (pmu->global_ctrl != data) {
>>> diff = pmu->global_ctrl ^ data;
>>> pmu->global_ctrl = data;
>>> - reprogram_counters(pmu, diff);
>>> + if (!is_passthrough_pmu_enabled(vcpu))
>>> + reprogram_counters(pmu, diff);
>> Since in [PATCH 44/54], reprogram_counters() is effectively skipped in
>> the passthrough case, is this patch still needed?
> Zide, reprogram_counters() and reprogram_counter() are two different
> helpers. Both they need to be skipped in passthrough mode.
Yes, but this patch is talking about reprogram_counters() only. Passthrough
mode is being checked both inside and outside the function call, which is
redundant.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
2024-05-09 1:29 ` Chen, Zide
@ 2024-05-09 2:58 ` Mi, Dapeng
2024-05-30 5:28 ` Mingwei Zhang
0 siblings, 1 reply; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-09 2:58 UTC (permalink / raw)
To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/9/2024 9:29 AM, Chen, Zide wrote:
>
> On 5/8/2024 5:43 PM, Mi, Dapeng wrote:
>> On 5/9/2024 5:48 AM, Chen, Zide wrote:
>>> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
>>>> Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
>>>> when passthrough vPMU is enabled. Note that even when passthrough vPMU is
>>>> enabled, global_ctrl may still be intercepted if guest VM only sees a
>>>> subset of the counters.
>>>>
>>>> Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>>>> ---
>>>> arch/x86/kvm/pmu.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>>>> index bd94f2d67f5c..e9047051489e 100644
>>>> --- a/arch/x86/kvm/pmu.c
>>>> +++ b/arch/x86/kvm/pmu.c
>>>> @@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>> if (pmu->global_ctrl != data) {
>>>> diff = pmu->global_ctrl ^ data;
>>>> pmu->global_ctrl = data;
>>>> - reprogram_counters(pmu, diff);
>>>> + if (!is_passthrough_pmu_enabled(vcpu))
>>>> + reprogram_counters(pmu, diff);
>>> Since in [PATCH 44/54], reprogram_counters() is effectively skipped in
>>> the passthrough case, is this patch still needed?
>> Zide, reprogram_counters() and reprogram_counter() are two different
>> helpers. Both they need to be skipped in passthrough mode.
> Yes, but this is talking about reprogram_counters() only. passthrough
> mode is being checked inside and outside the function call, which is
> redundant.
Oh, yes. I don't need this patch then.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU
2024-05-09 2:58 ` Mi, Dapeng
@ 2024-05-30 5:28 ` Mingwei Zhang
0 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 5:28 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Chen, Zide, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Thu, May 09, 2024, Mi, Dapeng wrote:
>
> On 5/9/2024 9:29 AM, Chen, Zide wrote:
> >
> > On 5/8/2024 5:43 PM, Mi, Dapeng wrote:
> >> On 5/9/2024 5:48 AM, Chen, Zide wrote:
> >>> On 5/5/2024 10:29 PM, Mingwei Zhang wrote:
> >>>> Avoid calling into legacy/emulated vPMU logic such as reprogram_counters()
> >>>> when passthrough vPMU is enabled. Note that even when passthrough vPMU is
> >>>> enabled, global_ctrl may still be intercepted if guest VM only sees a
> >>>> subset of the counters.
> >>>>
> >>>> Suggested-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >>>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> >>>> ---
> >>>> arch/x86/kvm/pmu.c | 3 ++-
> >>>> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> >>>> index bd94f2d67f5c..e9047051489e 100644
> >>>> --- a/arch/x86/kvm/pmu.c
> >>>> +++ b/arch/x86/kvm/pmu.c
> >>>> @@ -713,7 +713,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >>>> if (pmu->global_ctrl != data) {
> >>>> diff = pmu->global_ctrl ^ data;
> >>>> pmu->global_ctrl = data;
> >>>> - reprogram_counters(pmu, diff);
> >>>> + if (!is_passthrough_pmu_enabled(vcpu))
> >>>> + reprogram_counters(pmu, diff);
> >>> Since in [PATCH 44/54], reprogram_counters() is effectively skipped in
> >>> the passthrough case, is this patch still needed?
> >> Zide, reprogram_counters() and reprogram_counter() are two different
> >> helpers. Both they need to be skipped in passthrough mode.
> > Yes, but this is talking about reprogram_counters() only. passthrough
> > mode is being checked inside and outside the function call, which is
> > redundant.
> Oh, yes. I don't need this patch then.
Right. I am thinking about dropping [PATCH 44/54], since that one
contains some redundant checking. I will see how this should be fixed in
the next version. Thanks for pointing it out though.
-Mingwei
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 27/54] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (25 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 26/54] KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in passthrough vPMU Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-14 7:33 ` Mi, Dapeng
2024-05-06 5:29 ` [PATCH v2 28/54] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
` (27 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Explicitly reject PMU MSRs in vmx_get_passthrough_msr_slot(), since
interception of PMU MSRs is handled specially in
intel_passthrough_pmu_msrs().
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/vmx/vmx.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c9de7d2623b8..62b5913abdd6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -164,7 +164,7 @@ module_param(enable_passthrough_pmu, bool, 0444);
/*
* List of MSRs that can be directly passed to the guest.
- * In addition to these x2apic, PT and LBR MSRs are handled specially.
+ * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
*/
static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
MSR_IA32_SPEC_CTRL,
@@ -694,6 +694,13 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
+ case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
+ case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
+ case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:
+ case MSR_CORE_PERF_GLOBAL_STATUS:
+ case MSR_CORE_PERF_GLOBAL_CTRL:
+ case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ /* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
return -ENOENT;
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 27/54] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
2024-05-06 5:29 ` [PATCH v2 27/54] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
@ 2024-05-14 7:33 ` Mi, Dapeng
0 siblings, 0 replies; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-14 7:33 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/6/2024 1:29 PM, Mingwei Zhang wrote:
> Reject PMU MSRs interception explicitly in
> vmx_get_passthrough_msr_slot() since interception of PMU MSRs are
> specially handled in intel_passthrough_pmu_msrs().
>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c9de7d2623b8..62b5913abdd6 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -164,7 +164,7 @@ module_param(enable_passthrough_pmu, bool, 0444);
>
> /*
> * List of MSRs that can be directly passed to the guest.
> - * In addition to these x2apic, PT and LBR MSRs are handled specially.
> + * In addition to these x2apic, PMU, PT and LBR MSRs are handled specially.
> */
> static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
> MSR_IA32_SPEC_CTRL,
> @@ -694,6 +694,13 @@ static int vmx_get_passthrough_msr_slot(u32 msr)
> case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 8:
> case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> /* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> + case MSR_IA32_PMC0 ... MSR_IA32_PMC0 + 7:
> + case MSR_IA32_PERFCTR0 ... MSR_IA32_PERFCTR0 + 7:
> + case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR0 + 2:
We'd better use helpers to get the maximum supported number of GP and fixed
counters instead of magic numbers here. There will be more GP and fixed
counters in future Intel CPUs.
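A minimal sketch of what that could look like, assuming kvm_pmu_cap is
visible here; it uses a runtime range check rather than case labels, since
case ranges require compile-time constants (illustration only, not the
posted patch):

        /* Sketch: bound the PMU MSR ranges by the reported counter counts
         * instead of hard-coding 8 GP and 3 fixed counters. */
        if ((msr >= MSR_IA32_PMC0 &&
             msr < MSR_IA32_PMC0 + kvm_pmu_cap.num_counters_gp) ||
            (msr >= MSR_IA32_PERFCTR0 &&
             msr < MSR_IA32_PERFCTR0 + kvm_pmu_cap.num_counters_gp) ||
            (msr >= MSR_CORE_PERF_FIXED_CTR0 &&
             msr < MSR_CORE_PERF_FIXED_CTR0 + kvm_pmu_cap.num_counters_fixed))
                return -ENOENT; /* handled in intel_passthrough_pmu_msrs() */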
> + case MSR_CORE_PERF_GLOBAL_STATUS:
> + case MSR_CORE_PERF_GLOBAL_CTRL:
> + case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> + /* PMU MSRs. These are handled in intel_passthrough_pmu_msrs() */
> return -ENOENT;
> }
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 28/54] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (26 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 27/54] KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot() Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 29/54] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Mingwei Zhang
` (26 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Add the MSR indices for both the selector and the counter to each kvm_pmc.
This gives the mediated passthrough vPMU a convenient way to query the MSRs
for a given pmc. Note that the legacy vPMU does not need this because it
never directly accesses PMU MSRs; instead, each kvm_pmc is bound to a
perf_event.
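For illustration, this is the kind of vendor-agnostic access these fields
enable in the context-switch code added later in the series (a hypothetical
helper, not part of this patch):

        /* Hypothetical helper: save one GP counter's selector and count
         * through the cached MSR indices, without any vendor-specific MSR
         * knowledge at the call site. */
        static void example_save_gp_counter(struct kvm_pmc *pmc)
        {
                rdmsrl(pmc->msr_counter, pmc->counter);
                rdmsrl(pmc->msr_eventsel, pmc->eventsel);
        }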
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/svm/pmu.c | 13 +++++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 13 +++++++++++++
3 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 19b924c3bd85..8b4ea9bdcc74 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -519,6 +519,8 @@ struct kvm_pmc {
*/
u64 emulated_counter;
u64 eventsel;
+ u64 msr_counter;
+ u64 msr_eventsel;
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
/*
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 6b471b1ec9b8..447657513729 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -177,6 +177,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
union cpuid_0x80000022_ebx ebx;
+ int i;
pmu->version = 1;
if (guest_cpuid_has(vcpu, X86_FEATURE_PERFMON_V2)) {
@@ -210,6 +211,18 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->nr_arch_fixed_counters = 0;
bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+
+ if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmu->gp_counters[i].msr_eventsel = MSR_F15H_PERF_CTL0 + 2*i;
+ pmu->gp_counters[i].msr_counter = MSR_F15H_PERF_CTR0 + 2*i;
+ }
+ } else {
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmu->gp_counters[i].msr_eventsel = MSR_K7_EVNTSEL0 + i;
+ pmu->gp_counters[i].msr_counter = MSR_K7_PERFCTR0 + i;
+ }
+ }
}
static void amd_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 8e8d1f2aa5e5..7852ba25a240 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -562,6 +562,19 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
~((1ull << pmu->nr_arch_gp_counters) - 1);
}
}
+
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmu->gp_counters[i].msr_eventsel = MSR_P6_EVNTSEL0 + i;
+ if (fw_writes_is_enabled(vcpu))
+ pmu->gp_counters[i].msr_counter = MSR_IA32_PMC0 + i;
+ else
+ pmu->gp_counters[i].msr_counter = MSR_IA32_PERFCTR0 + i;
+ }
+
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
+ pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
+ }
}
static void intel_pmu_init(struct kvm_vcpu *vcpu)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 29/54] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (27 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 28/54] KVM: x86/pmu: Add counter MSR and selector MSR index into struct kvm_pmc Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
` (25 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Plumb two extra functions through kvm_pmu_ops to allow the PMU context
switch.
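As a rough illustration of how these hooks are meant to be used once later
patches wire them up (hypothetical call sites, not part of this patch):

        /* Hypothetical call sites: guest PMU state is restored when
         * entering the passthrough-PMU scope and saved when leaving it;
         * both run with IRQs disabled, hence the lockdep assertions in the
         * common code below. */
        static void example_switch_to_guest_pmu(struct kvm_vcpu *vcpu)
        {
                kvm_pmu_restore_pmu_context(vcpu);
        }
        static void example_switch_to_host_pmu(struct kvm_vcpu *vcpu)
        {
                kvm_pmu_save_pmu_context(vcpu);
        }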
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 ++
arch/x86/kvm/pmu.c | 14 ++++++++++++++
arch/x86/kvm/pmu.h | 4 ++++
3 files changed, 20 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 1b7876dcb3c3..1a848ba6a7a7 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -25,6 +25,8 @@ KVM_X86_PMU_OP_OPTIONAL(reset)
KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
KVM_X86_PMU_OP_OPTIONAL(cleanup)
KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
+KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
+KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
#undef KVM_X86_PMU_OP
#undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e9047051489e..782b564bdf96 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1065,3 +1065,17 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
{
static_call_cond(kvm_x86_pmu_passthrough_pmu_msrs)(vcpu);
}
+
+void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
+}
+
+void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 63f876557716..8bd4b79e363f 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -42,6 +42,8 @@ struct kvm_pmu_ops {
void (*cleanup)(struct kvm_vcpu *vcpu);
bool (*is_rdpmc_passthru_allowed)(struct kvm_vcpu *vcpu);
void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
+ void (*save_pmu_context)(struct kvm_vcpu *vcpu);
+ void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
const u64 EVENTSEL_EVENT;
const int MAX_NR_GP_COUNTERS;
@@ -294,6 +296,8 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel);
bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
+void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu);
+void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (28 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 29/54] KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU context Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-14 8:08 ` Mi, Dapeng
2024-05-06 5:29 ` [PATCH v2 31/54] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Mingwei Zhang
` (24 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Implement the save/restore of PMU state for the passthrough PMU on Intel.
In passthrough mode, KVM exclusively owns the PMU HW while control flow is
within the scope of the passthrough PMU. Thus, KVM needs to save the host
PMU state and gain full HW PMU ownership. Conversely, the host regains
ownership of the PMU HW from KVM when control flow leaves the scope of the
passthrough PMU.
Implement PMU context switches for Intel CPUs and opportunistically use
rdpmcl() instead of rdmsrl() when reading counters, since the former has
lower latency on Intel CPUs.
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kvm/pmu.c | 46 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 41 +++++++++++++++++++++++++++++++-
2 files changed, 86 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 782b564bdf96..13197472e31d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1068,14 +1068,60 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u32 i;
+
lockdep_assert_irqs_disabled();
static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
+
+ /*
+ * Clear hardware selector MSR content and its counter to avoid
+ * leakage and also avoid this guest GP counter get accidentally
+ * enabled during host running when host enable global ctrl.
+ */
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmc = &pmu->gp_counters[i];
+ rdmsrl(pmc->msr_counter, pmc->counter);
+ rdmsrl(pmc->msr_eventsel, pmc->eventsel);
+ if (pmc->counter)
+ wrmsrl(pmc->msr_counter, 0);
+ if (pmc->eventsel)
+ wrmsrl(pmc->msr_eventsel, 0);
+ }
+
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmc = &pmu->fixed_counters[i];
+ rdmsrl(pmc->msr_counter, pmc->counter);
+ if (pmc->counter)
+ wrmsrl(pmc->msr_counter, 0);
+ }
}
void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int i;
+
lockdep_assert_irqs_disabled();
static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
+
+ /*
+ * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
+ * in this case will be intercepted. Accessing to these counters and
+ * selectors will cause #GP in the guest.
+ */
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmc = &pmu->gp_counters[i];
+ wrmsrl(pmc->msr_counter, pmc->counter);
+ wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);
+ }
+
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmc = &pmu->fixed_counters[i];
+ wrmsrl(pmc->msr_counter, pmc->counter);
+ }
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 7852ba25a240..a23cf9ca224e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -572,7 +572,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
}
for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
- pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
+ pmu->fixed_counters[i].msr_eventsel = 0;
pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
}
}
@@ -799,6 +799,43 @@ static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
}
+static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ /* Global ctrl register is already saved at VM-exit. */
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
+ /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
+ if (pmu->global_status)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
+
+ rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+ /*
+ * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
+ * also avoid these guest fixed counters get accidentially enabled
+ * during host running when host enable global ctrl.
+ */
+ if (pmu->fixed_ctr_ctrl)
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+}
+
+static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u64 global_status, toggle;
+
+ /* Clear host global_ctrl MSR if non-zero. */
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
+ toggle = pmu->global_status ^ global_status;
+ if (global_status & toggle)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
+ if (pmu->global_status & toggle)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
+
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+}
+
struct kvm_pmu_ops intel_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -812,6 +849,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
.cleanup = intel_pmu_cleanup,
.is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
+ .save_pmu_context = intel_save_guest_pmu_context,
+ .restore_pmu_context = intel_restore_guest_pmu_context,
.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = 1,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* Re: [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
2024-05-06 5:29 ` [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
@ 2024-05-14 8:08 ` Mi, Dapeng
2024-05-30 5:34 ` Mingwei Zhang
0 siblings, 1 reply; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-14 8:08 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/6/2024 1:29 PM, Mingwei Zhang wrote:
> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> and gains the full HW PMU ownership. On the contrary, host regains the
> ownership of PMU HW from KVM when control flow leaves the scope of
> passthrough PMU.
>
> Implement PMU context switches for Intel CPUs and opptunistically use
> rdpmcl() instead of rdmsrl() when reading counters since the former has
> lower latency in Intel CPUs.
It looks like the rdpmcl() optimization was removed from this patch, right?
The description does not match the code.
>
> Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
The commit tags don't follow the rules in the Linux documentation; we need
to change them in the next version.
https://docs.kernel.org/process/submitting-patches.html#:~:text=Co%2Ddeveloped%2Dby%3A%20states,work%20on%20a%20single%20patch.
> ---
> arch/x86/kvm/pmu.c | 46 ++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/vmx/pmu_intel.c | 41 +++++++++++++++++++++++++++++++-
> 2 files changed, 86 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 782b564bdf96..13197472e31d 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -1068,14 +1068,60 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>
> void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
> {
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + u32 i;
> +
> lockdep_assert_irqs_disabled();
>
> static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
> +
> + /*
> + * Clear hardware selector MSR content and its counter to avoid
> + * leakage and also avoid this guest GP counter get accidentally
> + * enabled during host running when host enable global ctrl.
> + */
> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> + pmc = &pmu->gp_counters[i];
> + rdmsrl(pmc->msr_counter, pmc->counter);
I understand we want to use common code to manipulate PMU MSRs as much as
possible, but I don't think we should sacrifice performance. rdpmcl() has
better performance than rdmsrl(). If AMD CPUs don't support the rdpmc
instruction, I think we should move this into the vendor-specific
xxx_save/restore_pmu_context() helpers.
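A rough sketch of what an Intel-specific counter save path using rdpmcl()
might look like, assuming GP counter i maps to RDPMC index i and fixed
counter i to INTEL_PMC_FIXED_RDPMC_BASE | i, with the host perf headers
providing rdpmcl() (illustration only, selector handling omitted):

        static void intel_save_pmu_counters_sketch(struct kvm_pmu *pmu)
        {
                struct kvm_pmc *pmc;
                int i;

                for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
                        pmc = &pmu->gp_counters[i];
                        rdpmcl(i, pmc->counter); /* cheaper than rdmsrl() */
                        if (pmc->counter)
                                wrmsrl(pmc->msr_counter, 0);
                }
                for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
                        pmc = &pmu->fixed_counters[i];
                        rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
                        if (pmc->counter)
                                wrmsrl(pmc->msr_counter, 0);
                }
        }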
> + rdmsrl(pmc->msr_eventsel, pmc->eventsel);
> + if (pmc->counter)
> + wrmsrl(pmc->msr_counter, 0);
> + if (pmc->eventsel)
> + wrmsrl(pmc->msr_eventsel, 0);
> + }
> +
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> + pmc = &pmu->fixed_counters[i];
> + rdmsrl(pmc->msr_counter, pmc->counter);
> + if (pmc->counter)
> + wrmsrl(pmc->msr_counter, 0);
> + }
> }
>
> void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
> {
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + int i;
> +
> lockdep_assert_irqs_disabled();
>
> static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
> +
> + /*
> + * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
> + * in this case will be intercepted. Accessing to these counters and
> + * selectors will cause #GP in the guest.
> + */
> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> + pmc = &pmu->gp_counters[i];
> + wrmsrl(pmc->msr_counter, pmc->counter);
> + wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);
> + }
> +
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> + pmc = &pmu->fixed_counters[i];
> + wrmsrl(pmc->msr_counter, pmc->counter);
> + }
> }
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 7852ba25a240..a23cf9ca224e 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -572,7 +572,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> }
>
> for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> - pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
> + pmu->fixed_counters[i].msr_eventsel = 0;
Why initialize msr_eventsel to 0 instead of the real MSR address here?
> pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
> }
> }
> @@ -799,6 +799,43 @@ static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
> }
>
> +static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> + /* Global ctrl register is already saved at VM-exit. */
> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> + if (pmu->global_status)
> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> + /*
> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> + * also avoid these guest fixed counters get accidentially enabled
> + * during host running when host enable global ctrl.
> + */
> + if (pmu->fixed_ctr_ctrl)
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> +}
> +
> +static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + u64 global_status, toggle;
> +
> + /* Clear host global_ctrl MSR if non-zero. */
> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> + toggle = pmu->global_status ^ global_status;
> + if (global_status & toggle)
> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
> + if (pmu->global_status & toggle)
> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
> +
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> +}
> +
> struct kvm_pmu_ops intel_pmu_ops __initdata = {
> .rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
> .msr_idx_to_pmc = intel_msr_idx_to_pmc,
> @@ -812,6 +849,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
> .cleanup = intel_pmu_cleanup,
> .is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
> .passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
> + .save_pmu_context = intel_save_guest_pmu_context,
> + .restore_pmu_context = intel_restore_guest_pmu_context,
> .EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
> .MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
> .MIN_NR_GP_COUNTERS = 1,
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
2024-05-14 8:08 ` Mi, Dapeng
@ 2024-05-30 5:34 ` Mingwei Zhang
0 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 5:34 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Kan Liang,
Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Tue, May 14, 2024, Mi, Dapeng wrote:
>
> On 5/6/2024 1:29 PM, Mingwei Zhang wrote:
> > Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> > passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> > the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> > and gains the full HW PMU ownership. On the contrary, host regains the
> > ownership of PMU HW from KVM when control flow leaves the scope of
> > passthrough PMU.
> >
> > Implement PMU context switches for Intel CPUs and opptunistically use
> > rdpmcl() instead of rdmsrl() when reading counters since the former has
> > lower latency in Intel CPUs.
>
> It looks rdpmcl() optimization is removed from this patch, right? The
> description is not identical with code.
That is correct. I was debugging this for a while, and since we don't
have rdpmcl_safe(), one of the bugs caused rdpmc() to crash the kernel.
I really don't like rdpmc(), but I will add it back in the next version.
>
>
> >
> > Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
>
> The commit tags doesn't follow the rules in Linux document, we need to
> change it in next version.
Will do.
>
> https://docs.kernel.org/process/submitting-patches.html#:~:text=Co%2Ddeveloped%2Dby%3A%20states,work%20on%20a%20single%20patch.
>
> > ---
> > arch/x86/kvm/pmu.c | 46 ++++++++++++++++++++++++++++++++++++
> > arch/x86/kvm/vmx/pmu_intel.c | 41 +++++++++++++++++++++++++++++++-
> > 2 files changed, 86 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 782b564bdf96..13197472e31d 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -1068,14 +1068,60 @@ void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >
> > void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
> > {
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + u32 i;
> > +
> > lockdep_assert_irqs_disabled();
> >
> > static_call_cond(kvm_x86_pmu_save_pmu_context)(vcpu);
> > +
> > + /*
> > + * Clear hardware selector MSR content and its counter to avoid
> > + * leakage and also avoid this guest GP counter get accidentally
> > + * enabled during host running when host enable global ctrl.
> > + */
> > + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > + pmc = &pmu->gp_counters[i];
> > + rdmsrl(pmc->msr_counter, pmc->counter);
>
> I understood we want to use common code to manipulate PMU MSRs as much as
> possible, but I don't think we should sacrifice performance. rdpmcl() has
> better performance than rdmsrl(). If AMD CPUs doesn't support rdpmc
> instruction, I think we should move this into vendor specific
> xxx_save/restore_pmu_context helpers().
>
>
> > + rdmsrl(pmc->msr_eventsel, pmc->eventsel);
> > + if (pmc->counter)
> > + wrmsrl(pmc->msr_counter, 0);
> > + if (pmc->eventsel)
> > + wrmsrl(pmc->msr_eventsel, 0);
> > + }
> > +
> > + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > + pmc = &pmu->fixed_counters[i];
> > + rdmsrl(pmc->msr_counter, pmc->counter);
> > + if (pmc->counter)
> > + wrmsrl(pmc->msr_counter, 0);
> > + }
> > }
> >
> > void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
> > {
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + int i;
> > +
> > lockdep_assert_irqs_disabled();
> >
> > static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
> > +
> > + /*
> > + * No need to zero out unexposed GP/fixed counters/selectors since RDPMC
> > + * in this case will be intercepted. Accessing to these counters and
> > + * selectors will cause #GP in the guest.
> > + */
> > + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > + pmc = &pmu->gp_counters[i];
> > + wrmsrl(pmc->msr_counter, pmc->counter);
> > + wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);
> > + }
> > +
> > + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > + pmc = &pmu->fixed_counters[i];
> > + wrmsrl(pmc->msr_counter, pmc->counter);
> > + }
> > }
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index 7852ba25a240..a23cf9ca224e 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -572,7 +572,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> > }
> >
> > for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > - pmu->fixed_counters[i].msr_eventsel = MSR_CORE_PERF_FIXED_CTR_CTRL;
> > + pmu->fixed_counters[i].msr_eventsel = 0;
> Why to initialize msr_eventsel to 0 instead of the real MSR address here?
> > pmu->fixed_counters[i].msr_counter = MSR_CORE_PERF_FIXED_CTR0 + i;
> > }
> > }
> > @@ -799,6 +799,43 @@ static void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> > vmx_set_intercept_for_msr(vcpu, MSR_CORE_PERF_GLOBAL_OVF_CTRL, MSR_TYPE_RW, msr_intercept);
> > }
> >
> > +static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > +
> > + /* Global ctrl register is already saved at VM-exit. */
> > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> > + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> > + if (pmu->global_status)
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> > +
> > + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > + /*
> > + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> > + * also avoid these guest fixed counters get accidentially enabled
> > + * during host running when host enable global ctrl.
> > + */
> > + if (pmu->fixed_ctr_ctrl)
> > + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> > +}
> > +
> > +static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + u64 global_status, toggle;
> > +
> > + /* Clear host global_ctrl MSR if non-zero. */
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > + toggle = pmu->global_status ^ global_status;
> > + if (global_status & toggle)
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
> > + if (pmu->global_status & toggle)
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
> > +
> > + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > +}
> > +
> > struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > .rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
> > .msr_idx_to_pmc = intel_msr_idx_to_pmc,
> > @@ -812,6 +849,8 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > .cleanup = intel_pmu_cleanup,
> > .is_rdpmc_passthru_allowed = intel_is_rdpmc_passthru_allowed,
> > .passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
> > + .save_pmu_context = intel_save_guest_pmu_context,
> > + .restore_pmu_context = intel_restore_guest_pmu_context,
> > .EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
> > .MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
> > .MIN_NR_GP_COUNTERS = 1,
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 31/54] KVM: x86/pmu: Make check_pmu_event_filter() an exported function
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (29 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 30/54] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 32/54] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Mingwei Zhang
` (23 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Export check_pmu_event_filter() so that it is usable by vendor modules like
kvm_intel. This is needed because the passthrough PMU intercepts guest
writes to event selectors and does the event filter checking directly
inside the vendor-specific set_msr(), instead of deferring to the
KVM_REQ_PMU handler.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.c | 3 ++-
arch/x86/kvm/pmu.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 13197472e31d..0f587651a49e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -443,7 +443,7 @@ static bool is_fixed_event_allowed(struct kvm_x86_pmu_event_filter *filter,
return true;
}
-static bool check_pmu_event_filter(struct kvm_pmc *pmc)
+bool check_pmu_event_filter(struct kvm_pmc *pmc)
{
struct kvm_x86_pmu_event_filter *filter;
struct kvm *kvm = pmc->vcpu->kvm;
@@ -457,6 +457,7 @@ static bool check_pmu_event_filter(struct kvm_pmc *pmc)
return is_fixed_event_allowed(filter, pmc->idx);
}
+EXPORT_SYMBOL_GPL(check_pmu_event_filter);
static bool pmc_event_is_allowed(struct kvm_pmc *pmc)
{
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 8bd4b79e363f..9cde62f3988e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -298,6 +298,7 @@ bool kvm_pmu_check_rdpmc_passthrough(struct kvm_vcpu *vcpu);
void kvm_pmu_passthrough_pmu_msrs(struct kvm_vcpu *vcpu);
void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu);
void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu);
+bool check_pmu_event_filter(struct kvm_pmc *pmc);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 32/54] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (30 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 31/54] KVM: x86/pmu: Make check_pmu_event_filter() an exported function Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 33/54] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
` (22 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Only allow writing to the event selector if the event is allowed by the
filter. Since the passthrough PMU implementation does the PMU context
switch at the VM Enter/Exit boundary, even if the event selector value
passes the check, it cannot be written directly to HW because the PMU HW
is owned by the host PMU at that moment. Because of that, introduce
eventsel_hw to cache the value, which will be written into HW just before
VM entry.
Note that regardless of whether an event value is allowed, the value is
cached in pmc->eventsel and the guest VM can always read the cached value
back. This behavior is consistent with the HW CPU design.
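In other words, the intended write path looks roughly like this (an
illustrative sketch with a made-up helper name, not the patch itself; the
actual change lives in intel_pmu_set_msr() below and additionally clears
the counter when an enabled event becomes disallowed):

        static void example_write_gp_eventsel(struct kvm_pmc *pmc, u64 data)
        {
                /* Always cache the raw value so guest reads see what was
                 * written. */
                pmc->eventsel = data;
                /* Only filter-allowed values reach the HW-bound copy. */
                if (check_pmu_event_filter(pmc))
                        pmc->eventsel_hw = data;
                else
                        pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
        }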
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.c | 5 ++---
arch/x86/kvm/vmx/pmu_intel.c | 13 ++++++++++++-
3 files changed, 15 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8b4ea9bdcc74..b396000b9440 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -519,6 +519,7 @@ struct kvm_pmc {
*/
u64 emulated_counter;
u64 eventsel;
+ u64 eventsel_hw;
u64 msr_counter;
u64 msr_eventsel;
struct perf_event *perf_event;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 0f587651a49e..2ad71020a2c0 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1085,10 +1085,9 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
pmc = &pmu->gp_counters[i];
rdmsrl(pmc->msr_counter, pmc->counter);
- rdmsrl(pmc->msr_eventsel, pmc->eventsel);
if (pmc->counter)
wrmsrl(pmc->msr_counter, 0);
- if (pmc->eventsel)
+ if (pmc->eventsel_hw)
wrmsrl(pmc->msr_eventsel, 0);
}
@@ -1118,7 +1117,7 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
pmc = &pmu->gp_counters[i];
wrmsrl(pmc->msr_counter, pmc->counter);
- wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel);
+ wrmsrl(pmc->msr_eventsel, pmu->gp_counters[i].eventsel_hw);
}
for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index a23cf9ca224e..e706d107ff28 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -399,7 +399,18 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (data & reserved_bits)
return 1;
- if (data != pmc->eventsel) {
+ if (is_passthrough_pmu_enabled(vcpu)) {
+ pmc->eventsel = data;
+ if (!check_pmu_event_filter(pmc)) {
+ if (pmc->eventsel_hw &
+ ARCH_PERFMON_EVENTSEL_ENABLE) {
+ pmc->eventsel_hw &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
+ pmc->counter = 0;
+ }
+ return 0;
+ }
+ pmc->eventsel_hw = data;
+ } else if (data != pmc->eventsel) {
pmc->eventsel = data;
kvm_pmu_request_counter_reprogram(pmc);
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 33/54] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (31 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 32/54] KVM: x86/pmu: Allow writing to event selector for GP counters if event is allowed Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:29 ` [PATCH v2 34/54] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
` (21 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Allow writing to the fixed counter selector if the counter is exposed. If a
fixed counter is filtered out, that counter won't be enabled in HW.
Since the passthrough PMU implements the context switch at the VM
Enter/Exit boundary, the guest value cannot be written directly to HW
because the HW PMU is owned by the host. Introduce a new field
fixed_ctr_ctrl_hw in kvm_pmu to cache the guest value, which will be
assigned to HW at PMU context restore.
Since the passthrough PMU intercepts writes to the fixed counter selector,
there is no need to read the value at PMU context save, but still clear the
fixed counter ctrl MSR and the counters when switching out to the host PMU.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++----
2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b396000b9440..9857dda8b851 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -546,6 +546,7 @@ struct kvm_pmu {
unsigned nr_arch_fixed_counters;
unsigned available_event_types;
u64 fixed_ctr_ctrl;
+ u64 fixed_ctr_ctrl_hw;
u64 fixed_ctr_ctrl_mask;
u64 global_ctrl;
u64 global_status;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index e706d107ff28..f0f99f5c21c5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -34,6 +34,25 @@
#define MSR_PMC_FULL_WIDTH_BIT (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)
+static void reprogram_fixed_counters_in_passthrough_pmu(struct kvm_pmu *pmu, u64 data)
+{
+ struct kvm_pmc *pmc;
+ u64 new_data = 0;
+ int i;
+
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR0 + i);
+ if (check_pmu_event_filter(pmc)) {
+ pmc->current_config = fixed_ctrl_field(data, i);
+ new_data |= (pmc->current_config << (i * 4));
+ } else {
+ pmc->counter = 0;
+ }
+ }
+ pmu->fixed_ctr_ctrl_hw = new_data;
+ pmu->fixed_ctr_ctrl = data;
+}
+
static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
{
struct kvm_pmc *pmc;
@@ -351,7 +370,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (data & pmu->fixed_ctr_ctrl_mask)
return 1;
- if (pmu->fixed_ctr_ctrl != data)
+ if (is_passthrough_pmu_enabled(vcpu))
+ reprogram_fixed_counters_in_passthrough_pmu(pmu, data);
+ else if (pmu->fixed_ctr_ctrl != data)
reprogram_fixed_counters(pmu, data);
break;
case MSR_IA32_PEBS_ENABLE:
@@ -820,13 +841,12 @@ static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
if (pmu->global_status)
wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
- rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
/*
* Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
* also avoid these guest fixed counters get accidentially enabled
* during host running when host enable global ctrl.
*/
- if (pmu->fixed_ctr_ctrl)
+ if (pmu->fixed_ctr_ctrl_hw)
wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
}
@@ -844,7 +864,7 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
if (pmu->global_status & toggle)
wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
- wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
}
struct kvm_pmu_ops intel_pmu_ops __initdata = {
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 34/54] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (32 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 33/54] KVM: x86/pmu: Allow writing to fixed counter selector if counter is exposed Mingwei Zhang
@ 2024-05-06 5:29 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 35/54] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
` (20 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:29 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
In PMU passthrough mode, use the global_ctrl field in struct kvm_pmu as the
cached value. This is convenient for KVM to set and get the value from the
host side. In addition, load and save the value across the VM enter/exit
boundary in the following way:
- At VM exit, if the processor supports
VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL, read the guest IA32_PERF_GLOBAL_CTRL
from the GUEST_IA32_PERF_GLOBAL_CTRL VMCS field, else read it from the
VM-exit MSR-store array in the VMCS. The value read is then assigned to
global_ctrl.
- At VM entry, if the processor supports
VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL, write the cached global_ctrl to the
GUEST_IA32_PERF_GLOBAL_CTRL VMCS field, else write it to the VM-entry
MSR-load array in the VMCS so that it is loaded into the MSR on VM entry.
Implement the above logic in two helper functions and invoke them around
the VM Enter/exit boundary.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 49 ++++++++++++++++++++++++++++++++-
2 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9857dda8b851..54a56eda77f4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -598,6 +598,8 @@ struct kvm_pmu {
u8 event_count;
bool passthrough;
+ int global_ctrl_slot_in_autoload;
+ int global_ctrl_slot_in_autostore;
};
struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 62b5913abdd6..c86b768743a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4421,6 +4421,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
}
m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
m->val[i].value = 0;
+ vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autoload = i;
}
/*
* Setup auto clear host PERF_GLOBAL_CTRL msr at vm exit.
@@ -4448,6 +4449,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
}
m->val[i].index = MSR_CORE_PERF_GLOBAL_CTRL;
+ vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autostore = i;
}
} else {
if (!(vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)) {
@@ -4458,6 +4460,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
m->val[i] = m->val[m->nr];
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m->nr);
}
+ vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autoload = -ENOENT;
}
if (!(vmexit_ctrl & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)) {
m = &vmx->msr_autoload.host;
@@ -4476,6 +4479,7 @@ static void vmx_set_perf_global_ctrl(struct vcpu_vmx *vmx)
m->val[i] = m->val[m->nr];
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, m->nr);
}
+ vcpu_to_pmu(&vmx->vcpu)->global_ctrl_slot_in_autostore = -ENOENT;
}
}
@@ -7236,7 +7240,7 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
}
-static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+static void __atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
int i, nr_msrs;
struct perf_guest_switch_msr *msrs;
@@ -7259,6 +7263,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
msrs[i].host, false);
}
+static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
+ int i;
+
+ if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
+ pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
+ } else {
+ i = pmu->global_ctrl_slot_in_autostore;
+ pmu->global_ctrl = vmx->msr_autostore.guest.val[i].value;
+ }
+}
+
+static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
+ u64 global_ctrl = pmu->global_ctrl;
+ int i;
+
+ if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
+ vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
+ } else {
+ i = pmu->global_ctrl_slot_in_autoload;
+ vmx->msr_autoload.guest.val[i].value = global_ctrl;
+ }
+}
+
+static void __atomic_switch_perf_msrs_in_passthrough_pmu(struct vcpu_vmx *vmx)
+{
+ load_perf_global_ctrl_in_passthrough_pmu(vmx);
+}
+
+static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+{
+ if (is_passthrough_pmu_enabled(&vmx->vcpu))
+ __atomic_switch_perf_msrs_in_passthrough_pmu(vmx);
+ else
+ __atomic_switch_perf_msrs(vmx);
+}
+
static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7369,6 +7413,9 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
vcpu->arch.cr2 = native_read_cr2();
vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+ if (is_passthrough_pmu_enabled(vcpu))
+ save_perf_global_ctrl_in_passthrough_pmu(vmx);
+
vmx->idt_vectoring_info = 0;
vmx_enable_fb_clear(vmx);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread
* [PATCH v2 35/54] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (33 preceding siblings ...)
2024-05-06 5:29 ` [PATCH v2 34/54] KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Mingwei Zhang
` (19 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Exclude the existing vLBR logic from the passthrough PMU because it does
not support LBR-related MSRs. To avoid any side effects, do not call
vLBR-related code in either vcpu_enter_guest() or the PMI injection
function.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/vmx/pmu_intel.c | 13 ++++++++-----
arch/x86/kvm/vmx/vmx.c | 2 +-
2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f0f99f5c21c5..6db759147896 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -660,13 +660,16 @@ static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
{
- u8 version = vcpu_to_pmu(vcpu)->version;
+ u8 version;
- if (!intel_pmu_lbr_is_enabled(vcpu))
- return;
+ if (!is_passthrough_pmu_enabled(vcpu)) {
+ if (!intel_pmu_lbr_is_enabled(vcpu))
+ return;
- if (version > 1 && version < 4)
- intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+ version = vcpu_to_pmu(vcpu)->version;
+ if (version > 1 && version < 4)
+ intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+ }
}
static void vmx_update_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c86b768743a9..b6ed3ccdf1af 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7525,7 +7525,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit)
pt_guest_enter(vmx);
atomic_switch_perf_msrs(vmx);
- if (intel_pmu_lbr_is_enabled(vcpu))
+ if (!is_passthrough_pmu_enabled(&vmx->vcpu) && intel_pmu_lbr_is_enabled(vcpu))
vmx_passthrough_lbr_msrs(vcpu);
if (enable_preemption_timer)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (34 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 35/54] KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-07-10 8:37 ` Sandipan Das
2024-05-06 5:30 ` [PATCH v2 37/54] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
` (18 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Switch the PMI handler at the KVM context switch boundary because KVM uses
a separate maskable interrupt vector, rather than the NMI used by the host
PMU, to process its own PMIs. So invoke the perf API that allows
registration of the PMI handler.
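For orientation, below is a condensed sketch of how this hook is intended to
pair with the perf-side vector switch added later in the series (an
assumption-based simplification for illustration only, not the posted code;
the real call sites live in kvm_pmu_{save,restore}_pmu_context()):
static void sketch_guest_pmu_context_switch(struct kvm_vcpu *vcpu)
{
	lockdep_assert_irqs_disabled();
	/* Entering guest PMU context: route PMIs to KVM's maskable vector. */
	x86_perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
	/* ... the guest runs with the PMU passed through ... */
	/* Leaving guest PMU context: restore NMI-based host PMI delivery. */
	x86_perf_guest_exit();
}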
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 2ad71020a2c0..a12012a00c11 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1097,6 +1097,8 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
if (pmc->counter)
wrmsrl(pmc->msr_counter, 0);
}
+
+ x86_perf_guest_exit();
}
void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
@@ -1107,6 +1109,8 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
lockdep_assert_irqs_disabled();
+ x86_perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
+
static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
/*
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* Re: [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
2024-05-06 5:30 ` [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Mingwei Zhang
@ 2024-07-10 8:37 ` Sandipan Das
2024-07-10 10:01 ` Zhang, Xiong Y
0 siblings, 1 reply; 116+ messages in thread
From: Sandipan Das @ 2024-07-10 8:37 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang
Cc: Jim Mattson, Ian Rogers, Stephane Eranian, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users,
Manali Shukla
On 5/6/2024 11:00 AM, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> Switch PMI handler at KVM context switch boundary because KVM uses a
> separate maskable interrupt vector other than the NMI handler for the host
> PMU to process its own PMIs. So invoke the perf API that allows
> registration of the PMI handler.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/kvm/pmu.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 2ad71020a2c0..a12012a00c11 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -1097,6 +1097,8 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu *vcpu)
> if (pmc->counter)
> wrmsrl(pmc->msr_counter, 0);
> }
> +
> + x86_perf_guest_exit();
> }
>
> void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
> @@ -1107,6 +1109,8 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
>
> lockdep_assert_irqs_disabled();
>
> + x86_perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVTPC));
> +
Reading the LVTPC for a vCPU that does not have a struct kvm_lapic allocated
leads to a NULL pointer dereference. I noticed this while trying to run a
minimalistic guest like https://github.com/dpw/kvm-hello-world
Does this require a kvm_lapic_enabled() or similar check?
> static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
>
> /*
^ permalink raw reply [flat|nested] 116+ messages in thread
* RE: [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
2024-07-10 8:37 ` Sandipan Das
@ 2024-07-10 10:01 ` Zhang, Xiong Y
2024-07-10 12:30 ` Sandipan Das
0 siblings, 1 reply; 116+ messages in thread
From: Zhang, Xiong Y @ 2024-07-10 10:01 UTC (permalink / raw)
To: Sandipan Das, Zhang, Mingwei, Sean Christopherson, Paolo Bonzini,
Dapeng Mi, Liang, Kan, Zhenyu Wang
Cc: Jim Mattson, Ian Rogers, Eranian, Stephane, Namhyung Kim,
gce-passthrou-pmu-dev@google.com, Alt, Samantha, Lv, Zhiyuan,
Xu, Yanfei, maobibo, Like Xu, Peter Zijlstra, kvm@vger.kernel.org,
linux-perf-users@vger.kernel.org, Manali Shukla
> On 5/6/2024 11:00 AM, Mingwei Zhang wrote:
> > From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >
> > Switch PMI handler at KVM context switch boundary because KVM uses a
> > separate maskable interrupt vector other than the NMI handler for the
> > host PMU to process its own PMIs. So invoke the perf API that allows
> > registration of the PMI handler.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> > arch/x86/kvm/pmu.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index
> > 2ad71020a2c0..a12012a00c11 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -1097,6 +1097,8 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu
> *vcpu)
> > if (pmc->counter)
> > wrmsrl(pmc->msr_counter, 0);
> > }
> > +
> > + x86_perf_guest_exit();
> > }
> >
> > void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu) @@ -1107,6
> > +1109,8 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
> >
> > lockdep_assert_irqs_disabled();
> >
> > + x86_perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic,
> > +APIC_LVTPC));
> > +
>
> Reading the LVTPC for a vCPU that does not have a struct kvm_lapic allocated
> leads to a NULL pointer dereference. I noticed this while trying to run a
> minimalistic guest like https://github.com/dpw/kvm-hello-world
>
> Does this require a kvm_lapic_enabled() or similar check?
>
The Intel path already has a lapic_in_kernel() check in "[RFC PATCH v3 16/54] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs":
+ pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu &&
+ lapic_in_kernel(vcpu);
The AMD path seems to need this check as well; we could move it into common code.
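For illustration, a minimal sketch of moving that check into common code
could look like the following (purely hypothetical, not part of the posted
series; the helper name is made up):
static void kvm_pmu_update_passthrough(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	/*
	 * Only treat the vPMU as passthrough when an in-kernel local APIC
	 * exists, so the APIC_LVTPC read cannot dereference a NULL
	 * vcpu->arch.apic.
	 */
	pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu &&
			   lapic_in_kernel(vcpu);
}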
Thanks
> > static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
> >
> > /*
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
2024-07-10 10:01 ` Zhang, Xiong Y
@ 2024-07-10 12:30 ` Sandipan Das
0 siblings, 0 replies; 116+ messages in thread
From: Sandipan Das @ 2024-07-10 12:30 UTC (permalink / raw)
To: Zhang, Xiong Y, Zhang, Mingwei, Sean Christopherson,
Paolo Bonzini, Dapeng Mi, Liang, Kan, Zhenyu Wang
Cc: Jim Mattson, Ian Rogers, Eranian, Stephane, Namhyung Kim,
gce-passthrou-pmu-dev@google.com, Alt, Samantha, Lv, Zhiyuan,
Xu, Yanfei, maobibo, Like Xu, Peter Zijlstra, kvm@vger.kernel.org,
linux-perf-users@vger.kernel.org, Manali Shukla
On 7/10/2024 3:31 PM, Zhang, Xiong Y wrote:
>
>> On 5/6/2024 11:00 AM, Mingwei Zhang wrote:
>>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>>
>>> Switch PMI handler at KVM context switch boundary because KVM uses a
>>> separate maskable interrupt vector other than the NMI handler for the
>>> host PMU to process its own PMIs. So invoke the perf API that allows
>>> registration of the PMI handler.
>>>
>>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>> ---
>>> arch/x86/kvm/pmu.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index
>>> 2ad71020a2c0..a12012a00c11 100644
>>> --- a/arch/x86/kvm/pmu.c
>>> +++ b/arch/x86/kvm/pmu.c
>>> @@ -1097,6 +1097,8 @@ void kvm_pmu_save_pmu_context(struct kvm_vcpu
>> *vcpu)
>>> if (pmc->counter)
>>> wrmsrl(pmc->msr_counter, 0);
>>> }
>>> +
>>> + x86_perf_guest_exit();
>>> }
>>>
>>> void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu) @@ -1107,6
>>> +1109,8 @@ void kvm_pmu_restore_pmu_context(struct kvm_vcpu *vcpu)
>>>
>>> lockdep_assert_irqs_disabled();
>>>
>>> + x86_perf_guest_enter(kvm_lapic_get_reg(vcpu->arch.apic,
>>> +APIC_LVTPC));
>>> +
>>
>> Reading the LVTPC for a vCPU that does not have a struct kvm_lapic allocated
>> leads to a NULL pointer dereference. I noticed this while trying to run a
>> minimalistic guest like https://github.com/dpw/kvm-hello-world
>>
>> Does this require a kvm_lapic_enabled() or similar check?
>>
>
> Intel processor has lapci_in_kernel() checking in "[RFC PATCH v3 16/54] KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs".
> + pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu &&
> + lapic_in_kernel(vcpu);
>
> AMD processor seems need this checking also. we could move this checking into common place.
>
Thanks. Adding that fixes the issue.
>
>>> static_call_cond(kvm_x86_pmu_restore_pmu_context)(vcpu);
>>>
>>> /*
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 37/54] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (35 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 36/54] KVM: x86/pmu: Switch PMI handler at KVM context switch boundary Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Mingwei Zhang
` (17 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
When the passthrough PMU is enabled by both KVM and perf, KVM calls
perf_get_mediated_pmu() to take exclusive ownership of the x86 core PMU at
VM creation and calls perf_put_mediated_pmu() to return the x86 core PMU to
host perf at VM destruction.
perf_get_mediated_pmu() fails when the host has system-wide perf events
without exclude_guest = 1; those events must be disabled to enable a VM with
the passthrough PMU.
Once a VM with the passthrough PMU starts, perf will refuse to create
system-wide perf events without exclude_guest = 1 until the VM is closed.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
---
arch/x86/kvm/x86.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db395c00955f..3152587eca5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6674,8 +6674,11 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
if (!kvm->created_vcpus) {
kvm->arch.enable_pmu = !(cap->args[0] & KVM_PMU_CAP_DISABLE);
/* Disable passthrough PMU if enable_pmu is false. */
- if (!kvm->arch.enable_pmu)
+ if (!kvm->arch.enable_pmu) {
+ if (kvm->arch.enable_passthrough_pmu)
+ perf_put_mediated_pmu();
kvm->arch.enable_passthrough_pmu = false;
+ }
r = 0;
}
mutex_unlock(&kvm->lock);
@@ -12578,6 +12581,14 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm->arch.guest_can_read_msr_platform_info = true;
kvm->arch.enable_pmu = enable_pmu;
kvm->arch.enable_passthrough_pmu = enable_passthrough_pmu;
+ if (kvm->arch.enable_passthrough_pmu) {
+ ret = perf_get_mediated_pmu();
+ if (ret < 0) {
+ kvm_err("failed to enable mediated passthrough pmu, please disable system wide perf events\n");
+ goto out_uninit_mmu;
+ }
+ }
+
#if IS_ENABLED(CONFIG_HYPERV)
spin_lock_init(&kvm->arch.hv_root_tdp_lock);
@@ -12726,6 +12737,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
__x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
mutex_unlock(&kvm->slots_lock);
}
+ if (kvm->arch.enable_passthrough_pmu)
+ perf_put_mediated_pmu();
kvm_unload_vcpu_mmus(kvm);
static_call_cond(kvm_x86_vm_destroy)(kvm);
kvm_free_msr_filter(srcu_dereference_check(kvm->arch.msr_filter, &kvm->srcu, 1));
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (36 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 37/54] KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-07 9:39 ` Peter Zijlstra
2024-05-06 5:30 ` [PATCH v2 39/54] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
` (16 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
The perf subsystem should stop and restart all host-level perf events when
entering and leaving the passthrough PMU, respectively. So invoke the perf
APIs from the PMU context switch functions.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f5a043410614..6fe467bca809 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -705,6 +705,8 @@ void x86_perf_guest_enter(u32 guest_lvtpc)
{
lockdep_assert_irqs_disabled();
+ perf_guest_enter();
+
apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
(guest_lvtpc & APIC_LVT_MASKED));
}
@@ -715,6 +717,8 @@ void x86_perf_guest_exit(void)
lockdep_assert_irqs_disabled();
apic_write(APIC_LVTPC, APIC_DM_NMI);
+
+ perf_guest_exit();
}
EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* Re: [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
2024-05-06 5:30 ` [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Mingwei Zhang
@ 2024-05-07 9:39 ` Peter Zijlstra
2024-05-08 4:22 ` Mi, Dapeng
2024-05-30 4:34 ` Mingwei Zhang
0 siblings, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2024-05-07 9:39 UTC (permalink / raw)
To: Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Mon, May 06, 2024 at 05:30:03AM +0000, Mingwei Zhang wrote:
> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>
> perf subsystem should stop and restart all the perf events at the host
> level when entering and leaving passthrough PMU respectively. So invoke
> the perf API at PMU context switch functions.
>
> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/events/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index f5a043410614..6fe467bca809 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -705,6 +705,8 @@ void x86_perf_guest_enter(u32 guest_lvtpc)
> {
> lockdep_assert_irqs_disabled();
>
> + perf_guest_enter();
> +
> apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> (guest_lvtpc & APIC_LVT_MASKED));
> }
> @@ -715,6 +717,8 @@ void x86_perf_guest_exit(void)
> lockdep_assert_irqs_disabled();
>
> apic_write(APIC_LVTPC, APIC_DM_NMI);
> +
> + perf_guest_exit();
> }
> EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
*sigh*.. why does this patch exist? Please merge with the one that
introduces these functions.
This is making review really hard.
^ permalink raw reply [flat|nested] 116+ messages in thread* Re: [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
2024-05-07 9:39 ` Peter Zijlstra
@ 2024-05-08 4:22 ` Mi, Dapeng
2024-05-30 4:34 ` Mingwei Zhang
1 sibling, 0 replies; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-08 4:22 UTC (permalink / raw)
To: Peter Zijlstra, Mingwei Zhang
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Kan Liang,
Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On 5/7/2024 5:39 PM, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:30:03AM +0000, Mingwei Zhang wrote:
>> From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>>
>> perf subsystem should stop and restart all the perf events at the host
>> level when entering and leaving passthrough PMU respectively. So invoke
>> the perf API at PMU context switch functions.
>>
>> Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/x86/events/core.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index f5a043410614..6fe467bca809 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -705,6 +705,8 @@ void x86_perf_guest_enter(u32 guest_lvtpc)
>> {
>> lockdep_assert_irqs_disabled();
>>
>> + perf_guest_enter();
>> +
>> apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
>> (guest_lvtpc & APIC_LVT_MASKED));
>> }
>> @@ -715,6 +717,8 @@ void x86_perf_guest_exit(void)
>> lockdep_assert_irqs_disabled();
>>
>> apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +
>> + perf_guest_exit();
>> }
>> EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
> *sigh*.. why does this patch exist? Please merge with the one that
> introduces these functions.
>
> This is making review really hard.
Sure, we will adjust the patch sequence. Thanks.
^ permalink raw reply [flat|nested] 116+ messages in thread* Re: [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
2024-05-07 9:39 ` Peter Zijlstra
2024-05-08 4:22 ` Mi, Dapeng
@ 2024-05-30 4:34 ` Mingwei Zhang
1 sibling, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 4:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, kvm,
linux-perf-users
On Tue, May 07, 2024, Peter Zijlstra wrote:
> On Mon, May 06, 2024 at 05:30:03AM +0000, Mingwei Zhang wrote:
> > From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> >
> > perf subsystem should stop and restart all the perf events at the host
> > level when entering and leaving passthrough PMU respectively. So invoke
> > the perf API at PMU context switch functions.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> > arch/x86/events/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index f5a043410614..6fe467bca809 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -705,6 +705,8 @@ void x86_perf_guest_enter(u32 guest_lvtpc)
> > {
> > lockdep_assert_irqs_disabled();
> >
> > + perf_guest_enter();
> > +
> > apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR |
> > (guest_lvtpc & APIC_LVT_MASKED));
> > }
> > @@ -715,6 +717,8 @@ void x86_perf_guest_exit(void)
> > lockdep_assert_irqs_disabled();
> >
> > apic_write(APIC_LVTPC, APIC_DM_NMI);
> > +
> > + perf_guest_exit();
> > }
> > EXPORT_SYMBOL_GPL(x86_perf_guest_exit);
>
> *sigh*.. why does this patch exist? Please merge with the one that
> introduces these functions.
>
> This is making review really hard.
Ah, right. This function should be added immediately after commit "perf:
x86: Add x86 function to switch PMI handler". It was just the development
mindset: "how can we call perf_guest_{enter,exit}() if KVM has not
implemented anything?" So we deferred the invocation until this point :)
Will fix that in the next version.
Thanks.
-Mingwei
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 39/54] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (37 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 38/54] KVM: x86/pmu: Call perf_guest_enter() at PMU context switch Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 40/54] KVM: x86/pmu: Introduce PMU operator to increment counter Mingwei Zhang
` (15 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Add the correct PMU context switch at the VM-entry/exit boundary.
Signed-off-by: Xiong Zhang <xiong.y.zhang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/x86.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3152587eca5b..733c1acdc8c1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11009,6 +11009,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
set_debugreg(0, 7);
}
+ if (is_passthrough_pmu_enabled(vcpu))
+ kvm_pmu_restore_pmu_context(vcpu);
+
guest_timing_enter_irqoff();
for (;;) {
@@ -11037,6 +11040,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
++vcpu->stat.exits;
}
+ if (is_passthrough_pmu_enabled(vcpu))
+ kvm_pmu_save_pmu_context(vcpu);
+
/*
* Do this here before restoring debug registers on the host. And
* since we do this before handling the vmexit, a DR access vmexit
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 40/54] KVM: x86/pmu: Introduce PMU operator to increment counter
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (38 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 39/54] KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 41/54] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
` (14 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Introduce a PMU operator to increment a counter because in the passthrough
PMU there is no common backend implementation such as the host perf API.
Having a PMU operator for counter increment and overflow checking helps hide
architectural differences.
So introduce the operator function to make it convenient for the passthrough
PMU to synthesize a PMI.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 12 ++++++++++++
3 files changed, 14 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 1a848ba6a7a7..72ca78df8d2b 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -27,6 +27,7 @@ KVM_X86_PMU_OP_OPTIONAL(cleanup)
KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
+KVM_X86_PMU_OP_OPTIONAL(incr_counter)
#undef KVM_X86_PMU_OP
#undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 9cde62f3988e..325f17673a00 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -44,6 +44,7 @@ struct kvm_pmu_ops {
void (*passthrough_pmu_msrs)(struct kvm_vcpu *vcpu);
void (*save_pmu_context)(struct kvm_vcpu *vcpu);
void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
+ bool (*incr_counter)(struct kvm_pmc *pmc);
const u64 EVENTSEL_EVENT;
const int MAX_NR_GP_COUNTERS;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 6db759147896..485bbccf503a 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -74,6 +74,17 @@ static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
}
}
+static bool intel_incr_counter(struct kvm_pmc *pmc)
+{
+ pmc->counter += 1;
+ pmc->counter &= pmc_bitmask(pmc);
+
+ if (!pmc->counter)
+ return true;
+
+ return false;
+}
+
static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
unsigned int idx, u64 *mask)
{
@@ -885,6 +896,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
.passthrough_pmu_msrs = intel_passthrough_pmu_msrs,
.save_pmu_context = intel_save_guest_pmu_context,
.restore_pmu_context = intel_restore_guest_pmu_context,
+ .incr_counter = intel_incr_counter,
.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = 1,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 41/54] KVM: x86/pmu: Introduce PMU operator for setting counter overflow
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (39 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 40/54] KVM: x86/pmu: Introduce PMU operator to increment counter Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
` (13 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Introduce a PMU operator for setting counter overflow. When emulating
counter increments, multiple counters could overflow at the same time, i.e.,
during the execution of the same instruction. In the passthrough PMU, having
a PMU operator makes it convenient to update the PMU global status in one
shot, with the details hidden behind the vendor-specific implementation.
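For illustration, a non-empty Intel implementation might look roughly like
the sketch below (an assumption, not the posted code, which leaves
intel_set_overflow() empty in this revision): propagate the accumulated
overflow bits into the hardware via MSR_CORE_PERF_GLOBAL_STATUS_SET so the
guest observes them.
static void sketch_intel_set_overflow(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	/*
	 * Make the emulated overflows visible in the guest's view of
	 * IA32_PERF_GLOBAL_STATUS in one shot.
	 */
	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
}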
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/include/asm/kvm-x86-pmu-ops.h | 1 +
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 5 +++++
3 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 72ca78df8d2b..bd5b118a5ce5 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -28,6 +28,7 @@ KVM_X86_PMU_OP_OPTIONAL(passthrough_pmu_msrs)
KVM_X86_PMU_OP_OPTIONAL(save_pmu_context)
KVM_X86_PMU_OP_OPTIONAL(restore_pmu_context)
KVM_X86_PMU_OP_OPTIONAL(incr_counter)
+KVM_X86_PMU_OP_OPTIONAL(set_overflow)
#undef KVM_X86_PMU_OP
#undef KVM_X86_PMU_OP_OPTIONAL
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 325f17673a00..78a7f0c5f3ba 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -45,6 +45,7 @@ struct kvm_pmu_ops {
void (*save_pmu_context)(struct kvm_vcpu *vcpu);
void (*restore_pmu_context)(struct kvm_vcpu *vcpu);
bool (*incr_counter)(struct kvm_pmc *pmc);
+ void (*set_overflow)(struct kvm_vcpu *vcpu);
const u64 EVENTSEL_EVENT;
const int MAX_NR_GP_COUNTERS;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 485bbccf503a..aac09eb9af0e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -881,6 +881,10 @@ static void intel_restore_guest_pmu_context(struct kvm_vcpu *vcpu)
wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl_hw);
}
+static void intel_set_overflow(struct kvm_vcpu *vcpu)
+{
+}
+
struct kvm_pmu_ops intel_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = intel_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = intel_msr_idx_to_pmc,
@@ -897,6 +901,7 @@ struct kvm_pmu_ops intel_pmu_ops __initdata = {
.save_pmu_context = intel_save_guest_pmu_context,
.restore_pmu_context = intel_restore_guest_pmu_context,
.incr_counter = intel_incr_counter,
+ .set_overflow = intel_set_overflow,
.EVENTSEL_EVENT = ARCH_PERFMON_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_INTEL_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = 1,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (40 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 41/54] KVM: x86/pmu: Introduce PMU operator for setting counter overflow Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-08 18:28 ` Chen, Zide
2024-05-06 5:30 ` [PATCH v2 43/54] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
` (12 subsequent siblings)
54 siblings, 1 reply; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Implement emulated counter increment for the passthrough PMU under
KVM_REQ_PMU.
Defer the counter increment to the KVM_REQ_PMU handler because counter
increment requests come from kvm_pmu_trigger_event(), which can be triggered
within the KVM_RUN inner loop or outside of it. This means the counter
increment could happen before or after the PMU context switch.
So processing counter increments in one place keeps the implementation
simple.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.c | 50 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 47 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a12012a00c11..06e70f74559d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -510,6 +510,18 @@ static int reprogram_counter(struct kvm_pmc *pmc)
eventsel & ARCH_PERFMON_EVENTSEL_INT);
}
+static void kvm_pmu_handle_event_in_passthrough_pmu(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ static_call_cond(kvm_x86_pmu_set_overflow)(vcpu);
+
+ if (atomic64_read(&pmu->__reprogram_pmi)) {
+ kvm_make_request(KVM_REQ_PMI, vcpu);
+ atomic64_set(&pmu->__reprogram_pmi, 0ull);
+ }
+}
+
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
{
DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
@@ -517,6 +529,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
struct kvm_pmc *pmc;
int bit;
+ if (is_passthrough_pmu_enabled(vcpu))
+ return kvm_pmu_handle_event_in_passthrough_pmu(vcpu);
+
bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX);
/*
@@ -848,6 +863,17 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
kvm_pmu_reset(vcpu);
}
+static void kvm_passthrough_pmu_incr_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc)
+{
+ if (static_call(kvm_x86_pmu_incr_counter)(pmc)) {
+ __set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->global_status);
+ kvm_make_request(KVM_REQ_PMU, vcpu);
+
+ if (pmc->eventsel & ARCH_PERFMON_EVENTSEL_INT)
+ set_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
+ }
+}
+
static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
{
pmc->emulated_counter++;
@@ -880,11 +906,13 @@ static inline bool cpl_is_matched(struct kvm_pmc *pmc)
return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os : select_user;
}
-void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
+static void __kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel,
+ bool is_passthrough)
{
DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ bool is_pmc_allowed;
int i;
BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);
@@ -896,6 +924,12 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
return;
kvm_for_each_pmc(pmu, pmc, i, bitmap) {
+ if (is_passthrough)
+ is_pmc_allowed = pmc_speculative_in_use(pmc) &&
+ check_pmu_event_filter(pmc);
+ else
+ is_pmc_allowed = pmc_event_is_allowed(pmc);
+
/*
* Ignore checks for edge detect (all events currently emulated
* but KVM are always rising edges), pin control (unsupported
@@ -911,12 +945,22 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
* could ignoring them, so do the simple thing for now.
*/
if (((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB) ||
- !pmc_event_is_allowed(pmc) || !cpl_is_matched(pmc))
+ !is_pmc_allowed || !cpl_is_matched(pmc))
continue;
- kvm_pmu_incr_counter(pmc);
+ if (is_passthrough)
+ kvm_passthrough_pmu_incr_counter(vcpu, pmc);
+ else
+ kvm_pmu_incr_counter(pmc);
}
}
+
+void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
+{
+ bool is_passthrough = is_passthrough_pmu_enabled(vcpu);
+
+ __kvm_pmu_trigger_event(vcpu, eventsel, is_passthrough);
+}
EXPORT_SYMBOL_GPL(kvm_pmu_trigger_event);
static bool is_masked_filter_valid(const struct kvm_x86_pmu_event_filter *filter)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* Re: [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
2024-05-06 5:30 ` [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
@ 2024-05-08 18:28 ` Chen, Zide
2024-05-09 1:11 ` Mi, Dapeng
0 siblings, 1 reply; 116+ messages in thread
From: Chen, Zide @ 2024-05-08 18:28 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Dapeng Mi, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/5/2024 10:30 PM, Mingwei Zhang wrote:
> @@ -896,6 +924,12 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
> return;
>
> kvm_for_each_pmc(pmu, pmc, i, bitmap) {
> + if (is_passthrough)
> + is_pmc_allowed = pmc_speculative_in_use(pmc) &&
> + check_pmu_event_filter(pmc);
> + else
> + is_pmc_allowed = pmc_event_is_allowed(pmc);
> +
Why doesn't pmc_is_globally_enabled() need to be checked in the PMU
passthrough case? Sorry if I missed something.
^ permalink raw reply [flat|nested] 116+ messages in thread* Re: [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
2024-05-08 18:28 ` Chen, Zide
@ 2024-05-09 1:11 ` Mi, Dapeng
2024-05-30 4:20 ` Mingwei Zhang
0 siblings, 1 reply; 116+ messages in thread
From: Mi, Dapeng @ 2024-05-09 1:11 UTC (permalink / raw)
To: Chen, Zide, Mingwei Zhang, Sean Christopherson, Paolo Bonzini,
Xiong Zhang, Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv, Yanfei Xu,
maobibo, Like Xu, Peter Zijlstra, kvm, linux-perf-users
On 5/9/2024 2:28 AM, Chen, Zide wrote:
>
> On 5/5/2024 10:30 PM, Mingwei Zhang wrote:
>> @@ -896,6 +924,12 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
>> return;
>>
>> kvm_for_each_pmc(pmu, pmc, i, bitmap) {
>> + if (is_passthrough)
>> + is_pmc_allowed = pmc_speculative_in_use(pmc) &&
>> + check_pmu_event_filter(pmc);
>> + else
>> + is_pmc_allowed = pmc_event_is_allowed(pmc);
>> +
> Why don't need to check pmc_is_globally_enabled() in PMU passthrough
> case? Sorry if I missed something.
Not sure if it's for a historical reason. Since pmu->global_ctrl is updated
on each VM-exit right now, we may not need to skip
pmc_is_globally_enabled() anymore. Need Mingwei to confirm.
^ permalink raw reply [flat|nested] 116+ messages in thread* Re: [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
2024-05-09 1:11 ` Mi, Dapeng
@ 2024-05-30 4:20 ` Mingwei Zhang
0 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 4:20 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Chen, Zide, Sean Christopherson, Paolo Bonzini, Xiong Zhang,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Stephane Eranian, Ian Rogers, Namhyung Kim, gce-passthrou-pmu-dev,
Samantha Alt, Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu,
Peter Zijlstra, kvm, linux-perf-users
On Thu, May 09, 2024, Mi, Dapeng wrote:
>
> On 5/9/2024 2:28 AM, Chen, Zide wrote:
> >
> > On 5/5/2024 10:30 PM, Mingwei Zhang wrote:
> >> @@ -896,6 +924,12 @@ void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu, u64 eventsel)
> >> return;
> >>
> >> kvm_for_each_pmc(pmu, pmc, i, bitmap) {
> >> + if (is_passthrough)
> >> + is_pmc_allowed = pmc_speculative_in_use(pmc) &&
> >> + check_pmu_event_filter(pmc);
> >> + else
> >> + is_pmc_allowed = pmc_event_is_allowed(pmc);
> >> +
> > Why don't need to check pmc_is_globally_enabled() in PMU passthrough
> > case? Sorry if I missed something.
>
> Not sure if it's because the historical reason. Since pmu->global_ctrl
> would be updated in each vm-exit right now, we may not need to skip
> pmc_is_globally_enabled() anymore. Need Mingwei to confirm.
>
Yeah, this is for a historical reason, and that is how it became a bug. I
will fix that in the next version.
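For reference, the discussed fix could be as simple as folding the
global-enable check back into the passthrough path, e.g. (a sketch of the
intent, not the posted code; the helper name is made up):
static bool sketch_passthrough_pmc_is_allowed(struct kvm_pmc *pmc)
{
	/*
	 * Also honor PERF_GLOBAL_CTRL so a globally disabled counter is
	 * not incremented by instruction emulation.
	 */
	return pmc_is_globally_enabled(pmc) &&
	       pmc_speculative_in_use(pmc) &&
	       check_pmu_event_filter(pmc);
}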
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH v2 43/54] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (41 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 42/54] KVM: x86/pmu: Implement emulated counter increment for passthrough PMU Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 44/54] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
` (11 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Update pmc_{read,write}_counter() to disconnect them from the perf API
because the passthrough PMU does not use the host PMU as a backend. Because
of that, pmc->counter directly contains the actual guest value when it is
set from the host (VMM) side.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.c | 5 +++++
arch/x86/kvm/pmu.h | 4 ++++
2 files changed, 9 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06e70f74559d..12330d3f92f4 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -322,6 +322,11 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
{
+ if (pmc_to_pmu(pmc)->passthrough) {
+ pmc->counter = val;
+ return;
+ }
+
/*
* Drop any unconsumed accumulated counts, the WRMSR is a write, not a
* read-modify-write. Adjust the counter value so that its value is
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 78a7f0c5f3ba..7e006cb61296 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -116,6 +116,10 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
{
u64 counter, enabled, running;
+ counter = pmc->counter;
+ if (pmc_to_pmu(pmc)->passthrough)
+ return counter & pmc_bitmask(pmc);
+
counter = pmc->counter + pmc->emulated_counter;
if (pmc->perf_event && !pmc->is_paused)
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 44/54] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (42 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 43/54] KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 45/54] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
` (10 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Disconnect the counter reprogram logic because the passthrough PMU never
uses the host PMU, nor does it use the perf API for anything. Instead, when
the passthrough PMU is enabled, touching any part of the counter reprogram
path should be treated as an error.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/kvm/pmu.c | 5 +++++
arch/x86/kvm/pmu.h | 8 ++++++++
2 files changed, 13 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 12330d3f92f4..da8b27f2b71d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -478,6 +478,11 @@ static int reprogram_counter(struct kvm_pmc *pmc)
bool emulate_overflow;
u8 fixed_ctr_ctrl;
+ if (pmu->passthrough) {
+ pr_warn_once("Passthrough PMU never reprogram counter\n");
+ return 0;
+ }
+
emulate_overflow = pmc_pause_counter(pmc);
if (!pmc_event_is_allowed(pmc))
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7e006cb61296..10553bc1ae1d 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -256,6 +256,10 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
{
+ /* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
+ if (pmc_to_pmu(pmc)->passthrough)
+ return;
+
set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
@@ -264,6 +268,10 @@ static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
{
int bit;
+ /* Passthrough PMU never reprogram counters via KVM_REQ_PMU. */
+ if (pmu->passthrough)
+ return;
+
if (!diff)
return;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 45/54] KVM: nVMX: Add nested virtualization support for passthrough PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (43 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 44/54] KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 46/54] perf/x86/amd/core: Set passthrough capability for host Mingwei Zhang
` (9 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
Add nested virtualization support for the passthrough PMU by combining the
MSR interception bitmaps of vmcs01 and vmcs12. Readers may argue that even
without this patch, nested virtualization works for the passthrough PMU
because L1 will see PerfMon v2 and will have to use the legacy vPMU
implementation if it is Linux. However, any assumption made about L1 may be
invalid, e.g., L1 may not even be Linux.
If both L0 and L1 pass through the PMU MSRs, the correct behavior is to let
MSR accesses from L2 directly touch the hardware MSRs, since both L0 and L1
pass through the access.
However, in the current implementation, without adding anything for nested,
KVM always sets the MSR interception bits in vmcs02. As a result, L0 will
emulate all MSR reads/writes for L2, leading to errors, since the current
passthrough vPMU never implements set_msr() and get_msr() for any counter
access except accesses from the VMM side.
So fix the issue by setting up the correct MSR interception for the PMU
MSRs.
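For illustration, the merge rule described above boils down to the following
(a simplified sketch of the intent only; the real per-MSR logic lives in
nested_vmx_set_intercept_for_msr()):
/*
 * An MSR is passed through to L2 only when neither L0 (vmcs01) nor L1
 * (vmcs12) wants to intercept it; otherwise vmcs02 keeps the intercept set.
 */
static inline bool sketch_vmcs02_msr_intercepted(bool l0_intercepts,
						 bool l1_intercepts)
{
	return l0_intercepts || l1_intercepts;
}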
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
arch/x86/kvm/vmx/nested.c | 52 +++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d05ddf751491..558032663228 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -590,6 +590,55 @@ static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx,
msr_bitmap_l0, msr);
}
+/* Pass PMU MSRs to nested VM if L0 and L1 are set to passthrough. */
+static void nested_vmx_set_passthru_pmu_intercept_for_msr(struct kvm_vcpu *vcpu,
+ unsigned long *msr_bitmap_l1,
+ unsigned long *msr_bitmap_l0)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ int i;
+
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_ARCH_PERFMON_EVENTSEL0 + i,
+ MSR_TYPE_RW);
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_IA32_PERFCTR0 + i,
+ MSR_TYPE_RW);
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_IA32_PMC0 + i,
+ MSR_TYPE_RW);
+ }
+
+ for (i = 0; i < vcpu_to_pmu(vcpu)->nr_arch_fixed_counters; i++) {
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_CORE_PERF_FIXED_CTR0 + i,
+ MSR_TYPE_RW);
+ }
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_CORE_PERF_FIXED_CTR_CTRL,
+ MSR_TYPE_RW);
+
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_CORE_PERF_GLOBAL_STATUS,
+ MSR_TYPE_RW);
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_CORE_PERF_GLOBAL_CTRL,
+ MSR_TYPE_RW);
+ nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1,
+ msr_bitmap_l0,
+ MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+ MSR_TYPE_RW);
+}
+
/*
* Merge L0's and L1's MSR bitmap, return false to indicate that
* we do not use the hardware.
@@ -691,6 +740,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
+ if (is_passthrough_pmu_enabled(vcpu))
+ nested_vmx_set_passthru_pmu_intercept_for_msr(vcpu, msr_bitmap_l1, msr_bitmap_l0);
+
kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
vmx->nested.force_msr_bitmap_recalc = false;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 46/54] perf/x86/amd/core: Set passthrough capability for host
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (44 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 45/54] KVM: nVMX: Add nested virtualization support for " Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 47/54] KVM: x86/pmu/svm: Set passthrough capability for vcpus Mingwei Zhang
` (8 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Apply the PERF_PMU_CAP_PASSTHROUGH_VPMU flag for version 2 and later
implementations of the core PMU. Aside from having Global Control and
Status registers, virtualizing the PMU using the passthrough model
requires an interface to set or clear the overflow bits in the Global
Status MSRs while restoring or saving the PMU context of a vCPU.
PerfMonV2-capable hardware has additional MSRs for this purpose, namely
PerfCntrGlobalStatusSet and PerfCntrGlobalStatusClr, making it suitable for
use with the passthrough PMU.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/events/amd/core.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 985ef3b47919..7a91c991779c 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1395,6 +1395,9 @@ static int __init amd_core_pmu_init(void)
amd_pmu_global_cntr_mask = (1ULL << x86_pmu.num_counters) - 1;
+ x86_pmu.flags |= PMU_FL_PASSTHROUGH;
+ x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_PASSTHROUGH_VPMU;
+
/* Update PMC handling functions */
x86_pmu.enable_all = amd_pmu_v2_enable_all;
x86_pmu.disable_all = amd_pmu_v2_disable_all;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 47/54] KVM: x86/pmu/svm: Set passthrough capability for vcpus
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (45 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 46/54] perf/x86/amd/core: Set passthrough capability for host Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 48/54] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter Mingwei Zhang
` (7 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Pass on the passthrough PMU setting from kvm->arch into kvm_pmu for each
vcpu. As long as the host supports PerfMonV2, the guest PMU version does
not matter.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/svm/pmu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 447657513729..385478103f65 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -211,6 +211,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->nr_arch_fixed_counters = 0;
bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+ pmu->passthrough = vcpu->kvm->arch.enable_passthrough_pmu;
if (guest_cpuid_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
--
2.45.0.rc1.225.g2a3ae87e7f-goog
^ permalink raw reply related [flat|nested] 116+ messages in thread* [PATCH v2 48/54] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (46 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 47/54] KVM: x86/pmu/svm: Set passthrough capability for vcpus Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 49/54] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
` (6 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Since the passthrough PMU can also be used on some AMD platforms, set the
"enable_passthrough_pmu" KVM kernel module parameter to true when the
following conditions are met:
- the parameter is set to true when the module is loaded
- enable_pmu is true
- KVM is running on an AMD CPU
- the CPU supports PerfMonV2
- the host PMU supports passthrough mode
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/pmu.h | 20 ++++++++++++--------
arch/x86/kvm/svm/svm.c | 2 ++
2 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 10553bc1ae1d..02ea90431033 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -196,6 +196,7 @@ extern struct kvm_pmu_emulated_event_selectors kvm_pmu_eventsel;
static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
{
bool is_intel = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL;
+ bool is_amd = boot_cpu_data.x86_vendor == X86_VENDOR_AMD;
int min_nr_gp_ctrs = pmu_ops->MIN_NR_GP_COUNTERS;
/*
@@ -223,18 +224,21 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
enable_pmu = false;
}
- /* Pass-through vPMU is only supported in Intel CPUs. */
- if (!is_intel)
+ /* Pass-through vPMU is only supported in Intel and AMD CPUs. */
+ if (!is_intel && !is_amd)
enable_passthrough_pmu = false;
/*
- * Pass-through vPMU requires at least PerfMon version 4 because the
- * implementation requires the usage of MSR_CORE_PERF_GLOBAL_STATUS_SET
- * for counter emulation as well as PMU context switch. In addition, it
- * requires host PMU support on passthrough mode. Disable pass-through
- * vPMU if any condition fails.
+ * On Intel platforms, pass-through vPMU requires at least PerfMon
+ * version 4 because the implementation requires the usage of
+ * MSR_CORE_PERF_GLOBAL_STATUS_SET for counter emulation as well as
+ * PMU context switch. In addition, it requires host PMU support on
+ * passthrough mode. Disable pass-through vPMU if any condition fails.
+ *
+ * On AMD platforms, pass-through vPMU requires at least PerfMonV2
+ * because MSR_PERF_CNTR_GLOBAL_STATUS_SET is required.
*/
- if (!enable_pmu || kvm_pmu_cap.version < 4 || !kvm_pmu_cap.passthrough)
+ if (!enable_pmu || !kvm_pmu_cap.passthrough || (is_intel && kvm_pmu_cap.version < 4) || (is_amd && kvm_pmu_cap.version < 2))
enable_passthrough_pmu = false;
if (!enable_pmu) {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d1a9f9951635..88648b3a9cdd 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -238,6 +238,8 @@ module_param(intercept_smi, bool, 0444);
bool vnmi = true;
module_param(vnmi, bool, 0444);
+module_param(enable_passthrough_pmu, bool, 0444);
+
static bool svm_gp_erratum_intercept = true;
static u8 rsm_ins_bytes[] = "\x0f\xaa";
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 49/54] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (47 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 48/54] KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 50/54] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
` (5 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
If the passthrough PMU is enabled and all counters are exposed to the guest,
clear the RDPMC interception bit in the VMCB Control Area (byte offset 0xc,
bit 15) to let RDPMC instructions proceed without VM-Exits. This improves
guest PMU performance in passthrough mode. If either condition is not
satisfied, intercept RDPMC and prevent the guest from accessing unexposed
counters.
Note that on AMD platforms, passing through RDPMC only allows guests to read
the general-purpose counters. Details about the RDPMC interception bit can be
found in Appendix B, "Layout of VMCB", of the AMD64 Architecture
Programmer's Manual Volume 2.
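As a rough sketch of the condition (illustration only; the series uses
kvm_pmu_check_rdpmc_passthrough() as shown in the diff below, and the helper
and parameter names here are assumed for the example):
  #include <stdbool.h>

  static bool rdpmc_passthrough_allowed(bool passthrough_pmu_enabled,
                                        unsigned int guest_gp_counters,
                                        unsigned int host_gp_counters)
  {
          /* Safe only when every hardware GP counter is exposed to the guest. */
          return passthrough_pmu_enabled &&
                 guest_gp_counters == host_gp_counters;
  }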
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/svm/svm.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 88648b3a9cdd..84dd1f560d0a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1223,6 +1223,11 @@ static inline void init_vmcb_after_set_cpuid(struct kvm_vcpu *vcpu)
/* No need to intercept these MSRs */
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 1, 1);
+
+ if (kvm_pmu_check_rdpmc_passthrough(vcpu))
+ svm_clr_intercept(svm, INTERCEPT_RDPMC);
+ else
+ svm_set_intercept(svm, INTERCEPT_RDPMC);
}
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 50/54] KVM: x86/pmu/svm: Implement callback to disable MSR interception
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (48 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 49/54] KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed to guest Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 51/54] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
` (4 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Implement the AMD-specific callback for the passthrough PMU that disables
interception of PMU-related MSRs if the guest PMU counters meet the
requirements for passthrough. The affected PMU registers are the following.
- PerfCntrGlobalStatus (MSR 0xc0000300)
- PerfCntrGlobalCtl (MSR 0xc0000301)
- PerfCntrGlobalStatusClr (MSR 0xc0000302)
- PerfCntrGlobalStatusSet (MSR 0xc0000303)
- PERF_CTLx and PERF_CTRx pairs (MSRs 0xc0010200..0xc001020b)
Note that the passthrough/interception setup is invoked after each CPUID
update. Since CPUID can be set multiple times, explicitly set or clear the
interception bitmap for each counter as well as for the global registers.
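For reference, a small standalone illustration of how the interleaved
PERF_CTLx/PERF_CTRx MSR numbers listed above are derived for counter i (the
constants mirror the MSR_F15H_PERF_CTL/CTR definitions; the helper names are
made up for this sketch):
  #define MSR_F15H_PERF_CTL       0xc0010200
  #define MSR_F15H_PERF_CTR       0xc0010201

  static inline unsigned int amd_perf_ctl_msr(int i)
  {
          return MSR_F15H_PERF_CTL + 2 * i;     /* 0xc0010200, 0xc0010202, ... */
  }

  static inline unsigned int amd_perf_ctr_msr(int i)
  {
          return MSR_F15H_PERF_CTR + 2 * i;     /* 0xc0010201, 0xc0010203, ... */
  }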
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/svm/pmu.c | 44 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 385478103f65..2ad62b8ac2c2 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -247,6 +247,49 @@ static bool amd_is_rdpmc_passthru_allowed(struct kvm_vcpu *vcpu)
return true;
}
+static void amd_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int msr_clear = !!(is_passthrough_pmu_enabled(vcpu));
+ int i;
+
+ for (i = 0; i < kvm_pmu_cap.num_counters_gp; i++) {
+ /*
+ * PERF_CTLx registers require interception in order to clear
+ * HostOnly bit and set GuestOnly bit. This is to prevent the
+ * PERF_CTRx registers from counting before VM entry and after
+ * VM exit.
+ */
+ set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTL + 2 * i, 0, 0);
+
+ /*
+ * Pass through counters exposed to the guest and intercept
+ * counters that are unexposed. Do this explicitly since this
+ * function may be set multiple times before vcpu runs.
+ */
+ if (i >= pmu->nr_arch_gp_counters)
+ msr_clear = 0;
+ set_msr_interception(vcpu, svm->msrpm, MSR_F15H_PERF_CTR + 2 * i, msr_clear, msr_clear);
+ }
+
+ /*
+ * In mediated passthrough vPMU, intercept global PMU MSRs when guest
+ * PMU only owns a subset of counters provided in HW or its version is
+ * less than 2.
+ */
+ if (is_passthrough_pmu_enabled(vcpu) && pmu->version > 1 &&
+ pmu->nr_arch_gp_counters == kvm_pmu_cap.num_counters_gp)
+ msr_clear = 1;
+ else
+ msr_clear = 0;
+
+ set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_CTL, msr_clear, msr_clear);
+ set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, msr_clear, msr_clear);
+ set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, msr_clear, msr_clear);
+ set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, msr_clear, msr_clear);
+}
+
struct kvm_pmu_ops amd_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -257,6 +300,7 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
.refresh = amd_pmu_refresh,
.init = amd_pmu_init,
.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
+ .passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 51/54] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (49 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 50/54] KVM: x86/pmu/svm: Implement callback to disable MSR interception Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 52/54] KVM: x86/pmu/svm: Add registers to direct access list Mingwei Zhang
` (3 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
On AMD platforms, there is no way to restore PerfCntrGlobalCtl at
VM-Entry or clear it at VM-Exit. Since the register state is restored
before entering and saved after exiting guest context, the counters can
keep ticking, and even overflow, while still in host context, leading to
unwanted counting on the host.
To avoid this, the PERF_CTLx MSRs (event selectors) are always
intercepted. KVM will always set the GuestOnly bit and clear the
HostOnly bit so that the counters run only in guest context even if
their enable bits are set. Intercepting these MSRs is also necessary
for guest event filtering.
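A minimal sketch of the transform applied to a guest event-selector write,
assuming the GuestOnly/HostOnly bit positions from the kernel's
AMD64_EVENTSEL_{GUESTONLY,HOSTONLY} definitions (bits 40 and 41 of
PerfEvtSeln); this is illustrative only, not the patch itself:
  #include <stdint.h>

  #define AMD64_EVENTSEL_GUESTONLY        (1ULL << 40)
  #define AMD64_EVENTSEL_HOSTONLY         (1ULL << 41)

  static inline uint64_t guest_eventsel_to_hw(uint64_t data)
  {
          data &= ~AMD64_EVENTSEL_HOSTONLY;        /* never restrict to host */
          return data | AMD64_EVENTSEL_GUESTONLY;  /* count only in guest context */
  }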
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/svm/pmu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 2ad62b8ac2c2..bed0acfaf34d 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -165,7 +165,12 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
data &= ~pmu->reserved_bits;
if (data != pmc->eventsel) {
pmc->eventsel = data;
- kvm_pmu_request_counter_reprogram(pmc);
+ if (is_passthrough_pmu_enabled(vcpu)) {
+ data &= ~AMD64_EVENTSEL_HOSTONLY;
+ pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
+ } else {
+ kvm_pmu_request_counter_reprogram(pmc);
+ }
}
return 0;
}
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 52/54] KVM: x86/pmu/svm: Add registers to direct access list
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (50 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 51/54] KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest write to event selectors Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 53/54] KVM: x86/pmu/svm: Implement handlers to save and restore context Mingwei Zhang
` (2 subsequent siblings)
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Add all PMU-related MSRs to the list of possible direct-access MSRs.
Most of them will not be intercepted when the passthrough PMU is in use.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/svm/svm.c | 16 ++++++++++++++++
arch/x86/kvm/svm/svm.h | 2 +-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 84dd1f560d0a..ccc08c43f7fb 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -140,6 +140,22 @@ static const struct svm_direct_access_msrs {
{ .index = X2APIC_MSR(APIC_TMICT), .always = false },
{ .index = X2APIC_MSR(APIC_TMCCT), .always = false },
{ .index = X2APIC_MSR(APIC_TDCR), .always = false },
+ { .index = MSR_F15H_PERF_CTL0, .always = false },
+ { .index = MSR_F15H_PERF_CTR0, .always = false },
+ { .index = MSR_F15H_PERF_CTL1, .always = false },
+ { .index = MSR_F15H_PERF_CTR1, .always = false },
+ { .index = MSR_F15H_PERF_CTL2, .always = false },
+ { .index = MSR_F15H_PERF_CTR2, .always = false },
+ { .index = MSR_F15H_PERF_CTL3, .always = false },
+ { .index = MSR_F15H_PERF_CTR3, .always = false },
+ { .index = MSR_F15H_PERF_CTL4, .always = false },
+ { .index = MSR_F15H_PERF_CTR4, .always = false },
+ { .index = MSR_F15H_PERF_CTL5, .always = false },
+ { .index = MSR_F15H_PERF_CTR5, .always = false },
+ { .index = MSR_AMD64_PERF_CNTR_GLOBAL_CTL, .always = false },
+ { .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, .always = false },
+ { .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, .always = false },
+ { .index = MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, .always = false },
{ .index = MSR_INVALID, .always = false },
};
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7f1fbd874c45..beb552a9ab05 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -30,7 +30,7 @@
#define IOPM_SIZE PAGE_SIZE * 3
#define MSRPM_SIZE PAGE_SIZE * 2
-#define MAX_DIRECT_ACCESS_MSRS 47
+#define MAX_DIRECT_ACCESS_MSRS 63
#define MSRPM_OFFSETS 32
extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
extern bool npt_enabled;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 53/54] KVM: x86/pmu/svm: Implement handlers to save and restore context
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (51 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 52/54] KVM: x86/pmu/svm: Add registers to direct access list Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-06 5:30 ` [PATCH v2 54/54] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
2024-05-28 2:35 ` [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Ma, Yongwei
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Sandipan Das <sandipan.das@amd.com>
Implement the AMD-specific handlers to save and restore the state of
PMU-related MSRs when using passthrough PMU.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
---
arch/x86/kvm/svm/pmu.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index bed0acfaf34d..9629a172aa1b 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -295,6 +295,36 @@ static void amd_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
set_msr_interception(vcpu, svm->msrpm, MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, msr_clear, msr_clear);
}
+static void amd_save_pmu_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
+ wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+ rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, pmu->global_status);
+
+ /* Clear global status bits if non-zero */
+ if (pmu->global_status)
+ wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, pmu->global_status);
+}
+
+static void amd_restore_pmu_context(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u64 global_status;
+
+ wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, 0);
+ rdmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS, global_status);
+
+ /* Clear host global_status MSR if non-zero. */
+ if (global_status)
+ wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR, global_status);
+
+ wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET, pmu->global_status);
+
+ wrmsrl(MSR_AMD64_PERF_CNTR_GLOBAL_CTL, pmu->global_ctrl);
+}
+
struct kvm_pmu_ops amd_pmu_ops __initdata = {
.rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
.msr_idx_to_pmc = amd_msr_idx_to_pmc,
@@ -306,6 +336,8 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
.init = amd_pmu_init,
.is_rdpmc_passthru_allowed = amd_is_rdpmc_passthru_allowed,
.passthrough_pmu_msrs = amd_passthrough_pmu_msrs,
+ .save_pmu_context = amd_save_pmu_context,
+ .restore_pmu_context = amd_restore_pmu_context,
.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* [PATCH v2 54/54] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (52 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 53/54] KVM: x86/pmu/svm: Implement handlers to save and restore context Mingwei Zhang
@ 2024-05-06 5:30 ` Mingwei Zhang
2024-05-28 2:35 ` [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Ma, Yongwei
54 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-06 5:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
Mingwei Zhang, gce-passthrou-pmu-dev, Samantha Alt, Zhiyuan Lv,
Yanfei Xu, maobibo, Like Xu, Peter Zijlstra, kvm,
linux-perf-users
From: Manali Shukla <manali.shukla@amd.com>
With the passthrough PMU enabled, the PERF_CTLx MSRs (event selectors) are
always intercepted, so the event filter check can be done directly inside
amd_pmu_set_msr().
Add a check that allows writes to the event selectors of GP counters if and
only if the event is allowed by the filter.
Signed-off-by: Manali Shukla <manali.shukla@amd.com>
---
arch/x86/kvm/svm/pmu.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 9629a172aa1b..cb6d3bfdd588 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -166,6 +166,15 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (data != pmc->eventsel) {
pmc->eventsel = data;
if (is_passthrough_pmu_enabled(vcpu)) {
+ if (!check_pmu_event_filter(pmc)) {
+ /*
+ * When the guest requests an event that is
+ * not allowed by the filter, stop the counter
+ * by clearing the event selector MSR.
+ */
+ pmc->eventsel_hw = 0;
+ return 0;
+ }
data &= ~AMD64_EVENTSEL_HOSTONLY;
pmc->eventsel_hw = data | AMD64_EVENTSEL_GUESTONLY;
} else {
--
2.45.0.rc1.225.g2a3ae87e7f-goog
* RE: [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86
2024-05-06 5:29 [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Mingwei Zhang
` (53 preceding siblings ...)
2024-05-06 5:30 ` [PATCH v2 54/54] KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough PMU Mingwei Zhang
@ 2024-05-28 2:35 ` Ma, Yongwei
2024-05-30 4:28 ` Mingwei Zhang
54 siblings, 1 reply; 116+ messages in thread
From: Ma, Yongwei @ 2024-05-28 2:35 UTC (permalink / raw)
To: Mingwei Zhang, Sean Christopherson, Paolo Bonzini, Zhang, Xiong Y,
Dapeng Mi, Liang, Kan, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Eranian, Stephane, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev@google.com, Alt, Samantha, Lv, Zhiyuan,
Xu, Yanfei, maobibo, Like Xu, Peter Zijlstra, kvm@vger.kernel.org,
linux-perf-users@vger.kernel.org
> In this version, we added the mediated passthrough vPMU support for AMD.
> This is 1st version that comes up with a full x86 support on the vPMU new
> design.
>
> Major changes:
> - AMD support integration. Supporting guest PerfMon v1 and v2.
> - Ensure !exclude_guest events only exist prior to mediate passthrough
> vPMU loaded. [sean]
> - Update PMU MSR interception according to exposed counters and pmu
> version. [mingwei reported pmu_counters_test fails]
> - Enforce RDPMC interception unless all counters exposed to guest. This
> removes a hack in RFCv1 where we pass through RDPMC and zero
> unexposed counters. [jim/sean]
> - Combine the PMU context switch for both AMD and Intel.
> - Because of RDPMC interception, update PMU context switch code by
> removing the "zeroing out" logic when restoring the guest context.
> [jim/sean: intercept rdpmc]
>
> Minor changes:
> - Flip enable_passthrough_pmu to false and change to a vendor param.
> - Remove "Intercept full-width GP counter MSRs by checking with perf
> capabilities".
> - Remove the write to pmc patch.
> - Move host_perf_cap as an independent variable, will update after
> https://lore.kernel.org/all/20240423221521.2923759-1-
> seanjc@google.com/
>
> TODOs:
> - Simplify enabling code for mediated passthrough vPMU.
> - Further optimization on PMU context switch.
>
> On-going discussions:
> - Final name of mediated passthrough vPMU.
> - PMU context switch optimizations.
>
> Testing:
> - Testcases:
> - selftest: pmu_counters_test
> - selftest: pmu_event_filter_test
> - kvm-unit-tests: pmu
> - qemu based ubuntu 20.04 (guest kernel: 5.10 and 6.7.9)
> - Platforms:
> - genoa
> - skylake
> - icelake
> - sapphirerapids
> - emeraldrapids
>
> Ongoing Issues:
> - AMD platform [milan]:
> - ./pmu_event_filter_test error:
> - test_amd_deny_list: Branch instructions retired = 44 (expected 42)
> - test_without_filter: Branch instructions retired = 44 (expected 42)
> - test_member_allow_list: Branch instructions retired = 44 (expected 42)
> - test_not_member_deny_list: Branch instructions retired = 44 (expected
> 42)
> - Intel platform [skylake]:
> - kvm-unit-tests/pmu fails with two errors:
> - FAIL: Intel: TSX cycles: gp cntr-3 with a value of 0
> - FAIL: Intel: full-width writes: TSX cycles: gp cntr-3 with a value of 0
>
> Installation guidance:
> - echo 0 > /proc/sys/kernel/nmi_watchdog
> - modprobe kvm_{amd,intel} enable_passthrough_pmu=Y 2>/dev/null
>
> v1: https://lore.kernel.org/all/20240126085444.324918-1-
> xiong.y.zhang@linux.intel.com/
>
>
> Dapeng Mi (3):
> x86/msr: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET
> KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
> KVM: x86/pmu: Add intel_passthrough_pmu_msrs() to pass-through PMU
> MSRs
>
> Kan Liang (3):
> perf: Support get/put passthrough PMU interfaces
> perf: Add generic exclude_guest support
> perf/x86/intel: Support PERF_PMU_CAP_PASSTHROUGH_VPMU
>
> Manali Shukla (1):
> KVM: x86/pmu/svm: Wire up PMU filtering functionality for passthrough
> PMU
>
> Mingwei Zhang (24):
> perf: core/x86: Forbid PMI handler when guest own PMU
> perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
> x86_pmu_cap
> KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter
> KVM: x86/pmu: Plumb through pass-through PMU to vcpu for Intel CPUs
> KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
> KVM: x86/pmu: Add host_perf_cap and initialize it in
> kvm_x86_vendor_init()
> KVM: x86/pmu: Allow RDPMC pass through when all counters exposed to
> guest
> KVM: x86/pmu: Introduce PMU operator to check if rdpmc passthrough
> allowed
> KVM: x86/pmu: Create a function prototype to disable MSR interception
> KVM: x86/pmu: Avoid legacy vPMU code when accessing global_ctrl in
> passthrough vPMU
> KVM: x86/pmu: Exclude PMU MSRs in vmx_get_passthrough_msr_slot()
> KVM: x86/pmu: Add counter MSR and selector MSR index into struct
> kvm_pmc
> KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
> context
> KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
> KVM: x86/pmu: Make check_pmu_event_filter() an exported function
> KVM: x86/pmu: Allow writing to event selector for GP counters if event
> is allowed
> KVM: x86/pmu: Allow writing to fixed counter selector if counter is
> exposed
> KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
> KVM: x86/pmu: Introduce PMU operator to increment counter
> KVM: x86/pmu: Introduce PMU operator for setting counter overflow
> KVM: x86/pmu: Implement emulated counter increment for passthrough
> PMU
> KVM: x86/pmu: Update pmc_{read,write}_counter() to disconnect perf API
> KVM: x86/pmu: Disconnect counter reprogram logic from passthrough PMU
> KVM: nVMX: Add nested virtualization support for passthrough PMU
>
> Sandipan Das (11):
> KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
> x86/msr: Define PerfCntrGlobalStatusSet register
> KVM: x86/pmu: Always set global enable bits in passthrough mode
> perf/x86/amd/core: Set passthrough capability for host
> KVM: x86/pmu/svm: Set passthrough capability for vcpus
> KVM: x86/pmu/svm: Set enable_passthrough_pmu module parameter
> KVM: x86/pmu/svm: Allow RDPMC pass through when all counters exposed
> to guest
> KVM: x86/pmu/svm: Implement callback to disable MSR interception
> KVM: x86/pmu/svm: Set GuestOnly bit and clear HostOnly bit when guest
> write to event selectors
> KVM: x86/pmu/svm: Add registers to direct access list
> KVM: x86/pmu/svm: Implement handlers to save and restore context
>
> Sean Christopherson (2):
> KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at
> "RESET"
> KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel
> compatible
>
> Xiong Zhang (10):
> perf: core/x86: Register a new vector for KVM GUEST PMI
> KVM: x86: Extract x86_set_kvm_irq_handler() function
> KVM: x86/pmu: Register guest pmi handler for emulated PMU
> perf: x86: Add x86 function to switch PMI handler
> KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
> KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
> KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
> KVM: x86/pmu: Grab x86 core PMU for passthrough PMU VM
> KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
> KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
>
> arch/x86/events/amd/core.c | 3 +
> arch/x86/events/core.c | 41 ++++-
> arch/x86/events/intel/core.c | 6 +
> arch/x86/events/perf_event.h | 1 +
> arch/x86/include/asm/hardirq.h | 1 +
> arch/x86/include/asm/idtentry.h | 1 +
> arch/x86/include/asm/irq.h | 2 +-
> arch/x86/include/asm/irq_vectors.h | 5 +-
> arch/x86/include/asm/kvm-x86-pmu-ops.h | 6 +
> arch/x86/include/asm/kvm_host.h | 10 ++
> arch/x86/include/asm/msr-index.h | 2 +
> arch/x86/include/asm/perf_event.h | 4 +
> arch/x86/include/asm/vmx.h | 1 +
> arch/x86/kernel/idt.c | 1 +
> arch/x86/kernel/irq.c | 36 ++++-
> arch/x86/kvm/cpuid.c | 4 +
> arch/x86/kvm/cpuid.h | 10 ++
> arch/x86/kvm/lapic.c | 3 +-
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/pmu.c | 168 ++++++++++++++++++-
> arch/x86/kvm/pmu.h | 47 ++++++
> arch/x86/kvm/svm/pmu.c | 112 ++++++++++++-
> arch/x86/kvm/svm/svm.c | 23 +++
> arch/x86/kvm/svm/svm.h | 2 +-
> arch/x86/kvm/vmx/capabilities.h | 1 +
> arch/x86/kvm/vmx/nested.c | 52 ++++++
> arch/x86/kvm/vmx/pmu_intel.c | 192 ++++++++++++++++++++--
> arch/x86/kvm/vmx/vmx.c | 197 +++++++++++++++++++----
> arch/x86/kvm/vmx/vmx.h | 3 +-
> arch/x86/kvm/x86.c | 47 +++++-
> arch/x86/kvm/x86.h | 1 +
> include/linux/perf_event.h | 18 +++
> kernel/events/core.c | 176 ++++++++++++++++++++
> tools/arch/x86/include/asm/irq_vectors.h | 3 +-
> 34 files changed, 1120 insertions(+), 61 deletions(-)
>
>
> base-commit: fec50db7033ea478773b159e0e2efb135270e3b7
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>
Hi Mingwei,
Regarding the ongoing issue you mentioned on the Intel Skylake platform, I tried to reproduce it. However, these two cases pass on my Skylake machine. Could you double-check with the latest kvm-unit-tests or share your SKL CPU model with me?
CPU model on my SKL:
'Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz'.
Passthrough PMU status:
$cat /sys/module/kvm_intel/parameters/enable_passthrough_pmu
$Y
Kvm-unit-tests:
https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
Result:
PASS: Intel: TSX cycles: gp cntr-3 with a value of 37
PASS: Intel: full-width writes: TSX cycles: gp cntr-3 with a value of 36
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Thanks and Best Regards,
Yongwei Ma
* Re: [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86
2024-05-28 2:35 ` [PATCH v2 00/54] Mediated Passthrough vPMU 2.0 for x86 Ma, Yongwei
@ 2024-05-30 4:28 ` Mingwei Zhang
0 siblings, 0 replies; 116+ messages in thread
From: Mingwei Zhang @ 2024-05-30 4:28 UTC (permalink / raw)
To: Ma, Yongwei
Cc: Sean Christopherson, Paolo Bonzini, Zhang, Xiong Y, Dapeng Mi,
Liang, Kan, Zhenyu Wang, Manali Shukla, Sandipan Das, Jim Mattson,
Eranian, Stephane, Ian Rogers, Namhyung Kim,
gce-passthrou-pmu-dev@google.com, Alt, Samantha, Lv, Zhiyuan,
Xu, Yanfei, maobibo, Like Xu, Peter Zijlstra, kvm@vger.kernel.org,
linux-perf-users@vger.kernel.org
On Tue, May 28, 2024, Ma, Yongwei wrote:
> > [...]
> Hi Mingwei,
> Regarding the ongoing issue you mentioned on the Intel Skylake platform, I tried to reproduce it. However, these two cases pass on my Skylake machine. Could you double-check with the latest kvm-unit-tests or share your SKL CPU model with me?
>
> CPU model on my SKL :
> 'Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz'.
> Passthrough PMU status:
> $cat /sys/module/kvm_intel/parameters/enable_passthrough_pmu
> $Y
> Kvm-unit-tests:
> https://gitlab.com/kvm-unit-tests/kvm-unit-tests.git
> Result:
> PASS: Intel: TSX cycles: gp cntr-3 with a value of 37
> PASS: Intel: full-width writes: TSX cycles: gp cntr-3 with a value of 36
>
That is good to see. We (my colleague and I) can reproduce the issue
on my side. This might be related to the host setup rather than our
code. Will figure it out.
Thanks.
-Mingwei
> Tested-by: Yongwei Ma <yongwei.ma@intel.com>
>
> Thanks and Best Regards,
> Yongwei Ma