linux-doc.vger.kernel.org archive mirror
* [PATCH 00/17] ARM64 PMU Partitioning
From: Colton Lewis @ 2025-06-02 19:26 UTC
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Russell King, Catalin Marinas,
	Will Deacon, Marc Zyngier, Oliver Upton, Joey Gouly,
	Suzuki K Poulose, Zenghui Yu, Mark Rutland, Shuah Khan, linux-doc,
	linux-kernel, linux-arm-kernel, kvmarm, linux-perf-users,
	linux-kselftest, Colton Lewis

Overview:

This series implements a new PMU scheme on ARM, a partitioned PMU
that exists alongside the existing emulated PMU and may be enabled by
the kernel command line parameter kvm.reserved_host_counters or by the
vcpu ioctl KVM_ARM_PARTITION_PMU. This is a continuation of the RFC
posted earlier this year. [1]

The high level overview and reason for the name is that this
implementation takes advantage of recent CPU features to partition the
PMU counters into a host-reserved set and a guest-reserved set. Guests
are allowed untrapped hardware access to the most frequently used PMU
registers and features for the guest-reserved counters only.
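
As a rough illustration of the partitioning idea (a sketch, not code
from this series): architecturally, MDCR_EL2.HPMN splits the general
purpose counters so that counters [0, HPMN) are accessible to the
guest and counters [HPMN, N) are reserved for the host. The function
name and the mapping HPMN = N - reserved_host_counters are assumptions
made for this example only.

```python
def partition_counters(num_counters: int, reserved_host_counters: int):
    """Split general purpose counters into guest and host bitmasks.

    Assumption for illustration: HPMN = num_counters - reserved_host_counters.
    Counters [0, HPMN) go to the guest, [HPMN, num_counters) to the host.
    """
    if not 0 <= reserved_host_counters <= num_counters:
        raise ValueError("reserved_host_counters out of range")

    hpmn = num_counters - reserved_host_counters
    all_mask = (1 << num_counters) - 1
    guest_mask = (1 << hpmn) - 1          # low HPMN bits set
    host_mask = all_mask & ~guest_mask    # remaining high bits
    return hpmn, guest_mask, host_mask


if __name__ == "__main__":
    hpmn, guest, host = partition_counters(10, 4)
    print(f"HPMN={hpmn} guest={guest:#06x} host={host:#06x}")
```

The two masks never overlap, which is the property that lets the guest
touch its counters without trapping while the host keeps its own.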

This untrapped hardware access significantly reduces the overhead of
using performance monitoring capabilities such as the `perf` tool
inside a guest VM. Register accesses that aren't trapping to KVM mean
less time spent in the host kernel and more time on the workloads
guests care about. This optimization especially shines during high
`perf` sample rates or large numbers of events that require
multiplexing hardware counters.

Performance:

For example, the following tests were carried out on identical ARM
machines with 10 general purpose counters, running identical guest
images on QEMU. The only difference was whether the guest used my PMU
implementation or the existing one. Some arguments have been
simplified here to clarify the purpose of the test:

1) time perf record -e ${FIFTEEN_HW_EVENTS} -F 1000 -- \
   gzip -c tmpfs/random.64M.img >/dev/null

On emulated PMU this command took 4.143s real time with 0.159s system
time. On partitioned PMU this command took 3.139s real time with
0.110s system time, runtime reductions of 24.23% and 30.82%.

2) time perf stat -dd -- \
   automated_specint2017.sh

On emulated PMU this benchmark completed in 3789.16s real time with
224.45s system time and a final benchmark score of 4.28. On
partitioned PMU this benchmark completed in 3525.67s real time with
15.98s system time and a final benchmark score of 4.56. That is a
6.95% reduction in runtime, 92.88% reduction in system time, and
6.54% improvement in overall benchmark score.
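
The percentages quoted for tests 1 and 2 can be rechecked from the raw
timings. The helper names below are made up for this example; the
input numbers are the ones reported above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from 'before' to 'after', in percent."""
    return (before - after) / before * 100.0


def percent_improvement(before: float, after: float) -> float:
    """Relative increase from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Test 1: perf record around gzip
    print(f"real:   {percent_reduction(4.143, 3.139):.2f}%")
    print(f"system: {percent_reduction(0.159, 0.110):.2f}%")
    # Test 2: perf stat around the SPECint run
    print(f"real:   {percent_reduction(3789.16, 3525.67):.2f}%")
    print(f"system: {percent_reduction(224.45, 15.98):.2f}%")
    print(f"score:  {percent_improvement(4.28, 4.56):.2f}%")
```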

Seeing these improvements on something as lightweight as perf stat is
remarkable and implies there would have been a much greater
improvement with perf record. I did not test that because I was not
confident it would even finish in a reasonable time on the emulated
PMU.

Test 3 was slightly different: I ran the workload in a VM with a
single VCPU pinned to a physical CPU and analyzed from the host where
the physical CPU spent its time using mpstat.

3) perf record -e ${FIFTEEN_HW_EVENTS} -F 4000 -- \
   stress-ng --cpu 0 --timeout 30

Over a period of 30s the cpu running with the emulated PMU spent
34.96% of the time in the host kernel and 55.85% of the time in the
guest. The cpu running the partitioned PMU spent 0.97% of its time in
the host kernel and 91.06% of its time in the guest.

Taken together, these tests represent a remarkable performance
improvement for anything perf related using this new PMU
implementation.

Caveats:

Because the most consistent and performant thing to do was untrap
PMCR_EL0, the number of counters visible to the guest via PMCR_EL0.N
is always equal to the value KVM sets for MDCR_EL2.HPMN. Previously
allowed writes to PMCR_EL0.N via {GET,SET}_ONE_REG no longer affect
the guest.
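
To make the PMCR_EL0.N point concrete, here is a small model of the
field. PMCR_EL0.N occupies bits [15:11] of the register. The function
guest_visible_pmcr() is a hypothetical helper that only mimics the
effect described above (the guest sees N equal to HPMN); it is not
taken from this series.

```python
PMCR_N_SHIFT = 11    # PMCR_EL0.N is bits [15:11]
PMCR_N_MASK = 0x1F   # 5-bit field


def pmcr_n(pmcr: int) -> int:
    """Extract the counter count N from a PMCR_EL0 value."""
    return (pmcr >> PMCR_N_SHIFT) & PMCR_N_MASK


def guest_visible_pmcr(pmcr: int, hpmn: int) -> int:
    """Model the value a guest reads: N replaced by HPMN, other bits kept."""
    cleared = pmcr & ~(PMCR_N_MASK << PMCR_N_SHIFT)
    return cleared | ((hpmn & PMCR_N_MASK) << PMCR_N_SHIFT)


if __name__ == "__main__":
    hw_pmcr = (10 << PMCR_N_SHIFT) | 0x1   # 10 counters, enable bit set
    guest_pmcr = guest_visible_pmcr(hw_pmcr, hpmn=6)
    print(pmcr_n(hw_pmcr), pmcr_n(guest_pmcr))
```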

These improvements come at a cost of 7-35 new registers that must be
swapped at every vcpu_load and vcpu_put if the feature is enabled. I
have been informed KVM would like to avoid paying this cost when
possible.

One solution is to make the trapping changes and context swapping
lazy, so that they only take place after the guest has actually
accessed the PMU. Guests that never access the PMU would then never
pay the cost.
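
The lazy scheme can be modeled in a few lines. This is only a toy
sketch of the idea, with hypothetical names (LazyPmuVcpu,
guest_pmu_access, register_swaps); it counts swaps instead of touching
real registers and says nothing about how KVM would implement it.

```python
from dataclasses import dataclass


@dataclass
class LazyPmuVcpu:
    """Toy model: PMU registers are swapped only after first guest use."""
    pmu_accessed: bool = False
    register_swaps: int = 0

    def _swap(self) -> None:
        # Stands in for saving or restoring the guest PMU registers.
        self.register_swaps += 1

    def vcpu_load(self) -> None:
        if self.pmu_accessed:
            self._swap()

    def vcpu_put(self) -> None:
        if self.pmu_accessed:
            self._swap()

    def guest_pmu_access(self) -> None:
        # The first access would trap; from then on the vcpu pays for swaps.
        if not self.pmu_accessed:
            self.pmu_accessed = True
            self._swap()


if __name__ == "__main__":
    idle = LazyPmuVcpu()
    for _ in range(3):
        idle.vcpu_load()
        idle.vcpu_put()
    print("idle guest swaps:", idle.register_swaps)
```

A guest that never calls guest_pmu_access() leaves register_swaps at
zero no matter how often it is loaded and put.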

This is not done here because it is not crucial to the primary
functionality and I thought review would be more productive as soon as
I had something complete enough for reviewers to easily play with.

However, this or any better ideas are on the table for inclusion in
future re-rolls.

[1] https://lore.kernel.org/kvmarm/20250213180317.3205285-1-coltonlewis@google.com/

Colton Lewis (16):
  arm64: cpufeature: Add cpucap for HPMN0
  arm64: Generate sign macro for sysreg Enums
  arm64: cpufeature: Add cpucap for PMICNTR
  KVM: arm64: Reorganize PMU functions
  KVM: arm64: Introduce method to partition the PMU
  perf: arm_pmuv3: Generalize counter bitmasks
  perf: arm_pmuv3: Keep out of guest counter partition
  KVM: arm64: Set up FGT for Partitioned PMU
  KVM: arm64: Writethrough trapped PMEVTYPER register
  KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned
  KVM: arm64: Writethrough trapped PMOVS register
  KVM: arm64: Context switch Partitioned PMU guest registers
  perf: pmuv3: Handle IRQs for Partitioned PMU guest counters
  KVM: arm64: Inject recorded guest interrupts
  KVM: arm64: Add ioctl to partition the PMU when supported
  KVM: arm64: selftests: Add test case for partitioned PMU

Marc Zyngier (1):
  KVM: arm64: Cleanup PMU includes

 Documentation/virt/kvm/api.rst                |  16 +
 arch/arm/include/asm/arm_pmuv3.h              |  24 +
 arch/arm64/include/asm/arm_pmuv3.h            |  36 +-
 arch/arm64/include/asm/kvm_host.h             | 208 +++++-
 arch/arm64/include/asm/kvm_pmu.h              |  82 +++
 arch/arm64/kernel/cpufeature.c                |  15 +
 arch/arm64/kvm/Makefile                       |   2 +-
 arch/arm64/kvm/arm.c                          |  24 +-
 arch/arm64/kvm/debug.c                        |  13 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  65 +-
 arch/arm64/kvm/pmu-emul.c                     | 629 +----------------
 arch/arm64/kvm/pmu-part.c                     | 358 ++++++++++
 arch/arm64/kvm/pmu.c                          | 630 ++++++++++++++++++
 arch/arm64/kvm/sys_regs.c                     |  54 +-
 arch/arm64/tools/cpucaps                      |   2 +
 arch/arm64/tools/gen-sysreg.awk               |   1 +
 arch/arm64/tools/sysreg                       |   6 +-
 drivers/perf/arm_pmuv3.c                      |  55 +-
 include/kvm/arm_pmu.h                         | 199 ------
 include/linux/perf/arm_pmu.h                  |  15 +-
 include/linux/perf/arm_pmuv3.h                |  14 +-
 include/uapi/linux/kvm.h                      |   4 +
 tools/include/uapi/linux/kvm.h                |   2 +
 .../selftests/kvm/arm64/vpmu_counter_access.c |  40 +-
 virt/kvm/kvm_main.c                           |   1 +
 25 files changed, 1616 insertions(+), 879 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_pmu.h
 create mode 100644 arch/arm64/kvm/pmu-part.c
 delete mode 100644 include/kvm/arm_pmu.h


base-commit: 1b85d923ba8c9e6afaf19e26708411adde94fba8
--
2.49.0.1204.g71687c7c1d-goog

Thread overview: 34+ messages
2025-06-02 19:26 [PATCH 00/17] ARM64 PMU Partitioning Colton Lewis
2025-06-02 19:26 ` [PATCH 01/17] arm64: cpufeature: Add cpucap for HPMN0 Colton Lewis
2025-06-02 22:15   ` Oliver Upton
2025-06-03 20:50     ` Colton Lewis
2025-06-02 19:26 ` [PATCH 02/17] arm64: Generate sign macro for sysreg Enums Colton Lewis
2025-06-02 19:26 ` [PATCH 03/17] arm64: cpufeature: Add cpucap for PMICNTR Colton Lewis
2025-06-02 19:26 ` [PATCH 04/17] KVM: arm64: Cleanup PMU includes Colton Lewis
2025-06-02 21:42   ` Sean Christopherson
2025-06-03 20:48     ` Colton Lewis
2025-06-02 19:26 ` [PATCH 05/17] KVM: arm64: Reorganize PMU functions Colton Lewis
2025-06-02 19:26 ` [PATCH 06/17] KVM: arm64: Introduce method to partition the PMU Colton Lewis
2025-06-02 22:28   ` Oliver Upton
2025-06-03 21:32     ` Colton Lewis
2025-06-03 22:02       ` Oliver Upton
2025-06-04 20:10         ` Colton Lewis
2025-06-04 20:57           ` Oliver Upton
2025-06-02 19:26 ` [PATCH 07/17] perf: arm_pmuv3: Generalize counter bitmasks Colton Lewis
2025-06-02 19:26 ` [PATCH 08/17] perf: arm_pmuv3: Keep out of guest counter partition Colton Lewis
2025-06-02 19:26 ` [PATCH 09/17] KVM: arm64: Set up FGT for Partitioned PMU Colton Lewis
2025-06-02 19:26 ` [PATCH 10/17] KVM: arm64: Writethrough trapped PMEVTYPER register Colton Lewis
2025-06-03 22:22   ` Oliver Upton
2025-06-04 20:10     ` Colton Lewis
2025-06-02 19:26 ` [PATCH 11/17] KVM: arm64: Use physical PMSELR for PMXEVTYPER if partitioned Colton Lewis
2025-06-02 19:26 ` [PATCH 12/17] KVM: arm64: Writethrough trapped PMOVS register Colton Lewis
2025-06-02 19:26 ` [PATCH 13/17] KVM: arm64: Context switch Partitioned PMU guest registers Colton Lewis
2025-06-02 19:26 ` [PATCH 14/17] perf: pmuv3: Handle IRQs for Partitioned PMU guest counters Colton Lewis
2025-06-02 19:27 ` [PATCH 15/17] KVM: arm64: Inject recorded guest interrupts Colton Lewis
2025-06-02 19:27 ` [PATCH 16/17] KVM: arm64: Add ioctl to partition the PMU when supported Colton Lewis
2025-06-02 22:40   ` Oliver Upton
2025-06-03 21:46     ` Colton Lewis
2025-06-04 20:12       ` Colton Lewis
2025-06-02 19:27 ` [PATCH 17/17] KVM: arm64: selftests: Add test case for partitioned PMU Colton Lewis
2025-06-03 22:43 ` [PATCH 00/17] ARM64 PMU Partitioning Oliver Upton
2025-06-04 20:10   ` Colton Lewis
