[RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support
@ 2025-11-14 16:06 Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields Alexandru Elisei
                   ` (35 more replies)
  0 siblings, 36 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

The series is based on v6.18-rc2 + the fix to FGT traps being computed too
late [1], which hasn't yet been merged. A branch containing everything be
found at [2].  kvmtool support is needed to create a VM with SPE enabled; a
branch with the necessary changes can be found at [3]. For testing, I used
kvm-unit-tests, which can be found at [4].

The series is an RFC and is lightly untested, likely broken and incomplete -
support for various features, like pKVM, nested virt, nVHE, etc, etc is missing.
I wanted the focus to be on pinning memory at stage 2 (that's patches #29, 'KVM:
arm64: Pin the SPE buffer in the host and map it at stage 2', to #3, 'KVM:
arm64: Add hugetlb support for SPE') and I would very much like to start a
discussion around that.

This series is based the register definitions from DDI0601 [5] and on DEN0154
[6], which is a beta specification. One notable difference is that in this
implementation I've chosen not to ignore buffer register writes when
PMBLIMITR_EL1.E = 1, to maintain compatibility with the current SPE driver.
We're working internally on merging the changes proposed in DEN0154 with the
Arm ARM.

RFC v5 can be found at [7], although that version is four years old now and
bears little resemblance to this series. The only thing that I kept is the
userspace API, everything else was written from scratch.

Introduction
============

Statistical Profiling Extension (SPE) is an optional feature added in
ARMv8.2. It allows sampling at regular intervals of the operations executed
by the PE and storing a record of each operation in a memory buffer. A high
level overview of the extension is presented in an article on arm.com [8].

The problem
===========

When the Statistical Profiling Unit (SPU from now on) encounter a fault when
it attempts to write a record to memory, two things happen: profiling is
stopped, and the fault is reported to the CPU via an interrupt, not an
exception. This creates a blackout window during which the CPU executes
instructions which aren't profiled. The SPE driver avoid this by keeping the
buffer mapped while ProfilingBufferEnabled() = true. But when running as a
guest under KVM, the SPU will trigger stage 2 faults, with the associated
blackout windows.

Solution
========

I chose the same approach as the SPE driver, which is to avoid the blackout
windows altogether by keeping the buffer mapped at stage 2 while
ProfilingBufferEnabled() = true.

Please note when reading the patches that due to a naming quirk in the
architecture, ProfilingBufferEnabled() = true is not the same as the buffer
enable bit being set (PMBLIMITR_EL1.E = 1). ProfilingBufferEnabled() = true
require the buffer enable bit to be set and that PMBSR_EL1.S = 0.

Implementation
==============

The obvious solution would be to pin the pages corresponding to the buffer in
the host, where by 'pin' I mean to have an elevated reference count.

When ProfilingBufferEnabled() becomes true following a write made by the guest
to one of the buffer registers, KVM does the following:

1. Faults in the buffer pages* in the host's stage 1 with a
pin_user_pages(FOLL_LONGTERM) call.

2. Maps the pages at stage 2.

* The buffer is programmed by the guest with virtual addresess in the guest
stage 1; KVM must also pin the pages that map the stage 1 tables for the
buffer guest virtual addresses to avoid a SPU stage 2 fault on a stage 1
translation table walk.

Somewhat counterintuitive, this doesn't guarantee that the pages remain mapped
in the host's stage 1. split_huge_pmd() will remap a THP block mapping as PTEs,
completely ignoring an elevated reference count, and that means breaking the
existing mapping. KVM uses FOLL_SPLIT_PMD when pinned a page to break existing
block mappings, and make sure this doesn't happen.

But this is still not enough. Even more counterintuitive, a pinned page that
always has a valid mapping in the host's stage 1 can still be unmapped from
stage 2. A few examples that I've found, and I don't think this is an
exhaustive list:

1. Automatic NUMA balancing: skips individual pinned folios, but calls the
invalidate MMU notifier (which unmaps the memory from stage 2) at the **PUD**
level (introduced by commit 7f06e3aa2e83 ("mm/mprotect: push mmu notifier to
PUDs", see mm/mprotect::change_pud_range()).

2. Migration - rmap invokes the mmu invalidate notifier before checking that it
has an elevated reference count. If it finds that the page has an elevated
reference count, it doesn't remove remove it from the stage 1 translation tables
(see mm/rmap.c::try_to_unmmap_one()).

3. khugepaged invokes the mmu invalidate notifier, takes the page table
spinlock, does a check for pins, and backs out of collapsing the PTEs (see
mm/khugepaged,c::hpage_collapse_scan_pmd()).

4. KSM does something very similar to khugepaged, where it invokes the MMU
notifier, takes the page table lock, and backs out of the change if the page
is pinned (see mm/ksm.c::try_to_merge_one_page() -> write_protect_page()).

Why is the MMU notifier taken outside of the spinlock? Because
the MMU notifier must be able to sleep, and the check for an elevated
reference count must be done with the page table spinlock held. Why the need
to do the with the spinlock held? Because GUP pins the page with the spinlock held.

The issue in all the examples is structural: the MMU notifier must be able
to sleep, and so it must be called from a preemptible section; but, since GUP
pins a page with the page table spinlock held, the only reliable way to check
for an elevated reference count is by holding the spinlock, which makes the
check non-preemptible.

To get around this, I implemented a mechanism by which the arch-independent MMU
notifiers are propagated by KVM down to the arch code, and, based on the reason,
KVM will not change stage 2 for the memory region that is pinned for the region.

Alternatives
============

I would be very happy to rethink my approach if we can agree on a better
solution. Some obvious alternatives, not exhaustive by any means:

1. Have KVM prefault memory and map the buffer at stage 2 when
ProfilingBufferEnabled() becomes true. If the SPU reports a stage 2 fault, map
it at stage 2, similar to a CPU fault. Are the SPU faults rare enough for this
approach not to introduce a statistically significant difference in the
profiling data?

2. Have KVM prefault memory, map in the kernel's linear address space and
program SPE in the host to profile the guest.

3. Support only physical addressing mode. Finding large areas of contiguous
physical memory might be difficult for a guest (but not impossible, CMA could
be used for it), and this approach is incompatible with existing guests.

[1] https://lore.kernel.org/kvmarm/20251112102853.47759-1-alexandru.elisei@arm.com/
[2] https://gitlab.arm.com/linux-arm/linux-ae/-/tree/kvm-spe-v6
[3] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/kvm-spe-v6
[4] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/kvm-spe-v6
[5] https://developer.arm.com/documentation/ddi0601/latest/
[6] https://developer.arm.com/documentation/den0154/v1_bet0
[7] https://www.spinics.net/lists/arm-kernel/msg934192.html
[8] https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/statistical-profiling-extension-for-armv8-a

Alexandru Elisei (33):
  arm64/sysreg: Add new SPE fields
  arm64/sysreg: Define MDCR_EL2.E2PB values
  KVM: arm64: Add CONFIG_KVM_ARM_SPE Kconfig option
  perf: arm_spe_pmu: Move struct arm_spe_pmu to a separate header file
  KVM: arm64: Add KVM_CAP_ARM_SPE capability
  KVM: arm64: Add KVM_ARM_VCPU_SPE VCPU feature
  HACK! KVM: arm64: Disable SPE virtualization if protected KVM is
    enabled
  HACK! KVM: arm64: Enable SPE virtualization only in VHE mode
  HACK! KVM: arm64: Disable SPE virtualization if nested virt is enabled
  KVM: arm64: Add SPE VCPU device attribute to set the SPU device
  perf: arm_spe_pmu: Add PMBIDR_EL1 to struct arm_spe_pmu
  KVM: arm64: Add SPE VCPU device attribute to set the max buffer size
  KVM: arm64: Add SPE VCPU device attribute to initialize SPE
  KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver
  KVM: arm64: Add writable SPE system registers to VCPU context
  perf: arm_spe_pmu: Add PMSIDR_EL1 to struct arm_spe_pmu
  KVM: arm64: Trap PMBIDR_EL1 and PMSIDR_EL1
  KVM: arm64: config: Use functions from spe.c to test
    FEAT_SPE_{FnE,FDS}
  KVM: arm64: Check for unsupported CPU early in kvm_arch_vcpu_load()
  KVM: arm64: VHE: Context switch SPE state
  KVM: arm64: Allow guest SPE physical timestamps only if
    perfmon_capable()
  KVM: arm64: Handle SPE hardware maintenance interrupts
  KVM: arm64: Add basic handling of SPE buffer control registers writes
  KVM: arm64: Add comment to explain how trapped SPE registers are
    handled
  KVM: arm64: Make MTE functions public
  KVM: arm64: at: Use callback for reading descriptor
  KVM: arm64: Pin the SPE buffer in the host and map it at stage 2
  KVM: Propagate MMU event to the MMU notifier handlers
  KVM: arm64: Handle MMU notifiers for the SPE buffer
  KVM: Add KVM_EXIT_RLIMIT exit_reason
  KVM: arm64: Implement locked memory accounting for the SPE buffer
  KVM: arm64: Add hugetlb support for SPE
  KVM: arm64: Allow the creation of a SPE enabled VM

Sudeep Holla (2):
  KVM: arm64: Add a new VCPU device control group for SPE
  KVM: arm64: Add SPE VCPU device attribute to set the interrupt number

 Documentation/virt/kvm/api.rst          |   23 +
 Documentation/virt/kvm/devices/vcpu.rst |  139 ++
 arch/arm64/include/asm/kvm_emulate.h    |    9 +-
 arch/arm64/include/asm/kvm_host.h       |   21 +-
 arch/arm64/include/asm/kvm_hyp.h        |   16 +-
 arch/arm64/include/asm/kvm_mmu.h        |   13 +-
 arch/arm64/include/asm/kvm_nested.h     |    6 +
 arch/arm64/include/asm/kvm_spe.h        |  165 ++
 arch/arm64/include/asm/sysreg.h         |    3 +
 arch/arm64/include/uapi/asm/kvm.h       |    6 +
 arch/arm64/kvm/Kconfig                  |    8 +
 arch/arm64/kvm/Makefile                 |    1 +
 arch/arm64/kvm/arm.c                    |   59 +-
 arch/arm64/kvm/at.c                     |   17 +-
 arch/arm64/kvm/config.c                 |   29 +-
 arch/arm64/kvm/debug.c                  |   29 +-
 arch/arm64/kvm/guest.c                  |   12 +
 arch/arm64/kvm/hyp/vhe/Makefile         |    1 +
 arch/arm64/kvm/hyp/vhe/spe-sr.c         |   80 +
 arch/arm64/kvm/hyp/vhe/switch.c         |    2 +
 arch/arm64/kvm/mmu.c                    |  157 +-
 arch/arm64/kvm/nested.c                 |   16 +-
 arch/arm64/kvm/pmu-emul.c               |    4 +-
 arch/arm64/kvm/spe.c                    | 1872 +++++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c               |   76 +-
 arch/arm64/kvm/vgic/vgic-its.c          |    4 +-
 arch/arm64/tools/sysreg                 |   25 +-
 drivers/perf/arm_spe_pmu.c              |   37 +-
 include/kvm/arm_vgic.h                  |    2 +
 include/linux/kvm_host.h                |   19 +
 include/linux/perf/arm_spe_pmu.h        |   59 +
 include/uapi/linux/kvm.h                |    7 +
 virt/kvm/kvm_main.c                     |    8 +
 33 files changed, 2797 insertions(+), 128 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_spe.h
 create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c
 create mode 100644 arch/arm64/kvm/spe.c
 create mode 100644 include/linux/perf/arm_spe_pmu.h

-- 
2.51.2

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-12-10 18:38   ` Leo Yan
  2025-12-15 21:42   ` Suzuki K Poulose
  2025-11-14 16:06 ` [RFC PATCH v6 02/35] arm64/sysreg: Define MDCR_EL2.E2PB values Alexandru Elisei
                   ` (34 subsequent siblings)
  35 siblings, 2 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Add the PMBSR_EL1.MSS2, PMBISR_EL1.MaxBuffSize, PMBLIMITR_EL1.nVM and
PMBIDR_EL1.AddrMode fields, which will be used by KVM.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/tools/sysreg | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 1c6cdf9d54bb..1e7b69594f04 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -3070,7 +3070,9 @@ EndSysreg
 
 Sysreg	PMBLIMITR_EL1	3	0	9	10	0
 Field	63:12	LIMIT
-Res0	11:6
+Res0	11:8
+Field	7	nVM
+Res0	6
 Field	5	PMFZ
 Res0	4:3
 Enum	2:1	FM
@@ -3085,7 +3087,8 @@ Field	63:0	PTR
 EndSysreg
 
 Sysreg	PMBSR_EL1	3	0	9	10	3
-Res0	63:32
+Res0	63:56
+Field	55:32	MSS2
 Enum	31:26	EC
 	0b000000	BUF
 	0b100100	FAULT_S1
@@ -3112,13 +3115,20 @@ Field	7:0	Attr
 EndSysreg
 
 Sysreg	PMBIDR_EL1	3	0	9	10	7
-Res0	63:12
+Res0	63:48
+Field	47:32	MaxBuffSize
+Res0	31:12
 Enum	11:8	EA
 	0b0000	NotDescribed
 	0b0001	Ignored
 	0b0010	SError
 EndEnum
-Res0	7:6
+Enum	7:6	AddrMode
+	0b00	VM_ONLY
+	0b01	BOTH
+	0b10	RESERVED
+	0b11	nVM_ONLY
+EndEnum
 Field	5	F
 Field	4	P
 Field	3:0	ALIGN
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields
  2025-11-14 16:06 ` [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields Alexandru Elisei
@ 2025-12-10 18:38   ` Leo Yan
  2025-12-12  9:39     ` Alexandru Elisei
  2025-12-15 21:42   ` Suzuki K Poulose
  1 sibling, 1 reply; 49+ messages in thread
From: Leo Yan @ 2025-12-10 18:38 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

On Fri, Nov 14, 2025 at 04:06:42PM +0000, Alexandru Elisei wrote:

[...]

>  Sysreg	PMBIDR_EL1	3	0	9	10	7
> -Res0	63:12
> +Res0	63:48
> +Field	47:32	MaxBuffSize
> +Res0	31:12
>  Enum	11:8	EA
>  	0b0000	NotDescribed
>  	0b0001	Ignored
>  	0b0010	SError
>  EndEnum
> -Res0	7:6
> +Enum	7:6	AddrMode
> +	0b00	VM_ONLY
> +	0b01	BOTH
> +	0b10	RESERVED
> +	0b11	nVM_ONLY
> +EndEnum

Not sure how to select the enum names, but seems names above don't match
well to the definitions.  How about:

  Enum  7:6     AddrMode
        0b00    VA_ONLY         // Virtual address only
        0b01    VA_PA           // Virtual and physical address supported
        0b10    RESERVED
        0b11    VM_PA_ONLY      // Physical address only in VM

Thanks,
Leo


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields
  2025-12-10 18:38   ` Leo Yan
@ 2025-12-12  9:39     ` Alexandru Elisei
  0 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-12-12  9:39 UTC (permalink / raw)
  To: Leo Yan
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

Hi Leo,

On Wed, Dec 10, 2025 at 06:38:27PM +0000, Leo Yan wrote:
> On Fri, Nov 14, 2025 at 04:06:42PM +0000, Alexandru Elisei wrote:
> 
> [...]
> 
> >  Sysreg	PMBIDR_EL1	3	0	9	10	7
> > -Res0	63:12
> > +Res0	63:48
> > +Field	47:32	MaxBuffSize
> > +Res0	31:12
> >  Enum	11:8	EA
> >  	0b0000	NotDescribed
> >  	0b0001	Ignored
> >  	0b0010	SError
> >  EndEnum
> > -Res0	7:6
> > +Enum	7:6	AddrMode
> > +	0b00	VM_ONLY
> > +	0b01	BOTH
> > +	0b10	RESERVED
> > +	0b11	nVM_ONLY
> > +EndEnum
> 
> Not sure how to select the enum names, but seems names above don't match
> well to the definitions.  How about:
> 
>   Enum  7:6     AddrMode
>         0b00    VA_ONLY         // Virtual address only
>         0b01    VA_PA           // Virtual and physical address supported
>         0b10    RESERVED
>         0b11    VM_PA_ONLY      // Physical address only in VM

Why not, but I would name the last value PA_ONLY.

Thanks,
Alex


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields
  2025-11-14 16:06 ` [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields Alexandru Elisei
  2025-12-10 18:38   ` Leo Yan
@ 2025-12-15 21:42   ` Suzuki K Poulose
  1 sibling, 0 replies; 49+ messages in thread
From: Suzuki K Poulose @ 2025-12-15 21:42 UTC (permalink / raw)
  To: Alexandru Elisei, maz, oliver.upton, joey.gouly, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

On 14/11/2025 16:06, Alexandru Elisei wrote:
> Add the PMBSR_EL1.MSS2, PMBISR_EL1.MaxBuffSize, PMBLIMITR_EL1.nVM and

minor nit:                ^PMBIDR_EL1

> PMBIDR_EL1.AddrMode fields, which will be used by KVM.
> 

The defintions look correct to me.

Suzuki


> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/tools/sysreg | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
> index 1c6cdf9d54bb..1e7b69594f04 100644
> --- a/arch/arm64/tools/sysreg
> +++ b/arch/arm64/tools/sysreg
> @@ -3070,7 +3070,9 @@ EndSysreg
>   
>   Sysreg	PMBLIMITR_EL1	3	0	9	10	0
>   Field	63:12	LIMIT
> -Res0	11:6
> +Res0	11:8
> +Field	7	nVM
> +Res0	6
>   Field	5	PMFZ
>   Res0	4:3
>   Enum	2:1	FM
> @@ -3085,7 +3087,8 @@ Field	63:0	PTR
>   EndSysreg
>   
>   Sysreg	PMBSR_EL1	3	0	9	10	3
> -Res0	63:32
> +Res0	63:56
> +Field	55:32	MSS2
>   Enum	31:26	EC
>   	0b000000	BUF
>   	0b100100	FAULT_S1
> @@ -3112,13 +3115,20 @@ Field	7:0	Attr
>   EndSysreg
>   
>   Sysreg	PMBIDR_EL1	3	0	9	10	7
> -Res0	63:12
> +Res0	63:48
> +Field	47:32	MaxBuffSize
> +Res0	31:12
>   Enum	11:8	EA
>   	0b0000	NotDescribed
>   	0b0001	Ignored
>   	0b0010	SError
>   EndEnum
> -Res0	7:6
> +Enum	7:6	AddrMode
> +	0b00	VM_ONLY
> +	0b01	BOTH
> +	0b10	RESERVED
> +	0b11	nVM_ONLY
> +EndEnum
>   Field	5	F
>   Field	4	P
>   Field	3:0	ALIGN



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 02/35] arm64/sysreg: Define MDCR_EL2.E2PB values
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-12-15 21:33   ` Suzuki K Poulose
  2025-11-14 16:06 ` [RFC PATCH v6 03/35] KVM: arm64: Add CONFIG_KVM_ARM_SPE Kconfig option Alexandru Elisei
                   ` (33 subsequent siblings)
  35 siblings, 1 reply; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

KVM will make use of the different values for MDCR_EL2.E2PB, document them.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/tools/sysreg | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 1e7b69594f04..0de8af0e0778 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -3851,7 +3851,12 @@ Field	17	HPMD
 Res0	16
 Field	15	EnSPM
 Field	14	TPMS
-Field	13:12	E2PB
+UnsignedEnum	13:12 E2PB
+	0b00	EL2
+	0b01	RESERVED
+	0b10	EL1_TRAP
+	0b11	EL1
+EndEnum
 Field	11	TDRA
 Field	10	TDOSA
 Field	9	TDA
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 02/35] arm64/sysreg: Define MDCR_EL2.E2PB values
  2025-11-14 16:06 ` [RFC PATCH v6 02/35] arm64/sysreg: Define MDCR_EL2.E2PB values Alexandru Elisei
@ 2025-12-15 21:33   ` Suzuki K Poulose
  0 siblings, 0 replies; 49+ messages in thread
From: Suzuki K Poulose @ 2025-12-15 21:33 UTC (permalink / raw)
  To: Alexandru Elisei, maz, oliver.upton, joey.gouly, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

On 14/11/2025 16:06, Alexandru Elisei wrote:
> KVM will make use of the different values for MDCR_EL2.E2PB, document them.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/tools/sysreg | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
> index 1e7b69594f04..0de8af0e0778 100644
> --- a/arch/arm64/tools/sysreg
> +++ b/arch/arm64/tools/sysreg
> @@ -3851,7 +3851,12 @@ Field	17	HPMD
>   Res0	16
>   Field	15	EnSPM
>   Field	14	TPMS
> -Field	13:12	E2PB
> +UnsignedEnum	13:12 E2PB
> +	0b00	EL2
> +	0b01	RESERVED
> +	0b10	EL1_TRAP
> +	0b11	EL1
> +EndEnum
>   Field	11	TDRA
>   Field	10	TDOSA
>   Field	9	TDA

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 03/35] KVM: arm64: Add CONFIG_KVM_ARM_SPE Kconfig option
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 02/35] arm64/sysreg: Define MDCR_EL2.E2PB values Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 04/35] perf: arm_spe_pmu: Move struct arm_spe_pmu to a separate header file Alexandru Elisei
                   ` (32 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Add a new configuration option that will be used for KVM SPE emulation.
CONFIG_KVM_ARM_SPE depends on the SPE driver being builtin because:

1. The SPE driver maintains a cpumask of physical CPUs that support SPE,
   and that will be used by KVM to emulate SPE on heterogeneous systems.

2. KVM will rely on the SPE driver enabling the SPE interrupt at the GIC
   level.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/Kconfig | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 4f803fd1c99a..31388b5b2655 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -83,4 +83,12 @@ config PTDUMP_STAGE2_DEBUGFS
 
 	  If in doubt, say N.
 
+config KVM_ARM_SPE
+	bool
+	depends on KVM && ARM_SPE_PMU=y
+	default n
+	help
+	  Adds support for Statistical Profiling Extension (SPE) in virtual
+	  machines.
+
 endif # VIRTUALIZATION
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 04/35] perf: arm_spe_pmu: Move struct arm_spe_pmu to a separate header file
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (2 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 03/35] KVM: arm64: Add CONFIG_KVM_ARM_SPE Kconfig option Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability Alexandru Elisei
                   ` (31 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

KVM will soon want to make use of struct arm_spe_pmu, move it to a separate
header where it will be easily accessible.

Cc: Will Deacon <will@kernel.org>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 drivers/perf/arm_spe_pmu.c       | 32 +--------------------
 include/linux/perf/arm_spe_pmu.h | 49 ++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 31 deletions(-)
 create mode 100644 include/linux/perf/arm_spe_pmu.h

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index fa50645fedda..188ea783a569 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -28,6 +28,7 @@
 #include <linux/of.h>
 #include <linux/perf_event.h>
 #include <linux/perf/arm_pmu.h>
+#include <linux/perf/arm_spe_pmu.h>
 #include <linux/platform_device.h>
 #include <linux/printk.h>
 #include <linux/slab.h>
@@ -67,37 +68,6 @@ struct arm_spe_pmu_buf {
 	void					*base;
 };
 
-struct arm_spe_pmu {
-	struct pmu				pmu;
-	struct platform_device			*pdev;
-	cpumask_t				supported_cpus;
-	struct hlist_node			hotplug_node;
-
-	int					irq; /* PPI */
-	u16					pmsver;
-	u16					min_period;
-	u16					counter_sz;
-
-#define SPE_PMU_FEAT_FILT_EVT			(1UL << 0)
-#define SPE_PMU_FEAT_FILT_TYP			(1UL << 1)
-#define SPE_PMU_FEAT_FILT_LAT			(1UL << 2)
-#define SPE_PMU_FEAT_ARCH_INST			(1UL << 3)
-#define SPE_PMU_FEAT_LDS			(1UL << 4)
-#define SPE_PMU_FEAT_ERND			(1UL << 5)
-#define SPE_PMU_FEAT_INV_FILT_EVT		(1UL << 6)
-#define SPE_PMU_FEAT_DISCARD			(1UL << 7)
-#define SPE_PMU_FEAT_EFT			(1UL << 8)
-#define SPE_PMU_FEAT_DEV_PROBED			(1UL << 63)
-	u64					features;
-
-	u64					pmsevfr_res0;
-	u16					max_record_sz;
-	u16					align;
-	struct perf_output_handle __percpu	*handle;
-};
-
-#define to_spe_pmu(p) (container_of(p, struct arm_spe_pmu, pmu))
-
 /* Convert a free-running index from perf into an SPE buffer offset */
 #define PERF_IDX2OFF(idx, buf) \
 	((idx) % ((unsigned long)(buf)->nr_pages << PAGE_SHIFT))
diff --git a/include/linux/perf/arm_spe_pmu.h b/include/linux/perf/arm_spe_pmu.h
new file mode 100644
index 000000000000..86085a7559d9
--- /dev/null
+++ b/include/linux/perf/arm_spe_pmu.h
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Split from driver/perf/arm_spe_pmu.c
+ *
+ * Copyright (C) 2016 ARM Limited
+ */
+
+#ifndef __PERF_ARM_SPE_PMU_H__
+#define __PERF_ARM_SPE_PMU_H__
+
+#include <linux/cpumask.h>
+#include <linux/container_of.h>
+#include <linux/list.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/types.h>
+
+struct arm_spe_pmu {
+	struct pmu				pmu;
+	struct platform_device			*pdev;
+	cpumask_t				supported_cpus;
+	struct hlist_node			hotplug_node;
+
+	int					irq; /* PPI */
+	u16					pmsver;
+	u16					min_period;
+	u16					counter_sz;
+
+#define SPE_PMU_FEAT_FILT_EVT			(1UL << 0)
+#define SPE_PMU_FEAT_FILT_TYP			(1UL << 1)
+#define SPE_PMU_FEAT_FILT_LAT			(1UL << 2)
+#define SPE_PMU_FEAT_ARCH_INST			(1UL << 3)
+#define SPE_PMU_FEAT_LDS			(1UL << 4)
+#define SPE_PMU_FEAT_ERND			(1UL << 5)
+#define SPE_PMU_FEAT_INV_FILT_EVT		(1UL << 6)
+#define SPE_PMU_FEAT_DISCARD			(1UL << 7)
+#define SPE_PMU_FEAT_EFT			(1UL << 8)
+#define SPE_PMU_FEAT_DEV_PROBED			(1UL << 63)
+	u64					features;
+
+	u64					pmsevfr_res0;
+	u16					max_record_sz;
+	u16					align;
+	struct perf_output_handle __percpu	*handle;
+};
+
+#define to_spe_pmu(p) (container_of(p, struct arm_spe_pmu, pmu))
+
+#endif /* __PERF_ARM_SPE_PMU_H__ */
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (3 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 04/35] perf: arm_spe_pmu: Move struct arm_spe_pmu to a separate header file Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-12-14 12:18   ` Leo Yan
  2025-11-14 16:06 ` [RFC PATCH v6 06/35] KVM: arm64: Add KVM_ARM_VCPU_SPE VCPU feature Alexandru Elisei
                   ` (30 subsequent siblings)
  35 siblings, 1 reply; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Add the SPE capability that will be used by userspace to test if KVM
supports SPE virtualization.

The SPE driver supports heterogenous systems. Keep track of all the
available Statistical Profiling Units (SPUs) in the system, because KVM
will require the user to associate a VM with exactly one SPU.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/api.rst   | 10 +++++++++
 arch/arm64/include/asm/kvm_spe.h | 20 +++++++++++++++++
 arch/arm64/kvm/Makefile          |  1 +
 arch/arm64/kvm/arm.c             |  5 ++++-
 arch/arm64/kvm/spe.c             | 38 ++++++++++++++++++++++++++++++++
 drivers/perf/arm_spe_pmu.c       |  2 ++
 include/linux/perf/arm_spe_pmu.h |  8 +++++++
 include/uapi/linux/kvm.h         |  1 +
 8 files changed, 84 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/kvm_spe.h
 create mode 100644 arch/arm64/kvm/spe.c

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 57061fa29e6a..10e0733297ac 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -9229,6 +9229,16 @@ KVM exits with the register state of either the L1 or L2 guest
 depending on which executed at the time of an exit. Userspace must
 take care to differentiate between these cases.
 
+8.46 KVM_CAP_ARM_SPE
+--------------------
+
+:Capability: KVM_CAP_ARM_SPE
+:Architectures: arm64
+:Type: vm
+
+This capability indicates that Statistical Profiling Extension (SPE)
+virtualization is available in KVM.
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
new file mode 100644
index 000000000000..6572384531e2
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ */
+
+#ifndef __ARM64_KVM_SPE_H__
+#define __ARM64_KVM_SPE_H__
+
+#ifdef CONFIG_KVM_ARM_SPE
+DECLARE_STATIC_KEY_FALSE(kvm_spe_available);
+
+static __always_inline bool kvm_supports_spe(void)
+{
+	return static_branch_likely(&kvm_spe_available);
+}
+#else
+#define kvm_supports_spe()	false
+#endif /* CONFIG_KVM_ARM_SPE */
+
+#endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 3ebc0570345c..6ea071653c5e 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -29,6 +29,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
 kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
 kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
+kvm-$(CONFIG_KVM_ARM_SPE)     += spe.o
 
 always-y := hyp_constants.h hyp-constants.s
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 052bf0d4d0b0..ee230cb34215 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -39,6 +39,7 @@
 #include <asm/kvm_nested.h>
 #include <asm/kvm_pkvm.h>
 #include <asm/kvm_ptrauth.h>
+#include <asm/kvm_spe.h>
 #include <asm/sections.h>
 
 #include <kvm/arm_hypercalls.h>
@@ -419,7 +420,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		else
 			r = kvm_supports_cacheable_pfnmap();
 		break;
-
+	case KVM_CAP_ARM_SPE:
+		r = kvm_supports_spe();
+		break;
 	default:
 		r = 0;
 	}
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
new file mode 100644
index 000000000000..cf902853750f
--- /dev/null
+++ b/arch/arm64/kvm/spe.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ */
+
+#include <linux/cpumask.h>
+#include <linux/kvm_host.h>
+#include <linux/perf/arm_spe_pmu.h>
+
+#include <asm/kvm_spe.h>
+#include <asm/sysreg.h>
+
+DEFINE_STATIC_KEY_FALSE(kvm_spe_available);
+
+static LIST_HEAD(arm_spus);
+static DEFINE_MUTEX(arm_spus_lock);
+
+struct arm_spu_entry {
+	struct list_head link;
+	struct arm_spe_pmu *arm_spu;
+};
+
+void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
+{
+	struct arm_spu_entry *entry;
+
+	guard(mutex)(&arm_spus_lock);
+
+	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return;
+
+	entry->arm_spu = arm_spu;
+	list_add_tail(&entry->link, &arm_spus);
+
+	if (list_is_singular(&arm_spus))
+		static_branch_enable(&kvm_spe_available);
+}
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 188ea783a569..66ae36d4d32e 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -1326,6 +1326,8 @@ static int arm_spe_pmu_device_probe(struct platform_device *pdev)
 	if (ret)
 		goto out_teardown_dev;
 
+	kvm_host_spe_init(spe_pmu);
+
 	return 0;
 
 out_teardown_dev:
diff --git a/include/linux/perf/arm_spe_pmu.h b/include/linux/perf/arm_spe_pmu.h
index 86085a7559d9..8a2db0e03e45 100644
--- a/include/linux/perf/arm_spe_pmu.h
+++ b/include/linux/perf/arm_spe_pmu.h
@@ -46,4 +46,12 @@ struct arm_spe_pmu {
 
 #define to_spe_pmu(p) (container_of(p, struct arm_spe_pmu, pmu))
 
+#ifdef CONFIG_KVM_ARM_SPE
+void kvm_host_spe_init(struct arm_spe_pmu *spe_pmu);
+#else
+static inline void kvm_host_spe_init(struct arm_spe_pmu *spe_pmu)
+{
+}
+#endif /* CONFIG_KVM_ARM_SPE */
+
 #endif /* __PERF_ARM_SPE_PMU_H__ */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52f6000ab020..11e5dbde331b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -963,6 +963,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
 #define KVM_CAP_GUEST_MEMFD_FLAGS 244
+#define KVM_CAP_ARM_SPE 245
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability
  2025-11-14 16:06 ` [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability Alexandru Elisei
@ 2025-12-14 12:18   ` Leo Yan
  2025-12-15 11:46     ` Alexandru Elisei
  0 siblings, 1 reply; 49+ messages in thread
From: Leo Yan @ 2025-12-14 12:18 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

On Fri, Nov 14, 2025 at 04:06:46PM +0000, Alexandru Elisei wrote:

[...]

> +void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
> +{
> +	struct arm_spu_entry *entry;
> +
> +	guard(mutex)(&arm_spus_lock);
> +
> +	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> +	if (!entry)
> +		return;
> +
> +	entry->arm_spu = arm_spu;
> +	list_add_tail(&entry->link, &arm_spus);
> +
> +	if (list_is_singular(&arm_spus))
> +		static_branch_enable(&kvm_spe_available);

We can simply check if list_empty(&arm_spus) in kvm_supports_spe(), thus
the static key kvm_spe_available is not needed.  Another benefit is this
is consistent with CPU PMU's virt implementation.

Thanks,
Leo


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability
  2025-12-14 12:18   ` Leo Yan
@ 2025-12-15 11:46     ` Alexandru Elisei
  0 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-12-15 11:46 UTC (permalink / raw)
  To: Leo Yan
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

Hi Leo,

On Sun, Dec 14, 2025 at 08:18:42PM +0800, Leo Yan wrote:
> On Fri, Nov 14, 2025 at 04:06:46PM +0000, Alexandru Elisei wrote:
> 
> [...]
> 
> > +void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
> > +{
> > +	struct arm_spu_entry *entry;
> > +
> > +	guard(mutex)(&arm_spus_lock);
> > +
> > +	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
> > +	if (!entry)
> > +		return;
> > +
> > +	entry->arm_spu = arm_spu;
> > +	list_add_tail(&entry->link, &arm_spus);
> > +
> > +	if (list_is_singular(&arm_spus))
> > +		static_branch_enable(&kvm_spe_available);
> 
> We can simply check if list_empty(&arm_spus) in kvm_supports_spe(), thus
> the static key kvm_spe_available is not needed.  Another benefit is this
> is consistent with CPU PMU's virt implementation.

Sure, that makes sense. I think I added a static key because I was thinking
about performance on hot paths, but looking at the series I forgot to actually
make use of it.

Thanks,
Alex


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 06/35] KVM: arm64: Add KVM_ARM_VCPU_SPE VCPU feature
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (4 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 07/35] HACK! KVM: arm64: Disable SPE virtualization if protected KVM is enabled Alexandru Elisei
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Add a new VCPU feature that enables SPE virtualization when set by
userspace.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_spe.h  | 4 ++++
 arch/arm64/include/uapi/asm/kvm.h | 1 +
 arch/arm64/kvm/arm.c              | 7 +++++++
 3 files changed, 12 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 6572384531e2..8e8a5c6f7971 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -13,8 +13,12 @@ static __always_inline bool kvm_supports_spe(void)
 {
 	return static_branch_likely(&kvm_spe_available);
 }
+
+#define vcpu_has_spe(vcpu)					\
+	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_SPE))
 #else
 #define kvm_supports_spe()	false
+#define vcpu_has_spe(vcpu)	false
 #endif /* CONFIG_KVM_ARM_SPE */
 
 #endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index ed5f3892674c..5bdfe1f6d565 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -106,6 +106,7 @@ struct kvm_regs {
 #define KVM_ARM_VCPU_PTRAUTH_GENERIC	6 /* VCPU uses generic authentication */
 #define KVM_ARM_VCPU_HAS_EL2		7 /* Support nested virtualization */
 #define KVM_ARM_VCPU_HAS_EL2_E2H0	8 /* Limit NV support to E2H RES0 */
+#define KVM_ARM_VCPU_SPE		9 /* Support SPE in guest */
 
 struct kvm_vcpu_init {
 	__u32 target;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ee230cb34215..1e4449d96d62 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1459,6 +1459,9 @@ static unsigned long system_supported_vcpu_features(void)
 	if (!cpus_have_final_cap(ARM64_HAS_NESTED_VIRT))
 		clear_bit(KVM_ARM_VCPU_HAS_EL2, &features);
 
+	if (!kvm_supports_spe())
+		clear_bit(KVM_ARM_VCPU_SPE, &features);
+
 	return features;
 }
 
@@ -1498,6 +1501,10 @@ static int kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu,
 	if (test_bit(KVM_ARM_VCPU_HAS_EL2, &features))
 		return -EINVAL;
 
+	/* SPE is incompatible with AArch32 */
+	if (test_bit(KVM_ARM_VCPU_SPE, &features))
+		return -EINVAL;
+
 	return 0;
 }
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 07/35] HACK! KVM: arm64: Disable SPE virtualization if protected KVM is enabled
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (5 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 06/35] KVM: arm64: Add KVM_ARM_VCPU_SPE VCPU feature Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 08/35] HACK! KVM: arm64: Enable SPE virtualization only in VHE mode Alexandru Elisei
                   ` (28 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

For RFC only.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/spe.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index cf902853750f..7e38f7d9075b 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -24,6 +24,10 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 {
 	struct arm_spu_entry *entry;
 
+	/* TODO: pKVM support */
+	if (is_protected_kvm_enabled())
+		return;
+
 	guard(mutex)(&arm_spus_lock);
 
 	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 08/35] HACK! KVM: arm64: Enable SPE virtualization only in VHE mode
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (6 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 07/35] HACK! KVM: arm64: Disable SPE virtualization if protected KVM is enabled Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-12-15 17:49   ` Leo Yan
  2025-11-14 16:06 ` [RFC PATCH v6 09/35] HACK! KVM: arm64: Disable SPE virtualization if nested virt is enabled Alexandru Elisei
                   ` (27 subsequent siblings)
  35 siblings, 1 reply; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

For RFC only.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/spe.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 7e38f7d9075b..101258b55053 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -24,8 +24,8 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 {
 	struct arm_spu_entry *entry;
 
-	/* TODO: pKVM support */
-	if (is_protected_kvm_enabled())
+	/* TODO: pKVM and nVHE support */
+	if (is_protected_kvm_enabled() || !has_vhe())
 		return;
 
 	guard(mutex)(&arm_spus_lock);
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 08/35] HACK! KVM: arm64: Enable SPE virtualization only in VHE mode
  2025-11-14 16:06 ` [RFC PATCH v6 08/35] HACK! KVM: arm64: Enable SPE virtualization only in VHE mode Alexandru Elisei
@ 2025-12-15 17:49   ` Leo Yan
  0 siblings, 0 replies; 49+ messages in thread
From: Leo Yan @ 2025-12-15 17:49 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

On Fri, Nov 14, 2025 at 04:06:49PM +0000, Alexandru Elisei wrote:
> For RFC only.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/kvm/spe.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index 7e38f7d9075b..101258b55053 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -24,8 +24,8 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
>  {
>  	struct arm_spu_entry *entry;
>  
> -	/* TODO: pKVM support */
> -	if (is_protected_kvm_enabled())
> +	/* TODO: pKVM and nVHE support */
> +	if (is_protected_kvm_enabled() || !has_vhe())

I totally agree we should focus on VHE mode first.  But it is worth
considering if we can unify solution across different virtualization
modes.

Aside from register access and IRQ handling, buffer management is the
most complex part of this series, it would be useful to know whether
the change can support different modes.

Thanks,
Leo

P.S. an interesting question is whether we can reuse virtio-iommu or
DMA buffer allocation for SPE.  My understanding is the IOMMU would be
simpler case, as the page table allocation and mapping occur entirely
on the host side.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 09/35] HACK! KVM: arm64: Disable SPE virtualization if nested virt is enabled
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (7 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 08/35] HACK! KVM: arm64: Enable SPE virtualization only in VHE mode Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 10/35] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
                   ` (26 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

For RFC only.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/arm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1e4449d96d62..d7f802035970 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1490,6 +1490,11 @@ static int kvm_vcpu_init_check_features(struct kvm_vcpu *vcpu,
 	    test_bit(KVM_ARM_VCPU_PTRAUTH_GENERIC, &features))
 		return -EINVAL;
 
+	/* TODO: NV support */
+	if (test_bit(KVM_ARM_VCPU_SPE, &features) &&
+	    test_bit(KVM_ARM_VCPU_HAS_EL2, &features))
+		return -EINVAL;
+
 	if (!test_bit(KVM_ARM_VCPU_EL1_32BIT, &features))
 		return 0;
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 10/35] KVM: arm64: Add a new VCPU device control group for SPE
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (8 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 09/35] HACK! KVM: arm64: Disable SPE virtualization if nested virt is enabled Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 11/35] KVM: arm64: Add SPE VCPU device attribute to set the interrupt number Alexandru Elisei
                   ` (25 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse, Sudeep Holla

From: Sudeep Holla <sudeep.holla@arm.com>

Add a new VCPU device control group to control various aspects of KVM's SPE
emulation. Functionality will be added in later patches.

[ Alexandru E: Major rework ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  5 +++++
 arch/arm64/include/asm/kvm_spe.h        | 17 +++++++++++++++++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/guest.c                  | 10 ++++++++++
 arch/arm64/kvm/spe.c                    | 15 +++++++++++++++
 5 files changed, 48 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60bf205cb373..8c5208ccd107 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -291,3 +291,8 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+5. GROUP: KVM_ARM_VCPU_SPE_CTRL
+===============================
+
+:Architectures: ARM64
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 8e8a5c6f7971..7c8268f6f507 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -16,9 +16,26 @@ static __always_inline bool kvm_supports_spe(void)
 
 #define vcpu_has_spe(vcpu)					\
 	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_SPE))
+
+int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 #else
 #define kvm_supports_spe()	false
 #define vcpu_has_spe(vcpu)	false
+
+static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+static inline int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+static inline int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 
 #endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 5bdfe1f6d565..5e2d47572136 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -444,6 +444,7 @@ enum {
 #define   KVM_ARM_VCPU_TIMER_IRQ_HPTIMER	3
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
+#define KVM_ARM_VCPU_SPE_CTRL		3
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 1c87699fd886..d1bf8b154a31 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -25,6 +25,7 @@
 #include <asm/kvm.h>
 #include <asm/kvm_emulate.h>
 #include <asm/kvm_nested.h>
+#include <asm/kvm_spe.h>
 #include <asm/sigcontext.h>
 
 #include "trace.h"
@@ -918,6 +919,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_set_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_spe_set_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -941,6 +945,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_get_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_spe_get_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
@@ -964,6 +971,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 	case KVM_ARM_VCPU_PVTIME_CTRL:
 		ret = kvm_arm_pvtime_has_attr(vcpu, attr);
 		break;
+	case KVM_ARM_VCPU_SPE_CTRL:
+		ret = kvm_spe_has_attr(vcpu, attr);
+		break;
 	default:
 		ret = -ENXIO;
 		break;
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 101258b55053..4d635e881620 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -40,3 +40,18 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 	if (list_is_singular(&arm_spus))
 		static_branch_enable(&kvm_spe_available);
 }
+
+int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 11/35] KVM: arm64: Add SPE VCPU device attribute to set the interrupt number
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (9 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 10/35] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 12/35] KVM: arm64: Add SPE VCPU device attribute to set the SPU device Alexandru Elisei
                   ` (24 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse, Sudeep Holla

From: Sudeep Holla <sudeep.holla@arm.com>

Add KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_IRQ) to allow the user to set
the interrupt number for the buffer management interrupt.

[ Alexandru E: Split from "KVM: arm64: Add a new VCPU device control group
               for SPE" ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst | 22 +++++++
 arch/arm64/include/asm/kvm_host.h       |  2 +
 arch/arm64/include/asm/kvm_spe.h        | 10 +++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/guest.c                  |  2 +
 arch/arm64/kvm/spe.c                    | 82 +++++++++++++++++++++++++
 6 files changed, 119 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 8c5208ccd107..9a26252d0a34 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -296,3 +296,25 @@ From the destination VMM process:
 ===============================
 
 :Architectures: ARM64
+
+5.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
+-----------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for the Profiling Buffer
+             management interrupt number as a pointer to an int
+
+Returns:
+
+	 =======  ==========================================================
+	 -EFAULT  Error accessing the buffer management interrupt number
+	 -EINVAL  Invalid interrupt number or not using an in-kernel irqchip
+	 -ENODEV  KVM_ARM_VCPU_HAS_SPE VCPU feature not set
+	 -ENXIO   SPE not supported or not properly configured
+	 =======  ==========================================================
+
+Required.
+
+Specifies the Profiling Buffer management interrupt number. The interrupt number
+must be a PPI and the interrupt number must be the same for each VCPU. Arm
+recommends 21 as the interrupt number. SPE virtualization requires an in-kernel
+vGIC implementation.
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 64302c438355..bc7aeae39fb9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -27,6 +27,7 @@
 #include <asm/fpsimd.h>
 #include <asm/kvm.h>
 #include <asm/kvm_asm.h>
+#include <asm/kvm_spe.h>
 #include <asm/vncr_mapping.h>
 
 #define __KVM_HAVE_ARCH_INTC_INITIALIZED
@@ -865,6 +866,7 @@ struct kvm_vcpu_arch {
 	struct vgic_cpu vgic_cpu;
 	struct arch_timer_cpu timer_cpu;
 	struct kvm_pmu pmu;
+	struct kvm_vcpu_spe vcpu_spe;
 
 	/* vcpu power state */
 	struct kvm_mp_state mp_state;
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 7c8268f6f507..6855976b4c72 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -6,7 +6,14 @@
 #ifndef __ARM64_KVM_SPE_H__
 #define __ARM64_KVM_SPE_H__
 
+#include <linux/kvm.h>
+
 #ifdef CONFIG_KVM_ARM_SPE
+
+struct kvm_vcpu_spe {
+	int irq_num;		/* Buffer management interrupt number */
+};
+
 DECLARE_STATIC_KEY_FALSE(kvm_spe_available);
 
 static __always_inline bool kvm_supports_spe(void)
@@ -21,6 +28,9 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 #else
+struct kvm_vcpu_spe {
+};
+
 #define kvm_supports_spe()	false
 #define vcpu_has_spe(vcpu)	false
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 5e2d47572136..578a0f6c3f8f 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -445,6 +445,7 @@ enum {
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
 #define KVM_ARM_VCPU_SPE_CTRL		3
+#define   KVM_ARM_VCPU_SPE_IRQ		0
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index d1bf8b154a31..fbc17a71edc3 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -920,7 +920,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
 		ret = kvm_arm_pvtime_set_attr(vcpu, attr);
 		break;
 	case KVM_ARM_VCPU_SPE_CTRL:
+		mutex_lock(&vcpu->kvm->arch.config_lock);
 		ret = kvm_spe_set_attr(vcpu, attr);
+		mutex_unlock(&vcpu->kvm->arch.config_lock);
 		break;
 	default:
 		ret = -ENXIO;
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 4d635e881620..c6b81a2ef71f 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -41,17 +41,99 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 		static_branch_enable(&kvm_spe_available);
 }
 
+static bool kvm_spe_irq_is_valid(struct kvm *kvm, int irq)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vcpu_spe *vcpu_spe;
+	unsigned long i;
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		vcpu_spe = &vcpu->arch.vcpu_spe;
+
+		if (!vcpu_spe->irq_num)
+			continue;
+
+		if (vcpu_spe->irq_num != irq)
+			return false;
+	}
+
+	return true;
+}
+
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+	struct kvm *kvm = vcpu->kvm;
+
+	lockdep_assert_held(&kvm->arch.config_lock);
+
+	if (!vcpu_has_spe(vcpu))
+		return -ENODEV;
+
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int irq;
+
+		if (!irqchip_in_kernel(kvm))
+			return -EINVAL;
+
+		if (get_user(irq, uaddr))
+			return -EFAULT;
+
+		if (!irq_is_ppi(irq))
+			return -EINVAL;
+
+		if (!kvm_spe_irq_is_valid(kvm, irq))
+			return -EINVAL;
+
+		vcpu_spe->irq_num = irq;
+		return 0;
+	}
+	}
+
 	return -ENXIO;
 }
 
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+	struct kvm *kvm = vcpu->kvm;
+
+	if (!vcpu_has_spe(vcpu))
+		return -ENODEV;
+
+	switch (attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int irq;
+
+		if (!irqchip_in_kernel(kvm))
+			return -EINVAL;
+
+		if (!vcpu_spe->irq_num)
+			return -ENXIO;
+
+		irq = vcpu_spe->irq_num;
+		if (put_user(irq, uaddr))
+			return -EFAULT;
+
+		return 0;
+	}
+	}
+
 	return -ENXIO;
 }
 
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
+	if (!vcpu_has_spe(vcpu))
+		return -ENODEV;
+
+	switch(attr->attr) {
+	case KVM_ARM_VCPU_SPE_IRQ:
+		return 0;
+	}
+
 	return -ENXIO;
 }
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 12/35] KVM: arm64: Add SPE VCPU device attribute to set the SPU device
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (10 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 11/35] KVM: arm64: Add SPE VCPU device attribute to set the interrupt number Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 13/35] perf: arm_spe_pmu: Add PMBIDR_EL1 to struct arm_spe_pmu Alexandru Elisei
                   ` (23 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

To support systems where there is more than one SPU instance, or where not
all the PEs have SPE, add KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_SET_SPU)
for userspace to set the SPU instance it wants the virtual machine to use.
Similar to the PMU, it is entirely up to userspace to make sure the VCPUs
are run only on the physical CPUs which share this SPU instance.

If the ioctl is called for multiple VCPUs, userspace must use the same SPU
for each of the VCPUs.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst | 29 ++++++++++++
 arch/arm64/include/asm/kvm_host.h       |  1 +
 arch/arm64/include/asm/kvm_spe.h        |  7 +++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/pmu-emul.c               |  4 +-
 arch/arm64/kvm/spe.c                    | 62 +++++++++++++++++++++++++
 6 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 9a26252d0a34..e305377fadad 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -318,3 +318,32 @@ Specifies the Profiling Buffer management interrupt number. The interrupt number
 must be a PPI and the interrupt number must be the same for each VCPU. Arm
 recommends 21 as the interrupt number. SPE virtualization requires an in-kernel
 vGIC implementation.
+
+5.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_SPU
+------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address to an int representing the SPU
+             identifier.
+
+:Returns:
+
+	 =======  ================================================
+	 -EBUSY   Virtual machine has already run
+	 -EFAULT  Error accessing the SPU identifier
+	 -EINVAL  A different SPU already set
+	 -ENXIO   SPE not supported or not properly configured, or
+                  SPU not present on the system
+	 -ENODEV  KVM_ARM_VCPU_HAS_SPE VCPU feature not set
+	 =======  ================================================
+
+Required.
+
+Request that the VCPU uses the specified hardware SPU. The SPE identifier can be
+read from the 'type' file for the desired SPE instance under /sys/devices (or,
+equivalent, /sys/bus/event_source). Must be set for at least one VCPU, in which
+case all the other VCPUs will use the same SPU. Once a SPU has been set,
+attempting to set a different one will result in an error.
+
+Similar to KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_SET_PMU), userspace is
+responsible for making sure that the VCPU is run only on physical CPUs which
+have the specified SPU.
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index bc7aeae39fb9..373d22ec4783 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -365,6 +365,7 @@ struct kvm_arch {
 	 */
 	unsigned long *pmu_filter;
 	struct arm_pmu *arm_pmu;
+	struct kvm_spe kvm_spe;
 
 	cpumask_var_t supported_cpus;
 
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 6855976b4c72..a4e9f03e3751 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -10,6 +10,10 @@
 
 #ifdef CONFIG_KVM_ARM_SPE
 
+struct kvm_spe {
+	struct arm_spe_pmu *arm_spu;
+};
+
 struct kvm_vcpu_spe {
 	int irq_num;		/* Buffer management interrupt number */
 };
@@ -28,6 +32,9 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 #else
+struct kvm_spe {
+};
+
 struct kvm_vcpu_spe {
 };
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 578a0f6c3f8f..760c3e074d3d 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -446,6 +446,7 @@ enum {
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
 #define KVM_ARM_VCPU_SPE_CTRL		3
 #define   KVM_ARM_VCPU_SPE_IRQ		0
+#define   KVM_ARM_VCPU_SPE_SPU		1
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index b03dbda7f1ab..f71240f5f914 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -1094,7 +1094,9 @@ static int kvm_arm_pmu_v3_set_pmu(struct kvm_vcpu *vcpu, int pmu_id)
 			}
 
 			kvm_arm_set_pmu(kvm, arm_pmu);
-			cpumask_copy(kvm->arch.supported_cpus, &arm_pmu->supported_cpus);
+			/* SPE also sets the supported_cpus cpumask. */
+			cpumask_and(kvm->arch.supported_cpus, &arm_pmu->supported_cpus,
+				    kvm->arch.supported_cpus);
 			ret = 0;
 			break;
 		}
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index c6b81a2ef71f..c581838029ae 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -41,6 +41,43 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 		static_branch_enable(&kvm_spe_available);
 }
 
+static int kvm_spe_set_spu(struct kvm_vcpu *vcpu, int spu_id)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+	struct arm_spe_pmu *existing_spu, *new_spu = NULL;
+	struct arm_spu_entry *entry;
+
+	if (kvm_vm_has_ran_once(kvm))
+		return -EBUSY;
+
+	guard(mutex)(&arm_spus_lock);
+
+	existing_spu = kvm_spe->arm_spu;
+	list_for_each_entry(entry, &arm_spus, link) {
+		if (entry->arm_spu->pmu.type == spu_id) {
+			new_spu = entry->arm_spu;
+			break;
+		}
+	}
+
+	if (!new_spu)
+		return -ENXIO;
+
+	if (existing_spu) {
+		if (new_spu != existing_spu)
+			return -EINVAL;
+		return 0;
+	}
+
+	kvm_spe->arm_spu = new_spu;
+	/* PMU also sets the supported_cpus cpumask. */
+	cpumask_and(kvm->arch.supported_cpus, &new_spu->supported_cpus,
+		    kvm->arch.supported_cpus);
+
+	return 0;
+}
+
 static bool kvm_spe_irq_is_valid(struct kvm *kvm, int irq)
 {
 	struct kvm_vcpu *vcpu;
@@ -90,6 +127,15 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 		vcpu_spe->irq_num = irq;
 		return 0;
 	}
+	case KVM_ARM_VCPU_SPE_SPU: {
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int spu_id;
+
+		if (get_user(spu_id, uaddr))
+			return -EFAULT;
+
+		return kvm_spe_set_spu(vcpu, spu_id);
+	}
 	}
 
 	return -ENXIO;
@@ -99,6 +145,7 @@ int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
 	struct kvm *kvm = vcpu->kvm;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
 
 	if (!vcpu_has_spe(vcpu))
 		return -ENODEV;
@@ -120,6 +167,20 @@ int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return 0;
 	}
+	case KVM_ARM_VCPU_SPE_SPU: {
+		struct arm_spe_pmu *spu = kvm_spe->arm_spu;
+		int __user *uaddr = (int __user *)(long)attr->addr;
+		int spu_id;
+
+		if (!spu)
+			return -ENXIO;
+
+		spu_id = spu->pmu.type;
+		if (put_user(spu_id, uaddr))
+			return -EFAULT;
+
+		return 0;
+	}
 	}
 
 	return -ENXIO;
@@ -132,6 +193,7 @@ int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 	switch(attr->attr) {
 	case KVM_ARM_VCPU_SPE_IRQ:
+	case KVM_ARM_VCPU_SPE_SPU:
 		return 0;
 	}
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 13/35] perf: arm_spe_pmu: Add PMBIDR_EL1 to struct arm_spe_pmu
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (11 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 12/35] KVM: arm64: Add SPE VCPU device attribute to set the SPU device Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 14/35] KVM: arm64: Add SPE VCPU device attribute to set the max buffer size Alexandru Elisei
                   ` (22 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Add the read-only register PMBIDR_EL1 to struct arm_spe_pmu, as KVM will
need it to virtualize SPE and it saves KVM having to read the hardware
register each time a guest accesses it.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 drivers/perf/arm_spe_pmu.c       | 1 +
 include/linux/perf/arm_spe_pmu.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 66ae36d4d32e..2ca3377538aa 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -1056,6 +1056,7 @@ static void __arm_spe_pmu_dev_probe(void *info)
 			"profiling buffer owned by higher exception level\n");
 		return;
 	}
+	spe_pmu->pmbidr_el1 = reg;
 
 	/* Minimum alignment. If it's out-of-range, then fail the probe */
 	fld = FIELD_GET(PMBIDR_EL1_ALIGN, reg);
diff --git a/include/linux/perf/arm_spe_pmu.h b/include/linux/perf/arm_spe_pmu.h
index 8a2db0e03e45..25425249c193 100644
--- a/include/linux/perf/arm_spe_pmu.h
+++ b/include/linux/perf/arm_spe_pmu.h
@@ -21,6 +21,7 @@ struct arm_spe_pmu {
 	cpumask_t				supported_cpus;
 	struct hlist_node			hotplug_node;
 
+	u64					pmbidr_el1;
 	int					irq; /* PPI */
 	u16					pmsver;
 	u16					min_period;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 14/35] KVM: arm64: Add SPE VCPU device attribute to set the max buffer size
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (12 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 13/35] perf: arm_spe_pmu: Add PMBIDR_EL1 to struct arm_spe_pmu Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 15/35] KVM: arm64: Add SPE VCPU device attribute to initialize SPE Alexandru Elisei
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

During profiling, the buffer programmed by the guest must be kept mapped at
stage 2 by KVM, making this memory pinned from the host's perspective.

To make sure that a guest doesn't consume too much memory, add a new SPE
VCPU device attribute, KVM_ARM_VCPU_MAX_BUFFER_SIZE, which is used by
userspace to limit the amount of memory a VCPU can pin when programming
the profiling buffer. This value will be advertised to the guest in the
PMBIDR_EL1.MaxBuffSize field.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  49 ++++++++++
 arch/arm64/include/asm/kvm_spe.h        |   6 ++
 arch/arm64/include/uapi/asm/kvm.h       |   5 +-
 arch/arm64/kvm/arm.c                    |   2 +
 arch/arm64/kvm/spe.c                    | 116 ++++++++++++++++++++++++
 5 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index e305377fadad..bb1bbd2ff6e2 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -347,3 +347,52 @@ attempting to set a different one will result in an error.
 Similar to KVM_ARM_VCPU_PMU_V3_CTRL(KVM_ARM_VCPU_PMU_SET_PMU), userspace is
 responsible for making sure that the VCPU is run only on physical CPUs which
 have the specified SPU.
+
+5.3 ATTRIBUTE: KVM_ARM_VCPU_MAX_BUFFER_SIZE
+------------------------------------------
+
+:Parameters: in kvm_device_attr.addr the address to an u64 representing the
+             maximum buffer size, in bytes.
+
+:Returns:
+
+	 =======  =========================================================
+	 -EBUSY   Virtual machine has already run
+	 -EDOM    Buffer size cannot be represented by hardware
+	 -EFAULT  Error accessing the max buffer size identifier
+	 -EINVAL  A different maximum buffer size already set or the size is
+                  not aligned to the host's page size
+	 -ENXIO   SPE not supported or not properly configured
+	 -ENODEV  KVM_ARM_VCPU_HAS_SPE VCPU feature or SPU instance not set
+	 -ERANGE  Buffer size larger than maximum supported by the SPU
+                  instance.
+	 =======  ==========================================================
+
+Required.
+
+Limit the size of the profiling buffer for the VCPU to the specified value. The
+value will be used by all VCPUs. Can be set for more than one VCPUs, as long as
+the value stays the same.
+
+Requires that a SPU has been already assigned to the VM. The maximum buffer size
+must be less than or equal to the maximum buffer size of the assigned SPU instance,
+unless there is no limit on the maximum buffer size for the SPU. In this case
+the VCPU maximum buffer size can have any value, including 0, as long as it can
+be encoded by hardware. For details on how the hardware encodes this value,
+please consult Arm DDI0601 for the field PMBIDR_EL1.MaxBuffSize.
+
+The value 0 is special and it means that there is no upper limit on the size of
+the buffer that the guest can use. Can only be set if the SPU instance used by
+the VM has a similarly unlimited buffer size.
+
+When a guest enables SPE on the VCPU, KVM will pin the host memory backing the
+buffer to avoid the statistical profiling unit experiencing stage 2 faults when
+it writes to memory. This includes the host pages backing the guest's stage 1
+translation tables that are used to translate the buffer. As a result, it is
+expected that the size of the memory that will be pinned for each VCPU will be
+slightly larger that the maximum buffer set with this ioctl.
+
+This memory that is pinned will count towards the process RLIMIT_MEMLOCK. To
+avoid the limit being exceeded, userspace must increase the RLIMIT_MEMLOCK limit
+prior to running the VCPU, otherwise KVM_RUN will return to userspace with an
+error.
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index a4e9f03e3751..e48f7a7f67bb 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -12,6 +12,7 @@
 
 struct kvm_spe {
 	struct arm_spe_pmu *arm_spu;
+	u64 max_buffer_size;	/* Maximum per VCPU buffer size */
 };
 
 struct kvm_vcpu_spe {
@@ -28,6 +29,8 @@ static __always_inline bool kvm_supports_spe(void)
 #define vcpu_has_spe(vcpu)					\
 	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_SPE))
 
+void kvm_spe_init_vm(struct kvm *kvm);
+
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -41,6 +44,9 @@ struct kvm_vcpu_spe {
 #define kvm_supports_spe()	false
 #define vcpu_has_spe(vcpu)	false
 
+static inline void kvm_spe_init_vm(struct kvm *kvm)
+{
+}
 static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	return -ENXIO;
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 760c3e074d3d..9db652392781 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -445,8 +445,9 @@ enum {
 #define KVM_ARM_VCPU_PVTIME_CTRL	2
 #define   KVM_ARM_VCPU_PVTIME_IPA	0
 #define KVM_ARM_VCPU_SPE_CTRL		3
-#define   KVM_ARM_VCPU_SPE_IRQ		0
-#define   KVM_ARM_VCPU_SPE_SPU		1
+#define   KVM_ARM_VCPU_SPE_IRQ			0
+#define   KVM_ARM_VCPU_SPE_SPU			1
+#define   KVM_ARM_VCPU_SPE_MAX_BUFFER_SIZE	2
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d7f802035970..9afdf66be8b2 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -194,6 +194,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	kvm_timer_init_vm(kvm);
 
+	kvm_spe_init_vm(kvm);
+
 	/* The maximum number of VCPUs is limited by the host's GIC model */
 	kvm->max_vcpus = kvm_arm_default_max_vcpus();
 
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index c581838029ae..3478da2a1f7c 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2021 - ARM Ltd
  */
 
+#include <linux/bitops.h>
 #include <linux/cpumask.h>
 #include <linux/kvm_host.h>
 #include <linux/perf/arm_spe_pmu.h>
@@ -41,6 +42,99 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 		static_branch_enable(&kvm_spe_available);
 }
 
+/*
+ * The maximum buffer size can be zero (no restrictions on the buffer size), so
+ * this value cannot be used as the uninitialized value. The maximum buffer size
+ * must be page aligned, so arbitrarily choose the value '1' for an
+ * uninitialized maximum buffer size.
+ */
+#define KVM_SPE_MAX_BUFFER_SIZE_UNSET		1
+
+void kvm_spe_init_vm(struct kvm *kvm)
+{
+	kvm->arch.kvm_spe.max_buffer_size = KVM_SPE_MAX_BUFFER_SIZE_UNSET;
+}
+
+static u64 max_buffer_size_to_pmbidr_el1(u64 size)
+{
+	u64 msb_idx, num_bits;
+	u64 maxbuffsize;
+	u64 m, e;
+
+	/*
+	 * size = m:zeros(12); m is 9 bits.
+	 */
+	if (size <= GENMASK_ULL(20, 12)) {
+		m = size >> 12;
+		e = 0;
+		goto out;
+	}
+
+	/*
+	 * size = 1:m:zeros(e+11)
+	 */
+
+	num_bits = fls64(size);
+	msb_idx = num_bits - 1;
+
+	/* MSB is not encoded. */
+	m = size & ~BIT(msb_idx);
+	/* m is 9 bits. */
+	m >>= msb_idx - 9;
+	/* MSB is not encoded, m is 9 bits wide and 11 bits are zero. */
+	e = num_bits - 1 - 9 - 11;
+
+out:
+	maxbuffsize = FIELD_PREP(GENMASK_ULL(8, 0), m) | \
+		      FIELD_PREP(GENMASK_ULL(13, 9), e);
+	return FIELD_PREP(PMBIDR_EL1_MaxBuffSize, maxbuffsize);
+}
+
+static u64 pmbidr_el1_to_max_buffer_size(u64 pmbidr_el1)
+{
+	u64 maxbuffsize;
+	u64 e, m;
+
+	maxbuffsize = FIELD_GET(PMBIDR_EL1_MaxBuffSize, pmbidr_el1);
+	e = FIELD_GET(GENMASK_ULL(13, 9), maxbuffsize);
+	m = FIELD_GET(GENMASK_ULL(8, 0), maxbuffsize);
+
+	if (!e)
+		return m << 12;
+	return (1ULL << (9 + e + 11)) | (m << (e + 11));
+}
+
+static int kvm_spe_set_max_buffer_size(struct kvm_vcpu *vcpu, u64 size)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+	u64 decoded_size, spu_size;
+
+	if (kvm_vm_has_ran_once(kvm))
+		return -EBUSY;
+
+	if (!PAGE_ALIGNED(size))
+		return -EINVAL;
+
+	if (!kvm_spe->arm_spu)
+		return -ENODEV;
+
+	if (kvm_spe->max_buffer_size != KVM_SPE_MAX_BUFFER_SIZE_UNSET)
+		return size == kvm_spe->max_buffer_size ? 0 : -EINVAL;
+
+	decoded_size = pmbidr_el1_to_max_buffer_size(max_buffer_size_to_pmbidr_el1(size));
+	if (decoded_size != size)
+		return -EDOM;
+
+	spu_size = pmbidr_el1_to_max_buffer_size(kvm_spe->arm_spu->pmbidr_el1);
+	if (spu_size != 0 && (size == 0 || size > spu_size))
+		return -ERANGE;
+
+	kvm_spe->max_buffer_size = size;
+
+	return 0;
+}
+
 static int kvm_spe_set_spu(struct kvm_vcpu *vcpu, int spu_id)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -136,6 +230,15 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return kvm_spe_set_spu(vcpu, spu_id);
 	}
+	case KVM_ARM_VCPU_SPE_MAX_BUFFER_SIZE: {
+		u64 __user *uaddr = (u64 __user *)(long)attr->addr;
+		u64 size;
+
+		if (get_user(size, uaddr))
+			return -EFAULT;
+
+		return kvm_spe_set_max_buffer_size(vcpu, size);
+	}
 	}
 
 	return -ENXIO;
@@ -181,6 +284,18 @@ int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return 0;
 	}
+	case KVM_ARM_VCPU_SPE_MAX_BUFFER_SIZE: {
+		u64 __user *uaddr = (u64 __user *)(long)attr->addr;
+		u64 size = kvm_spe->max_buffer_size;
+
+		if (size == KVM_SPE_MAX_BUFFER_SIZE_UNSET)
+			return -EINVAL;
+
+		if (put_user(size, uaddr))
+			return -EFAULT;
+
+		return 0;
+	}
 	}
 
 	return -ENXIO;
@@ -194,6 +309,7 @@ int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	switch(attr->attr) {
 	case KVM_ARM_VCPU_SPE_IRQ:
 	case KVM_ARM_VCPU_SPE_SPU:
+	case KVM_ARM_VCPU_SPE_MAX_BUFFER_SIZE:
 		return 0;
 	}
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 15/35] KVM: arm64: Add SPE VCPU device attribute to initialize SPE
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (13 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 14/35] KVM: arm64: Add SPE VCPU device attribute to set the max buffer size Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:06 ` [RFC PATCH v6 16/35] KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver Alexandru Elisei
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse, Sudeep Holla

Add KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_INIT) VCPU ioctl to initialize
SPE. Initialization must be done exactly once for each VCPU.

[ Alexandru E: Split from "KVM: arm64: Add a new VCPU device control group
	   for SPE" ]

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst | 28 +++++++++++++++++--
 arch/arm64/include/asm/kvm_spe.h        |  6 +++++
 arch/arm64/include/uapi/asm/kvm.h       |  1 +
 arch/arm64/kvm/arm.c                    |  4 +++
 arch/arm64/kvm/spe.c                    | 36 +++++++++++++++++++++++++
 5 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index bb1bbd2ff6e2..29dd1f087d4a 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -306,6 +306,7 @@ From the destination VMM process:
 Returns:
 
 	 =======  ==========================================================
+	 -EBUSY   SPE already initialized
 	 -EFAULT  Error accessing the buffer management interrupt number
 	 -EINVAL  Invalid interrupt number or not using an in-kernel irqchip
 	 -ENODEV  KVM_ARM_VCPU_HAS_SPE VCPU feature not set
@@ -328,7 +329,8 @@ vGIC implementation.
 :Returns:
 
 	 =======  ================================================
-	 -EBUSY   Virtual machine has already run
+	 -EBUSY   Virtual machine has already run, or SPE already
+                  initialized
 	 -EFAULT  Error accessing the SPU identifier
 	 -EINVAL  A different SPU already set
 	 -ENXIO   SPE not supported or not properly configured, or
@@ -357,7 +359,8 @@ have the specified SPU.
 :Returns:
 
 	 =======  =========================================================
-	 -EBUSY   Virtual machine has already run
+	 -EBUSY   Virtual machine has already run, or SPE already
+                  initialized
 	 -EDOM    Buffer size cannot be represented by hardware
 	 -EFAULT  Error accessing the max buffer size identifier
 	 -EINVAL  A different maximum buffer size already set or the size is
@@ -396,3 +399,24 @@ This memory that is pinned will count towards the process RLIMIT_MEMLOCK. To
 avoid the limit being exceeded, userspace must increase the RLIMIT_MEMLOCK limit
 prior to running the VCPU, otherwise KVM_RUN will return to userspace with an
 error.
+
+5.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
+-----------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+Returns:
+
+	 =======  ============================================
+	 -EBUSY   SPE already initialized for this VCPU
+	 -ENXIO   SPE not supported or not properly configured
+	 =======  ============================================
+
+Required.
+
+Request initialization of the Statistical Profiling Extension for this VCPU.
+Must be done last, after SPE has been fully configured for the VCPU, and after
+the in-kernel irqchip has been initialized.
+
+KVM will refuse to run the VCPU and KVM_RUN will return an error if SPE hasn't
+been initialized for the VCPU.
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index e48f7a7f67bb..6ce70cf2abaf 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -17,6 +17,7 @@ struct kvm_spe {
 
 struct kvm_vcpu_spe {
 	int irq_num;		/* Buffer management interrupt number */
+	bool initialized;	/* SPE initialized for the VCPU */
 };
 
 DECLARE_STATIC_KEY_FALSE(kvm_spe_available);
@@ -30,6 +31,7 @@ static __always_inline bool kvm_supports_spe(void)
 	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_SPE))
 
 void kvm_spe_init_vm(struct kvm *kvm);
+int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);
 
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -47,6 +49,10 @@ struct kvm_vcpu_spe {
 static inline void kvm_spe_init_vm(struct kvm *kvm)
 {
 }
+static inline int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
 static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	return -ENXIO;
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 9db652392781..186dcaf5e210 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -448,6 +448,7 @@ enum {
 #define   KVM_ARM_VCPU_SPE_IRQ			0
 #define   KVM_ARM_VCPU_SPE_SPU			1
 #define   KVM_ARM_VCPU_SPE_MAX_BUFFER_SIZE	2
+#define   KVM_ARM_VCPU_SPE_INIT			3
 
 /* KVM_IRQ_LINE irq field index values */
 #define KVM_ARM_IRQ_VCPU2_SHIFT		28
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9afdf66be8b2..783e331fb57a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -899,6 +899,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
 			return ret;
 	}
 
+	ret = kvm_spe_vcpu_first_run_init(vcpu);
+	if (ret)
+		return ret;
+
 	mutex_lock(&kvm->arch.config_lock);
 	set_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags);
 	mutex_unlock(&kvm->arch.config_lock);
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 3478da2a1f7c..6bd074e40f6c 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -55,6 +55,19 @@ void kvm_spe_init_vm(struct kvm *kvm)
 	kvm->arch.kvm_spe.max_buffer_size = KVM_SPE_MAX_BUFFER_SIZE_UNSET;
 }
 
+int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+
+	if (!vcpu_has_spe(vcpu))
+		return 0;
+
+	if (!vcpu_spe->initialized)
+		return -EINVAL;
+
+	return 0;
+}
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
 {
 	u64 msb_idx, num_bits;
@@ -195,12 +208,16 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
 	struct kvm *kvm = vcpu->kvm;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
 
 	lockdep_assert_held(&kvm->arch.config_lock);
 
 	if (!vcpu_has_spe(vcpu))
 		return -ENODEV;
 
+	if (vcpu_spe->initialized)
+		return -EBUSY;
+
 	switch (attr->attr) {
 	case KVM_ARM_VCPU_SPE_IRQ: {
 		int __user *uaddr = (int __user *)(long)attr->addr;
@@ -239,6 +256,24 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 
 		return kvm_spe_set_max_buffer_size(vcpu, size);
 	}
+	case KVM_ARM_VCPU_SPE_INIT:
+		if (!vcpu_spe->irq_num)
+			return -ENXIO;
+
+		if (!kvm_spe->arm_spu)
+			return -ENXIO;
+
+		if (kvm_spe->max_buffer_size == KVM_SPE_MAX_BUFFER_SIZE_UNSET)
+			return -ENXIO;
+
+		if (!vgic_initialized(kvm))
+			return -ENXIO;
+
+		if (kvm_vgic_set_owner(vcpu, vcpu_spe->irq_num, vcpu_spe))
+			return -ENXIO;
+
+		vcpu_spe->initialized = true;
+		return 0;
 	}
 
 	return -ENXIO;
@@ -310,6 +345,7 @@ int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 	case KVM_ARM_VCPU_SPE_IRQ:
 	case KVM_ARM_VCPU_SPE_SPU:
 	case KVM_ARM_VCPU_SPE_MAX_BUFFER_SIZE:
+	case KVM_ARM_VCPU_SPE_INIT:
 		return 0;
 	}
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 16/35] KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (14 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 15/35] KVM: arm64: Add SPE VCPU device attribute to initialize SPE Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-12-16 11:40   ` Suzuki K Poulose
  2025-11-14 16:06 ` [RFC PATCH v6 17/35] KVM: arm64: Add writable SPE system registers to VCPU context Alexandru Elisei
                   ` (19 subsequent siblings)
  35 siblings, 1 reply; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

The VCPU registers are reset during the KVM_ARM_VCPU_INIT ioctl, before
userspace can set the desired SPU. Assume that the VCPU is initialized from
a thread that runs on one of the physical CPUs that correspond to the SPU
that userspace will choose for the VM. Set PMSVer to that CPUs hardware
value.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_spe.h |  6 ++++++
 arch/arm64/kvm/spe.c             | 10 ++++++++++
 arch/arm64/kvm/sys_regs.c        | 10 +++++++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 6ce70cf2abaf..5e6d7e609a48 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -33,6 +33,8 @@ static __always_inline bool kvm_supports_spe(void)
 void kvm_spe_init_vm(struct kvm *kvm);
 int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);
 
+u8 kvm_spe_get_pmsver_limit(void);
+
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -53,6 +55,10 @@ static inline int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 {
 	return 0;
 }
+static inline u8 kvm_spe_get_pmsver_limit(void)
+{
+	return 0;
+}
 static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	return -ENXIO;
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 6bd074e40f6c..0c4896c6a873 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -68,6 +68,16 @@ int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+u8 kvm_spe_get_pmsver_limit(void)
+{
+	unsigned int pmsver;
+
+	pmsver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMSVer,
+			       read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1));
+
+	return min(pmsver, ID_AA64DFR0_EL1_PMSVer_V1P5);
+}
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
 {
 	u64 msb_idx, num_bits;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e67eb39ddc11..ac859c39c2be 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -29,6 +29,7 @@
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 #include <asm/kvm_nested.h>
+#include <asm/kvm_spe.h>
 #include <asm/perf_event.h>
 #include <asm/sysreg.h>
 
@@ -1652,6 +1653,9 @@ static s64 kvm_arm64_ftr_safe_value(u32 id, const struct arm64_ftr_bits *ftrp,
 		case ID_AA64DFR0_EL1_DebugVer_SHIFT:
 			kvm_ftr.type = FTR_LOWER_SAFE;
 			break;
+		case ID_AA64DFR0_EL1_PMSVer_SHIFT:
+			kvm_ftr.type = FTR_LOWER_SAFE;
+			break;
 		}
 		break;
 	case SYS_ID_DFR0_EL1:
@@ -2021,8 +2025,11 @@ static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val)
 		val |= SYS_FIELD_PREP(ID_AA64DFR0_EL1, PMUVer,
 				      kvm_arm_pmu_get_pmuver_limit());
 
-	/* Hide SPE from guests */
 	val &= ~ID_AA64DFR0_EL1_PMSVer_MASK;
+	if (vcpu_has_spe(vcpu)) {
+		val |= SYS_FIELD_PREP(ID_AA64DFR0_EL1, PMSVer,
+				      kvm_spe_get_pmsver_limit());
+	}
 
 	/* Hide BRBE from guests */
 	val &= ~ID_AA64DFR0_EL1_BRBE_MASK;
@@ -3209,6 +3216,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	 */
 	ID_FILTERED(ID_AA64DFR0_EL1, id_aa64dfr0_el1,
 		    ID_AA64DFR0_EL1_DoubleLock_MASK |
+		    ID_AA64DFR0_EL1_PMSVer_MASK |
 		    ID_AA64DFR0_EL1_WRPs_MASK |
 		    ID_AA64DFR0_EL1_PMUVer_MASK |
 		    ID_AA64DFR0_EL1_DebugVer_MASK),
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 16/35] KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver
  2025-11-14 16:06 ` [RFC PATCH v6 16/35] KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver Alexandru Elisei
@ 2025-12-16 11:40   ` Suzuki K Poulose
  0 siblings, 0 replies; 49+ messages in thread
From: Suzuki K Poulose @ 2025-12-16 11:40 UTC (permalink / raw)
  To: Alexandru Elisei, maz, oliver.upton, joey.gouly, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

On 14/11/2025 16:06, Alexandru Elisei wrote:
> The VCPU registers are reset during the KVM_ARM_VCPU_INIT ioctl, before
> userspace can set the desired SPU. Assume that the VCPU is initialized from


> a thread that runs on one of the physical CPUs that correspond to the SPU
> that userspace will choose for the VM. Set PMSVer to that CPUs hardware
> value.

This doesn't match the code. See below.

> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/include/asm/kvm_spe.h |  6 ++++++
>   arch/arm64/kvm/spe.c             | 10 ++++++++++
>   arch/arm64/kvm/sys_regs.c        | 10 +++++++++-
>   3 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
> index 6ce70cf2abaf..5e6d7e609a48 100644
> --- a/arch/arm64/include/asm/kvm_spe.h
> +++ b/arch/arm64/include/asm/kvm_spe.h
> @@ -33,6 +33,8 @@ static __always_inline bool kvm_supports_spe(void)
>   void kvm_spe_init_vm(struct kvm *kvm);
>   int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);
>   
> +u8 kvm_spe_get_pmsver_limit(void);
> +
>   int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> @@ -53,6 +55,10 @@ static inline int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
>   {
>   	return 0;
>   }
> +static inline u8 kvm_spe_get_pmsver_limit(void)
> +{
> +	return 0;
> +}
>   static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
>   {
>   	return -ENXIO;
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index 6bd074e40f6c..0c4896c6a873 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -68,6 +68,16 @@ int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
>   	return 0;
>   }
>   
> +u8 kvm_spe_get_pmsver_limit(void)
> +{
> +	unsigned int pmsver;
> +
> +	pmsver = SYS_FIELD_GET(ID_AA64DFR0_EL1, PMSVer,
> +			       read_sanitised_ftr_reg(SYS_ID_AA64DFR0_EL1));

The read_sanitised_ftr_reg() gives you the system wide sanitised
version, not the one on the current CPU. You may need
read_sysreg_s() instead here.


> +
> +	return min(pmsver, ID_AA64DFR0_EL1_PMSVer_V1P5);
> +}
> +
>   static u64 max_buffer_size_to_pmbidr_el1(u64 size)
>   {
>   	u64 msb_idx, num_bits;
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index e67eb39ddc11..ac859c39c2be 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -29,6 +29,7 @@
>   #include <asm/kvm_hyp.h>
>   #include <asm/kvm_mmu.h>
>   #include <asm/kvm_nested.h>
> +#include <asm/kvm_spe.h>
>   #include <asm/perf_event.h>
>   #include <asm/sysreg.h>
>   
> @@ -1652,6 +1653,9 @@ static s64 kvm_arm64_ftr_safe_value(u32 id, const struct arm64_ftr_bits *ftrp,
>   		case ID_AA64DFR0_EL1_DebugVer_SHIFT:
>   			kvm_ftr.type = FTR_LOWER_SAFE;
>   			break;
> +		case ID_AA64DFR0_EL1_PMSVer_SHIFT:
> +			kvm_ftr.type = FTR_LOWER_SAFE;

PMSVer is already FTR_LOWER_SAFE, and we don't need to override it here 
? (unlike the DebugVer or PMU Ver)

> +			break;
>   		}
>   		break;
>   	case SYS_ID_DFR0_EL1:
> @@ -2021,8 +2025,11 @@ static u64 sanitise_id_aa64dfr0_el1(const struct kvm_vcpu *vcpu, u64 val)
>   		val |= SYS_FIELD_PREP(ID_AA64DFR0_EL1, PMUVer,
>   				      kvm_arm_pmu_get_pmuver_limit());
>   
> -	/* Hide SPE from guests */
>   	val &= ~ID_AA64DFR0_EL1_PMSVer_MASK;
> +	if (vcpu_has_spe(vcpu)) {
> +		val |= SYS_FIELD_PREP(ID_AA64DFR0_EL1, PMSVer,
> +				      kvm_spe_get_pmsver_limit());
> +	}

So, we ignore value that the user sets and go with what the SPE instance
that has been chosen ? Should we make it non-writable then ?

Suzuki

>   
>   	/* Hide BRBE from guests */
>   	val &= ~ID_AA64DFR0_EL1_BRBE_MASK;
> @@ -3209,6 +3216,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>   	 */
>   	ID_FILTERED(ID_AA64DFR0_EL1, id_aa64dfr0_el1,
>   		    ID_AA64DFR0_EL1_DoubleLock_MASK |
> +		    ID_AA64DFR0_EL1_PMSVer_MASK |
>   		    ID_AA64DFR0_EL1_WRPs_MASK |
>   		    ID_AA64DFR0_EL1_PMUVer_MASK |
>   		    ID_AA64DFR0_EL1_DebugVer_MASK),



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 17/35] KVM: arm64: Add writable SPE system registers to VCPU context
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (15 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 16/35] KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-12-16 11:54   ` Suzuki K Poulose
  2025-11-14 16:06 ` [RFC PATCH v6 18/35] perf: arm_spe_pmu: Add PMSIDR_EL1 to struct arm_spe_pmu Alexandru Elisei
                   ` (18 subsequent siblings)
  35 siblings, 1 reply; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Add the writable SPE registers to the VCPU context. The registers for now
have generic accessors, with proper handling to be added. PMSIDR_EL1 and
PMBIDR_EL1 are not part of the VCPU context because they are read-only
registers.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 13 +++++++
 arch/arm64/include/asm/kvm_spe.h  | 11 ++++++
 arch/arm64/kvm/debug.c            | 19 ++++++++---
 arch/arm64/kvm/spe.c              | 13 +++++++
 arch/arm64/kvm/sys_regs.c         | 56 ++++++++++++++++++++++++-------
 5 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 373d22ec4783..876957320672 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -464,6 +464,19 @@ enum vcpu_sysreg {
 	PMOVSSET_EL0,	/* Overflow Flag Status Set Register */
 	PMUSERENR_EL0,	/* User Enable Register */
 
+	/* SPE registers */
+	PMSCR_EL1,
+	PMSNEVFR_EL1,
+	PMSICR_EL1,
+	PMSIRR_EL1,
+	PMSFCR_EL1,
+	PMSEVFR_EL1,
+	PMSLATFR_EL1,
+	PMBLIMITR_EL1,
+	PMBPTR_EL1,
+	PMBSR_EL1,
+	PMSDSFR_EL1,
+
 	/* Pointer Authentication Registers in a strict increasing order. */
 	APIAKEYLO_EL1,
 	APIAKEYHI_EL1,
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 5e6d7e609a48..3506d8c4c661 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -38,6 +38,9 @@ u8 kvm_spe_get_pmsver_limit(void);
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+
+bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
+u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
 #else
 struct kvm_spe {
 };
@@ -71,6 +74,14 @@ static inline int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr
 {
 	return -ENXIO;
 }
+static inline bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
+{
+	return true;
+}
+static inline u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+{
+	return 0;
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 
 #endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 3ad6b7c6e4ba..0821ebfb03fa 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -38,19 +38,28 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
 
-	/*
-	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
-	 * to disable guest access to the profiling and trace buffers
-	 */
 	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
 					 *host_data_ptr(nr_event_counters));
+	/*
+	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access to the
+	 * trace buffer.
+	 */
 	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
-				MDCR_EL2_TPMS |
 				MDCR_EL2_TTRF |
 				MDCR_EL2_TPMCR |
 				MDCR_EL2_TDRA |
 				MDCR_EL2_TDOSA);
 
+	if (vcpu_has_spe(vcpu)) {
+		/* Set buffer owner to EL1 and trap the buffer registers. */
+		vcpu->arch.mdcr_el2 |= FIELD_PREP(MDCR_EL2_E2PB, MDCR_EL2_E2PB_EL1_TRAP);
+		/* Leave TPMS zero and don't trap the sampling registers. */
+	} else {
+		/* Trap the sampling registers. */
+		vcpu->arch.mdcr_el2 |= MDCR_EL2_TPMS;
+		/* Leave E2PB zero and trap the buffer registers. */
+	}
+
 	/* Is the VM being debugged by userspace? */
 	if (vcpu->guest_debug)
 		/* Route all software debug exceptions to EL2 */
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 0c4896c6a873..5b3dc622cf82 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -8,6 +8,7 @@
 #include <linux/kvm_host.h>
 #include <linux/perf/arm_spe_pmu.h>
 
+#include <asm/kvm_emulate.h>
 #include <asm/kvm_spe.h>
 #include <asm/sysreg.h>
 
@@ -78,6 +79,18 @@ u8 kvm_spe_get_pmsver_limit(void)
 	return min(pmsver, ID_AA64DFR0_EL1_PMSVer_V1P5);
 }
 
+bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
+{
+	__vcpu_assign_sys_reg(vcpu, val, reg);
+
+	return true;
+}
+
+u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+{
+	return __vcpu_sys_reg(vcpu, reg);
+}
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
 {
 	u64 msb_idx, num_bits;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index ac859c39c2be..5eeea229b46e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1374,6 +1374,28 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
 	return 0;
 }
 
+static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
+				   const struct sys_reg_desc *r)
+{
+	if (vcpu_has_spe(vcpu))
+		return 0;
+
+	return REG_HIDDEN;
+}
+
+static bool access_spe_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			   const struct sys_reg_desc *r)
+{
+	u64 val = p->regval;
+	int reg = r->reg;
+
+	if (p->is_write)
+		return kvm_spe_write_sysreg(vcpu, reg, val);
+
+	p->regval = kvm_spe_read_sysreg(vcpu, reg);
+	return true;
+}
+
 /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
 #define DBG_BCR_BVR_WCR_WVR_EL1(n)					\
 	{ SYS_DESC(SYS_DBGBVRn_EL1(n)),					\
@@ -1406,6 +1428,14 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
 	  .reset = reset_pmevtyper,					\
 	  .access = access_pmu_evtyper, .reg = (PMEVTYPER0_EL0 + n), }
 
+#define SPE_TRAPPED_REG(name)						\
+	SYS_DESC(SYS_##name), .reg = name, .access = access_spe_reg,	\
+	.reset = reset_val, .val = 0, .visibility = spe_visibility
+
+#define SPE_UNTRAPPED_REG(name)						\
+	SYS_DESC(SYS_##name), .reg = name, .access = undef_access,	\
+	.reset = reset_val, .val = 0, .visibility = spe_visibility
+
 /* Macro to expand the AMU counter and type registers*/
 #define AMU_AMEVCNTR0_EL0(n) { SYS_DESC(SYS_AMEVCNTR0_EL0(n)), undef_access }
 #define AMU_AMEVTYPER0_EL0(n) { SYS_DESC(SYS_AMEVTYPER0_EL0(n)), undef_access }
@@ -3323,19 +3353,19 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
 	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
 
-	{ SYS_DESC(SYS_PMSCR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSNEVFR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSICR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSIRR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSFCR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSEVFR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSLATFR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSIDR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMBLIMITR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMBPTR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMBSR_EL1), undef_access },
-	{ SYS_DESC(SYS_PMSDSFR_EL1), undef_access },
-	/* PMBIDR_EL1 is not trapped */
+	{ SPE_UNTRAPPED_REG(PMSCR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSNEVFR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSICR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSIRR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSFCR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSEVFR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSLATFR_EL1) },
+	{ SYS_DESC(SYS_PMSIDR_EL1), .access = undef_access },
+	{ SPE_TRAPPED_REG(PMBLIMITR_EL1) },
+	{ SPE_TRAPPED_REG(PMBPTR_EL1) },
+	{ SPE_TRAPPED_REG(PMBSR_EL1) },
+	{ SPE_UNTRAPPED_REG(PMSDSFR_EL1) },
+	{ SYS_DESC(SYS_PMBIDR_EL1), .access = undef_access },
 
 	{ PMU_SYS_REG(PMINTENSET_EL1),
 	  .access = access_pminten, .reg = PMINTENSET_EL1,
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 17/35] KVM: arm64: Add writable SPE system registers to VCPU context
  2025-11-14 16:06 ` [RFC PATCH v6 17/35] KVM: arm64: Add writable SPE system registers to VCPU context Alexandru Elisei
@ 2025-12-16 11:54   ` Suzuki K Poulose
  0 siblings, 0 replies; 49+ messages in thread
From: Suzuki K Poulose @ 2025-12-16 11:54 UTC (permalink / raw)
  To: Alexandru Elisei, maz, oliver.upton, joey.gouly, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

On 14/11/2025 16:06, Alexandru Elisei wrote:
> Add the writable SPE registers to the VCPU context. The registers for now
> have generic accessors, with proper handling to be added. PMSIDR_EL1 and
> PMBIDR_EL1 are not part of the VCPU context because they are read-only
> registers.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>   arch/arm64/include/asm/kvm_host.h | 13 +++++++
>   arch/arm64/include/asm/kvm_spe.h  | 11 ++++++
>   arch/arm64/kvm/debug.c            | 19 ++++++++---
>   arch/arm64/kvm/spe.c              | 13 +++++++
>   arch/arm64/kvm/sys_regs.c         | 56 ++++++++++++++++++++++++-------
>   5 files changed, 94 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 373d22ec4783..876957320672 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -464,6 +464,19 @@ enum vcpu_sysreg {
>   	PMOVSSET_EL0,	/* Overflow Flag Status Set Register */
>   	PMUSERENR_EL0,	/* User Enable Register */
>   
> +	/* SPE registers */
> +	PMSCR_EL1,
> +	PMSNEVFR_EL1,
> +	PMSICR_EL1,
> +	PMSIRR_EL1,
> +	PMSFCR_EL1,
> +	PMSEVFR_EL1,
> +	PMSLATFR_EL1,
> +	PMBLIMITR_EL1,
> +	PMBPTR_EL1,
> +	PMBSR_EL1,
> +	PMSDSFR_EL1,
> +
>   	/* Pointer Authentication Registers in a strict increasing order. */
>   	APIAKEYLO_EL1,
>   	APIAKEYHI_EL1,
> diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
> index 5e6d7e609a48..3506d8c4c661 100644
> --- a/arch/arm64/include/asm/kvm_spe.h
> +++ b/arch/arm64/include/asm/kvm_spe.h
> @@ -38,6 +38,9 @@ u8 kvm_spe_get_pmsver_limit(void);
>   int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>   int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> +
> +bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
> +u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
>   #else
>   struct kvm_spe {
>   };
> @@ -71,6 +74,14 @@ static inline int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr
>   {
>   	return -ENXIO;
>   }
> +static inline bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
> +{
> +	return true;
> +}
> +static inline u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
> +{
> +	return 0;
> +}
>   #endif /* CONFIG_KVM_ARM_SPE */
>   
>   #endif /* __ARM64_KVM_SPE_H__ */
> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> index 3ad6b7c6e4ba..0821ebfb03fa 100644
> --- a/arch/arm64/kvm/debug.c
> +++ b/arch/arm64/kvm/debug.c
> @@ -38,19 +38,28 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
>   {
>   	preempt_disable();
>   
> -	/*
> -	 * This also clears MDCR_EL2_E2PB_MASK and MDCR_EL2_E2TB_MASK
> -	 * to disable guest access to the profiling and trace buffers
> -	 */
>   	vcpu->arch.mdcr_el2 = FIELD_PREP(MDCR_EL2_HPMN,
>   					 *host_data_ptr(nr_event_counters));
> +	/*
> +	 * This also clears MDCR_EL2_E2PB_MASK to disable guest access to the
> +	 * trace buffer.

nit: MDCR_EL2_E2TB_MASK is for the trace buffer.


Suzuki


> +	 */
>   	vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM |
> -				MDCR_EL2_TPMS |
>   				MDCR_EL2_TTRF |
>   				MDCR_EL2_TPMCR |
>   				MDCR_EL2_TDRA |
>   				MDCR_EL2_TDOSA);
>   
> +	if (vcpu_has_spe(vcpu)) {
> +		/* Set buffer owner to EL1 and trap the buffer registers. */
> +		vcpu->arch.mdcr_el2 |= FIELD_PREP(MDCR_EL2_E2PB, MDCR_EL2_E2PB_EL1_TRAP);
> +		/* Leave TPMS zero and don't trap the sampling registers. */
> +	} else {
> +		/* Trap the sampling registers. */
> +		vcpu->arch.mdcr_el2 |= MDCR_EL2_TPMS;
> +		/* Leave E2PB zero and trap the buffer registers. */
> +	}
> +
>   	/* Is the VM being debugged by userspace? */
>   	if (vcpu->guest_debug)
>   		/* Route all software debug exceptions to EL2 */
> diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
> index 0c4896c6a873..5b3dc622cf82 100644
> --- a/arch/arm64/kvm/spe.c
> +++ b/arch/arm64/kvm/spe.c
> @@ -8,6 +8,7 @@
>   #include <linux/kvm_host.h>
>   #include <linux/perf/arm_spe_pmu.h>
>   
> +#include <asm/kvm_emulate.h>
>   #include <asm/kvm_spe.h>
>   #include <asm/sysreg.h>
>   
> @@ -78,6 +79,18 @@ u8 kvm_spe_get_pmsver_limit(void)
>   	return min(pmsver, ID_AA64DFR0_EL1_PMSVer_V1P5);
>   }
>   
> +bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
> +{
> +	__vcpu_assign_sys_reg(vcpu, val, reg);
> +
> +	return true;
> +}
> +
> +u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
> +{
> +	return __vcpu_sys_reg(vcpu, reg);
> +}
> +
>   static u64 max_buffer_size_to_pmbidr_el1(u64 size)
>   {
>   	u64 msb_idx, num_bits;
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index ac859c39c2be..5eeea229b46e 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1374,6 +1374,28 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
>   	return 0;
>   }
>   
> +static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
> +				   const struct sys_reg_desc *r)
> +{
> +	if (vcpu_has_spe(vcpu))
> +		return 0;
> +
> +	return REG_HIDDEN;
> +}
> +
> +static bool access_spe_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
> +			   const struct sys_reg_desc *r)
> +{
> +	u64 val = p->regval;
> +	int reg = r->reg;
> +
> +	if (p->is_write)
> +		return kvm_spe_write_sysreg(vcpu, reg, val);
> +
> +	p->regval = kvm_spe_read_sysreg(vcpu, reg);
> +	return true;
> +}
> +
>   /* Silly macro to expand the DBG{BCR,BVR,WVR,WCR}n_EL1 registers in one go */
>   #define DBG_BCR_BVR_WCR_WVR_EL1(n)					\
>   	{ SYS_DESC(SYS_DBGBVRn_EL1(n)),					\
> @@ -1406,6 +1428,14 @@ static int set_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r,
>   	  .reset = reset_pmevtyper,					\
>   	  .access = access_pmu_evtyper, .reg = (PMEVTYPER0_EL0 + n), }
>   
> +#define SPE_TRAPPED_REG(name)						\
> +	SYS_DESC(SYS_##name), .reg = name, .access = access_spe_reg,	\
> +	.reset = reset_val, .val = 0, .visibility = spe_visibility
> +
> +#define SPE_UNTRAPPED_REG(name)						\
> +	SYS_DESC(SYS_##name), .reg = name, .access = undef_access,	\
> +	.reset = reset_val, .val = 0, .visibility = spe_visibility
> +
>   /* Macro to expand the AMU counter and type registers*/
>   #define AMU_AMEVCNTR0_EL0(n) { SYS_DESC(SYS_AMEVCNTR0_EL0(n)), undef_access }
>   #define AMU_AMEVTYPER0_EL0(n) { SYS_DESC(SYS_AMEVTYPER0_EL0(n)), undef_access }
> @@ -3323,19 +3353,19 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>   	{ SYS_DESC(SYS_FAR_EL1), access_vm_reg, reset_unknown, FAR_EL1 },
>   	{ SYS_DESC(SYS_PAR_EL1), NULL, reset_unknown, PAR_EL1 },
>   
> -	{ SYS_DESC(SYS_PMSCR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSNEVFR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSICR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSIRR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSFCR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSEVFR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSLATFR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSIDR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMBLIMITR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMBPTR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMBSR_EL1), undef_access },
> -	{ SYS_DESC(SYS_PMSDSFR_EL1), undef_access },
> -	/* PMBIDR_EL1 is not trapped */
> +	{ SPE_UNTRAPPED_REG(PMSCR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSNEVFR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSICR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSIRR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSFCR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSEVFR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSLATFR_EL1) },
> +	{ SYS_DESC(SYS_PMSIDR_EL1), .access = undef_access },
> +	{ SPE_TRAPPED_REG(PMBLIMITR_EL1) },
> +	{ SPE_TRAPPED_REG(PMBPTR_EL1) },
> +	{ SPE_TRAPPED_REG(PMBSR_EL1) },
> +	{ SPE_UNTRAPPED_REG(PMSDSFR_EL1) },
> +	{ SYS_DESC(SYS_PMBIDR_EL1), .access = undef_access },
>   
>   	{ PMU_SYS_REG(PMINTENSET_EL1),
>   	  .access = access_pminten, .reg = PMINTENSET_EL1,



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 18/35] perf: arm_spe_pmu: Add PMSIDR_EL1 to struct arm_spe_pmu
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (16 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 17/35] KVM: arm64: Add writable SPE system registers to VCPU context Alexandru Elisei
@ 2025-11-14 16:06 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 19/35] KVM: arm64: Trap PMBIDR_EL1 and PMSIDR_EL1 Alexandru Elisei
                   ` (17 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:06 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

The host SPE driver might find itself lagging behind the architecture
when it comes to new features, so KVM cannot rely on the 'features'
field to completely describe the SPE implementation.

Some features are advertised in PMSIDR_EL1, so teach the driver to save a
copy of PMSIDR_EL1 in struct arm_spe_pmu, from where it will be consumed
directly by KVM.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 drivers/perf/arm_spe_pmu.c       | 2 ++
 include/linux/perf/arm_spe_pmu.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 2ca3377538aa..d79f4a47c6f9 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -1069,6 +1069,8 @@ static void __arm_spe_pmu_dev_probe(void *info)
 
 	/* It's now safe to read PMSIDR and figure out what we've got */
 	reg = read_sysreg_s(SYS_PMSIDR_EL1);
+	spe_pmu->pmsidr_el1 = reg;
+
 	if (FIELD_GET(PMSIDR_EL1_FE, reg))
 		spe_pmu->features |= SPE_PMU_FEAT_FILT_EVT;
 
diff --git a/include/linux/perf/arm_spe_pmu.h b/include/linux/perf/arm_spe_pmu.h
index 25425249c193..7dd1f77040c2 100644
--- a/include/linux/perf/arm_spe_pmu.h
+++ b/include/linux/perf/arm_spe_pmu.h
@@ -22,6 +22,7 @@ struct arm_spe_pmu {
 	struct hlist_node			hotplug_node;
 
 	u64					pmbidr_el1;
+	u64					pmsidr_el1;
 	int					irq; /* PPI */
 	u16					pmsver;
 	u16					min_period;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 19/35] KVM: arm64: Trap PMBIDR_EL1 and PMSIDR_EL1
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (17 preceding siblings ...)
  2025-11-14 16:06 ` [RFC PATCH v6 18/35] perf: arm_spe_pmu: Add PMSIDR_EL1 to struct arm_spe_pmu Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 20/35] KVM: arm64: config: Use functions from spe.c to test FEAT_SPE_{FnE,FDS} Alexandru Elisei
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

PMBIDR_EL1 and PMSIDR_EL1 are read-only registers that describe the SPE
implementation. Trap reads to allow KVM to control how SPE is described to
a virtual machine. In particular, this is needed to:

- Advertise the maximum buffer size set by userspace in
  PMBIDR_EL1.MaxBuffSize.
- Hide in PMSIDR_EL1, if necessary, the presence of FEAT_SPE_FDS and
  FEAT_SPE_FnE, both of which add new registers which are already
  trapped via the FGU mechanism.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_spe.h |  4 +-
 arch/arm64/kvm/config.c          | 13 ++++-
 arch/arm64/kvm/spe.c             | 82 +++++++++++++++++++++++++++++++-
 arch/arm64/kvm/sys_regs.c        | 13 +++--
 4 files changed, 103 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 3506d8c4c661..47b94794cc5f 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -40,7 +40,7 @@ int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 
 bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
-u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg);
+u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding);
 #else
 struct kvm_spe {
 };
@@ -78,7 +78,7 @@ static inline bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
 	return true;
 }
-static inline u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+static inline u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding)
 {
 	return 0;
 }
diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c
index 24bb3f36e9d5..ed6b167b7aa8 100644
--- a/arch/arm64/kvm/config.c
+++ b/arch/arm64/kvm/config.c
@@ -6,6 +6,7 @@
 
 #include <linux/kvm_host.h>
 #include <asm/kvm_emulate.h>
+#include <asm/kvm_spe.h>
 #include <asm/kvm_nested.h>
 #include <asm/sysreg.h>
 
@@ -1489,6 +1490,16 @@ static void __compute_hfgwtr(struct kvm_vcpu *vcpu)
 		*vcpu_fgt(vcpu, HFGWTR_EL2) |= HFGWTR_EL2_TCR_EL1;
 }
 
+static void __compute_hdfgrtr(struct kvm_vcpu *vcpu)
+{
+	__compute_fgt(vcpu, HDFGRTR_EL2);
+
+	if (vcpu_has_spe(vcpu)) {
+		*vcpu_fgt(vcpu, HDFGRTR_EL2) |= HDFGRTR_EL2_PMBIDR_EL1;
+		*vcpu_fgt(vcpu, HDFGRTR_EL2) |= HDFGRTR_EL2_PMSIDR_EL1;
+	}
+}
+
 static void __compute_hdfgwtr(struct kvm_vcpu *vcpu)
 {
 	__compute_fgt(vcpu, HDFGWTR_EL2);
@@ -1505,7 +1516,7 @@ void kvm_vcpu_load_fgt(struct kvm_vcpu *vcpu)
 	__compute_fgt(vcpu, HFGRTR_EL2);
 	__compute_hfgwtr(vcpu);
 	__compute_fgt(vcpu, HFGITR_EL2);
-	__compute_fgt(vcpu, HDFGRTR_EL2);
+	__compute_hdfgrtr(vcpu);
 	__compute_hdfgwtr(vcpu);
 	__compute_fgt(vcpu, HAFGRTR_EL2);
 
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 5b3dc622cf82..92eb46276c71 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -22,10 +22,16 @@ struct arm_spu_entry {
 	struct arm_spe_pmu *arm_spu;
 };
 
+static u64 max_buffer_size_to_pmbidr_el1(u64 size);
+
 void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 {
 	struct arm_spu_entry *entry;
 
+	/* PMBIDR_EL1 cannot be trapped without FEAT_FGT. */
+	if (!cpus_have_final_cap(ARM64_HAS_FGT))
+		return;
+
 	/* TODO: pKVM and nVHE support */
 	if (is_protected_kvm_enabled() || !has_vhe())
 		return;
@@ -86,9 +92,81 @@ bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 	return true;
 }
 
-u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg)
+static bool kvm_spe_has_feat_spe_fds(struct kvm *kvm)
+{
+	struct arm_spe_pmu *spe_pmu = kvm->arch.kvm_spe.arm_spu;
+
+	return kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMSVer, V1P4) &&
+	       FIELD_GET(PMSIDR_EL1_FDS, spe_pmu->pmsidr_el1);
+}
+
+static bool kvm_spe_has_feat_spe_fne(struct kvm *kvm)
+{
+	struct arm_spe_pmu *spe_pmu = kvm->arch.kvm_spe.arm_spu;
+
+	return kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMSVer, V1P2) &&
+	       FIELD_GET(PMSIDR_EL1_FnE, spe_pmu->pmsidr_el1);
+}
+
+static u64 kvm_spe_get_pmsidr_el1(struct kvm_vcpu *vcpu)
 {
-	return __vcpu_sys_reg(vcpu, reg);
+	struct kvm *kvm = vcpu->kvm;
+	u64 pmsidr = kvm->arch.kvm_spe.arm_spu->pmsidr_el1;
+
+	/* Filter out known RES0 bits. */
+	pmsidr &= ~GENMASK_ULL(63, 33);
+
+	if (!kvm_spe_has_feat_spe_fne(kvm))
+		pmsidr &= ~PMSIDR_EL1_FnE;
+
+	if (!kvm_spe_has_feat_spe_fds(kvm))
+		pmsidr &= ~PMSIDR_EL1_FDS;
+
+	return pmsidr;
+}
+
+static u64 kvm_spe_get_pmbidr_el1(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+	u64 max_buffer_size = kvm_spe->max_buffer_size;
+	u64 pmbidr_el1 = kvm_spe->arm_spu->pmbidr_el1;
+
+	/* Filter out known RES0 bits. */
+	pmbidr_el1 &= ~(GENMASK_ULL(63, 48) | GENMASK_ULL(31, 12));
+
+	if (!kvm_has_feat(kvm, ID_AA64DFR2_EL1, SPE_nVM, IMP)) {
+		pmbidr_el1 &= ~PMBIDR_EL1_AddrMode;
+		pmbidr_el1 |= PMBIDR_EL1_AddrMode_VM_ONLY;
+	}
+
+	pmbidr_el1 &= ~PMBIDR_EL1_MaxBuffSize;
+	pmbidr_el1 |= max_buffer_size_to_pmbidr_el1(max_buffer_size);
+
+	return pmbidr_el1;
+}
+
+u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding)
+{
+	u64 val = 0;
+
+	switch (encoding) {
+	case SYS_PMBIDR_EL1:
+		val = kvm_spe_get_pmbidr_el1(vcpu);
+		break;
+	case SYS_PMSIDR_EL1:
+		val = kvm_spe_get_pmsidr_el1(vcpu);
+		break;
+	case SYS_PMBLIMITR_EL1:
+	case SYS_PMBPTR_EL1:
+	case SYS_PMBSR_EL1:
+		val = __vcpu_sys_reg(vcpu, reg);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	return val;
 }
 
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 5eeea229b46e..db86d1dcd148 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1386,13 +1386,18 @@ static unsigned int spe_visibility(const struct kvm_vcpu *vcpu,
 static bool access_spe_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 			   const struct sys_reg_desc *r)
 {
+	u32 encoding = reg_to_encoding(r);
 	u64 val = p->regval;
 	int reg = r->reg;
 
-	if (p->is_write)
+	if (p->is_write) {
+		if (WARN_ON_ONCE(encoding == SYS_PMBIDR_EL1 ||
+				 encoding == SYS_PMSIDR_EL1))
+			return write_to_read_only(vcpu, p, r);
 		return kvm_spe_write_sysreg(vcpu, reg, val);
+	}
 
-	p->regval = kvm_spe_read_sysreg(vcpu, reg);
+	p->regval = kvm_spe_read_sysreg(vcpu, reg, encoding);
 	return true;
 }
 
@@ -3360,12 +3365,12 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	{ SPE_UNTRAPPED_REG(PMSFCR_EL1) },
 	{ SPE_UNTRAPPED_REG(PMSEVFR_EL1) },
 	{ SPE_UNTRAPPED_REG(PMSLATFR_EL1) },
-	{ SYS_DESC(SYS_PMSIDR_EL1), .access = undef_access },
+	{ SYS_DESC(SYS_PMSIDR_EL1), .access = access_spe_reg, .visibility = spe_visibility },
 	{ SPE_TRAPPED_REG(PMBLIMITR_EL1) },
 	{ SPE_TRAPPED_REG(PMBPTR_EL1) },
 	{ SPE_TRAPPED_REG(PMBSR_EL1) },
 	{ SPE_UNTRAPPED_REG(PMSDSFR_EL1) },
-	{ SYS_DESC(SYS_PMBIDR_EL1), .access = undef_access },
+	{ SYS_DESC(SYS_PMBIDR_EL1), .access = access_spe_reg, .visibility = spe_visibility },
 
 	{ PMU_SYS_REG(PMINTENSET_EL1),
 	  .access = access_pminten, .reg = PMINTENSET_EL1,
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 20/35] KVM: arm64: config: Use functions from spe.c to test FEAT_SPE_{FnE,FDS}
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (18 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 19/35] KVM: arm64: Trap PMBIDR_EL1 and PMSIDR_EL1 Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 21/35] KVM: arm64: Check for unsupported CPU early in kvm_arch_vcpu_load() Alexandru Elisei
                   ` (15 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

KVM's FGU mechanism will trap the registers introduced by FEAT_SPE_FnE and
FEAT_SPE_FDS if the feature is not present for the VM. Move the functions
that check for the presence of these features out of config.c and into
spe.c, since that's where the bulk of SPE virtualization lies.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_spe.h | 11 +++++++++++
 arch/arm64/kvm/config.c          | 16 +++-------------
 arch/arm64/kvm/spe.c             |  4 ++--
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 47b94794cc5f..6bc728723897 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -41,6 +41,9 @@ int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 
 bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val);
 u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding);
+
+bool kvm_spe_has_feat_spe_fne(struct kvm *kvm);
+bool kvm_spe_has_feat_spe_fds(struct kvm *kvm);
 #else
 struct kvm_spe {
 };
@@ -82,6 +85,14 @@ static inline u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encodi
 {
 	return 0;
 }
+static inline bool kvm_spe_has_feat_spe_fne(struct kvm *kvm)
+{
+	return false;
+}
+static inline bool kvm_spe_has_feat_spe_fds(struct kvm *kvm)
+{
+	return false;
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 
 #endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/config.c b/arch/arm64/kvm/config.c
index ed6b167b7aa8..51ad0bf936c0 100644
--- a/arch/arm64/kvm/config.c
+++ b/arch/arm64/kvm/config.c
@@ -285,16 +285,6 @@ static bool feat_sme_smps(struct kvm *kvm)
 		(read_sysreg_s(SYS_SMIDR_EL1) & SMIDR_EL1_SMPS));
 }
 
-static bool feat_spe_fds(struct kvm *kvm)
-{
-	/*
-	 * Revists this if KVM ever supports SPE -- this really should
-	 * look at the guest's view of PMSIDR_EL1.
-	 */
-	return (kvm_has_feat(kvm, FEAT_SPEv1p4) &&
-		(read_sysreg_s(SYS_PMSIDR_EL1) & PMSIDR_EL1_FDS));
-}
-
 static bool feat_trbe_mpam(struct kvm *kvm)
 {
 	/*
@@ -548,7 +538,7 @@ static const struct reg_bits_to_feat_map hdfgrtr_feat_map[] = {
 		   HDFGRTR_EL2_PMBPTR_EL1	|
 		   HDFGRTR_EL2_PMBLIMITR_EL1,
 		   FEAT_SPE),
-	NEEDS_FEAT(HDFGRTR_EL2_nPMSNEVFR_EL1, FEAT_SPE_FnE),
+	NEEDS_FEAT(HDFGRTR_EL2_nPMSNEVFR_EL1, kvm_spe_has_feat_spe_fne),
 	NEEDS_FEAT(HDFGRTR_EL2_nBRBDATA		|
 		   HDFGRTR_EL2_nBRBCTL		|
 		   HDFGRTR_EL2_nBRBIDR,
@@ -852,7 +842,7 @@ static const struct reg_bits_to_feat_map hdfgrtr2_feat_map[] = {
 		   HDFGRTR2_EL2_nPMSSDATA,
 		   FEAT_PMUv3_SS),
 	NEEDS_FEAT(HDFGRTR2_EL2_nPMIAR_EL1, FEAT_SEBEP),
-	NEEDS_FEAT(HDFGRTR2_EL2_nPMSDSFR_EL1, feat_spe_fds),
+	NEEDS_FEAT(HDFGRTR2_EL2_nPMSDSFR_EL1, kvm_spe_has_feat_spe_fds),
 	NEEDS_FEAT(HDFGRTR2_EL2_nPMBMAR_EL1, FEAT_SPE_nVM),
 	NEEDS_FEAT(HDFGRTR2_EL2_nSPMACCESSR_EL1	|
 		   HDFGRTR2_EL2_nSPMCNTEN	|
@@ -885,7 +875,7 @@ static const struct reg_bits_to_feat_map hdfgwtr2_feat_map[] = {
 		   feat_pmuv3p9),
 	NEEDS_FEAT(HDFGWTR2_EL2_nPMSSCR_EL1, FEAT_PMUv3_SS),
 	NEEDS_FEAT(HDFGWTR2_EL2_nPMIAR_EL1, FEAT_SEBEP),
-	NEEDS_FEAT(HDFGWTR2_EL2_nPMSDSFR_EL1, feat_spe_fds),
+	NEEDS_FEAT(HDFGWTR2_EL2_nPMSDSFR_EL1, kvm_spe_has_feat_spe_fds),
 	NEEDS_FEAT(HDFGWTR2_EL2_nPMBMAR_EL1, FEAT_SPE_nVM),
 	NEEDS_FEAT(HDFGWTR2_EL2_nSPMACCESSR_EL1	|
 		   HDFGWTR2_EL2_nSPMCNTEN	|
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 92eb46276c71..fa24e47a1e73 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -92,7 +92,7 @@ bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 	return true;
 }
 
-static bool kvm_spe_has_feat_spe_fds(struct kvm *kvm)
+bool kvm_spe_has_feat_spe_fds(struct kvm *kvm)
 {
 	struct arm_spe_pmu *spe_pmu = kvm->arch.kvm_spe.arm_spu;
 
@@ -100,7 +100,7 @@ static bool kvm_spe_has_feat_spe_fds(struct kvm *kvm)
 	       FIELD_GET(PMSIDR_EL1_FDS, spe_pmu->pmsidr_el1);
 }
 
-static bool kvm_spe_has_feat_spe_fne(struct kvm *kvm)
+bool kvm_spe_has_feat_spe_fne(struct kvm *kvm)
 {
 	struct arm_spe_pmu *spe_pmu = kvm->arch.kvm_spe.arm_spu;
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 21/35] KVM: arm64: Check for unsupported CPU early in kvm_arch_vcpu_load()
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (19 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 20/35] KVM: arm64: config: Use functions from spe.c to test FEAT_SPE_{FnE,FDS} Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 22/35] KVM: arm64: VHE: Context switch SPE state Alexandru Elisei
                   ` (14 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

When support for SPE is added, KVM will have to touch SPE hardware
registers on VCPU load and put. Move the check for unsupported CPU at the
start of kvm_arch_vcpu_load(), so KVM doesn't try to access SPE registers
on a CPU which doesn't have FEAT_SPE.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/arm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 783e331fb57a..f5c846c16cb8 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -586,6 +586,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	struct kvm_s2_mmu *mmu;
 	int *last_ran;
 
+	if (!cpumask_test_cpu(cpu, vcpu->kvm->arch.supported_cpus))
+		vcpu_set_on_unsupported_cpu(vcpu);
+
 	if (is_protected_kvm_enabled())
 		goto nommu;
 
@@ -656,9 +659,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_call_hyp(__vgic_v3_restore_vmcr_aprs,
 			     &vcpu->arch.vgic_cpu.vgic_v3);
 	}
-
-	if (!cpumask_test_cpu(cpu, vcpu->kvm->arch.supported_cpus))
-		vcpu_set_on_unsupported_cpu(vcpu);
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 22/35] KVM: arm64: VHE: Context switch SPE state
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (20 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 21/35] KVM: arm64: Check for unsupported CPU early in kvm_arch_vcpu_load() Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 23/35] KVM: arm64: Allow guest SPE physical timestamps only if perfmon_capable() Alexandru Elisei
                   ` (13 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Save and restore the SPE register state at the appropriate point in the
VCPU life cycle:

1. On VCPU load/put:
    * The sampling registers for the host and guest.
    * The buffer registers for the host.

2. On VCPU entry/exit:
    * The buffer registers for the guest.

Note that as a consequence, when a VM has SPE enabled, KVM disables host
profiling when the VCPU is scheduled in, and it resumes it when the VCPU
is scheduled out. This is different to what happens when a VM doesn't
have SPE enabled, which is to stop sampling when the exception level
changes to EL1.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_hyp.h |  16 +++-
 arch/arm64/include/asm/kvm_spe.h |  17 +++++
 arch/arm64/kvm/arm.c             |  10 +++
 arch/arm64/kvm/debug.c           |  10 ++-
 arch/arm64/kvm/hyp/vhe/Makefile  |   1 +
 arch/arm64/kvm/hyp/vhe/spe-sr.c  |  80 ++++++++++++++++++++
 arch/arm64/kvm/hyp/vhe/switch.c  |   2 +
 arch/arm64/kvm/spe.c             | 125 +++++++++++++++++++++++++++++++
 8 files changed, 259 insertions(+), 2 deletions(-)
 create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index e6be1f5d0967..93ebe3c0d417 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -109,7 +109,21 @@ void __debug_switch_to_host(struct kvm_vcpu *vcpu);
 #ifdef __KVM_NVHE_HYPERVISOR__
 void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu);
 void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu);
-#endif
+#else
+#ifdef CONFIG_KVM_ARM_SPE
+void __kvm_spe_save_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *guest_ctxt);
+void __kvm_spe_restore_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *guest_ctxt);
+#else
+static inline void
+__kvm_spe_save_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *guest_ctxt)
+{
+}
+static inline void
+__kvm_spe_restore_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *guest_ctxt)
+{
+}
+#endif /* CONFIG_KVM_ARM_SPE */
+#endif /* __KVM_NVHE_HYPERVISOR__ */
 
 void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
 void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 6bc728723897..077ca1e596b8 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -16,6 +16,7 @@ struct kvm_spe {
 };
 
 struct kvm_vcpu_spe {
+	u64 host_pmscr_el2;	/* Host PMSCR_EL2 register, context switched. */
 	int irq_num;		/* Buffer management interrupt number */
 	bool initialized;	/* SPE initialized for the VCPU */
 };
@@ -30,6 +31,13 @@ static __always_inline bool kvm_supports_spe(void)
 #define vcpu_has_spe(vcpu)					\
 	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_SPE))
 
+/* Implements the function ProfilingBufferEnabled() from ARM DDI0487K.a */
+static inline bool kvm_spe_profiling_buffer_enabled(u64 pmblimitr_el1, u64 pmbsr_el1)
+{
+	return !FIELD_GET(PMBSR_EL1_S, pmbsr_el1) &&
+	       FIELD_GET(PMBLIMITR_EL1_E, pmblimitr_el1);
+}
+
 void kvm_spe_init_vm(struct kvm *kvm);
 int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);
 
@@ -44,6 +52,9 @@ u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding);
 
 bool kvm_spe_has_feat_spe_fne(struct kvm *kvm);
 bool kvm_spe_has_feat_spe_fds(struct kvm *kvm);
+
+void kvm_vcpu_spe_load(struct kvm_vcpu *vcpu);
+void kvm_vcpu_spe_put(struct kvm_vcpu *vcpu);
 #else
 struct kvm_spe {
 };
@@ -93,6 +104,12 @@ static inline bool kvm_spe_has_feat_spe_fds(struct kvm *kvm)
 {
 	return false;
 }
+static inline void kvm_vcpu_spe_load(struct kvm_vcpu *vcpu)
+{
+}
+static inline void kvm_vcpu_spe_put(struct kvm_vcpu *vcpu)
+{
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 
 #endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f5c846c16cb8..c5f5d5dbd695 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -631,6 +631,11 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	kvm_timer_vcpu_load(vcpu);
 	kvm_vgic_load(vcpu);
+	/*
+	 * Drain the host profiling buffer before the buffer owning exception
+	 * level is changed in kvm_vcpu_load_debug().
+	 */
+	kvm_vcpu_spe_load(vcpu);
 	kvm_vcpu_load_debug(vcpu);
 	kvm_vcpu_load_fgt(vcpu);
 	if (has_vhe())
@@ -670,6 +675,11 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	}
 
 	kvm_vcpu_put_debug(vcpu);
+	/*
+	 * Restore the host profiling session after the owning exception level
+	 * is restored in kvm_vcpu_put_debug().
+	 */
+	kvm_vcpu_spe_put(vcpu);
 	kvm_arch_vcpu_put_fp(vcpu);
 	if (has_vhe())
 		kvm_vcpu_put_vhe(vcpu);
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 0821ebfb03fa..d6357784730d 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -75,8 +75,16 @@ static void kvm_arm_setup_mdcr_el2(struct kvm_vcpu *vcpu)
 		kvm_nested_setup_mdcr_el2(vcpu);
 
 	/* Write MDCR_EL2 directly if we're already at EL2 */
-	if (has_vhe())
+	if (has_vhe()) {
 		write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+		if (vcpu_has_spe(vcpu)) {
+			/*
+			 * Synchronize the write that changes the owning regime
+			 * to EL1&0.
+			 */
+			isb();
+		}
+	}
 
 	preempt_enable();
 }
diff --git a/arch/arm64/kvm/hyp/vhe/Makefile b/arch/arm64/kvm/hyp/vhe/Makefile
index afc4aed9231a..49496139b156 100644
--- a/arch/arm64/kvm/hyp/vhe/Makefile
+++ b/arch/arm64/kvm/hyp/vhe/Makefile
@@ -11,3 +11,4 @@ CFLAGS_switch.o += -Wno-override-init
 obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o
 obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
 	 ../fpsimd.o ../hyp-entry.o ../exception.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
new file mode 100644
index 000000000000..fb8614435069
--- /dev/null
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ */
+
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_hyp.h>
+#include <asm/kprobes.h>
+#include <asm/kvm_spe.h>
+
+/*
+ * The following are true when the guest buffer is restored or saved:
+ * - Sampling is disabled.
+ * - The buffer owning regime is EL1&0.
+ * - Stage 2 is enabled.
+ */
+
+static bool kvm_spe_profiling_buffer_enabled_ctxt(struct kvm_cpu_context *ctxt)
+{
+	return kvm_spe_profiling_buffer_enabled(ctxt_sys_reg(ctxt, PMBLIMITR_EL1),
+						ctxt_sys_reg(ctxt, PMBSR_EL1));
+}
+
+void __kvm_spe_restore_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *guest_ctxt)
+{
+	if (!vcpu_has_spe(vcpu))
+		return;
+
+	/*
+	 * If StatisticalProfilingEnabled() is false or the buffer is in
+	 * discard mode, the hardware value for PMBPTR_EL1 won't change while
+	 * the guest is running, so no point in writing the registers to
+	 * hardware.
+	 *
+	 * This is also about correctness. KVM runs a guest with hardware
+	 * service bit clear. If the in-memory service bit is set, the only way
+	 * to stop profiling while the guest is running is to have the hardware
+	 * buffer enable bit clear.
+	 */
+	if (!kvm_spe_profiling_buffer_enabled_ctxt(guest_ctxt))
+		return;
+
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	isb();
+	write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+}
+NOKPROBE_SYMBOL(__kvm_spe_restore_guest_buffer);
+
+void __kvm_spe_save_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *guest_ctxt)
+{
+	u64 pmbsr_el1;
+
+	if (!vcpu_has_spe(vcpu))
+		return;
+
+	/* See __kvm_spe_restore_guest_buffer() */
+	if (!kvm_spe_profiling_buffer_enabled_ctxt(guest_ctxt))
+		return;
+
+	psb_csync();
+	dsb(nsh);
+	/* Advance PMBPTR_EL1. */
+	isb();
+	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+	isb();
+
+	ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+
+	pmbsr_el1 = read_sysreg_s(SYS_PMBSR_EL1);
+	if (!FIELD_GET(PMBSR_EL1_S, pmbsr_el1))
+		return;
+
+	/* Stop the SPU from asserting PMBIRQ. */
+	write_sysreg_s(0, SYS_PMBSR_EL1);
+	isb();
+	/* PMBSR_EL1 changed while the VCPU was running, save it */
+	ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = pmbsr_el1;
+}
+NOKPROBE_SYMBOL(__kvm_spe_save_guest_buffer);
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 9984c492305a..14449d568405 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -593,6 +593,7 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	__kvm_adjust_pc(vcpu);
 
 	sysreg_restore_guest_state_vhe(guest_ctxt);
+	__kvm_spe_restore_guest_buffer(vcpu, guest_ctxt);
 	__debug_switch_to_guest(vcpu);
 
 	do {
@@ -603,6 +604,7 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
 	} while (fixup_guest_exit(vcpu, &exit_code));
 
 	sysreg_save_guest_state_vhe(guest_ctxt);
+	__kvm_spe_save_guest_buffer(vcpu, guest_ctxt);
 
 	__deactivate_traps(vcpu);
 
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index fa24e47a1e73..32156e43f454 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -169,6 +169,131 @@ u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding)
 	return val;
 }
 
+static void kvm_spe_save_sampling_regs(struct kvm_vcpu *vcpu, struct kvm_cpu_context *ctxt)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	ctxt_sys_reg(ctxt, PMSCR_EL1) = read_sysreg_el1(SYS_PMSCR);
+	if (kvm_spe_has_feat_spe_fne(kvm))
+		ctxt_sys_reg(ctxt, PMSNEVFR_EL1) = read_sysreg_s(SYS_PMSNEVFR_EL1);
+	ctxt_sys_reg(ctxt, PMSICR_EL1) = read_sysreg_s(SYS_PMSICR_EL1);
+	ctxt_sys_reg(ctxt, PMSIRR_EL1) = read_sysreg_s(SYS_PMSIRR_EL1);
+	ctxt_sys_reg(ctxt, PMSFCR_EL1) = read_sysreg_s(SYS_PMSFCR_EL1);
+	ctxt_sys_reg(ctxt, PMSEVFR_EL1) = read_sysreg_s(SYS_PMSEVFR_EL1);
+	ctxt_sys_reg(ctxt, PMSLATFR_EL1) = read_sysreg_s(SYS_PMSLATFR_EL1);
+	if (kvm_spe_has_feat_spe_fds(kvm))
+		ctxt_sys_reg(ctxt, PMSDSFR_EL1) = read_sysreg_s(SYS_PMSDSFR_EL1);
+}
+
+static void kvm_spe_restore_sampling_regs(struct kvm_vcpu *vcpu, struct kvm_cpu_context *ctxt)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	write_sysreg_el1(ctxt_sys_reg(ctxt, PMSCR_EL1), SYS_PMSCR);
+	if (kvm_spe_has_feat_spe_fne(kvm))
+		write_sysreg_s(ctxt_sys_reg(ctxt, PMSNEVFR_EL1), SYS_PMSNEVFR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSICR_EL1), SYS_PMSICR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSIRR_EL1), SYS_PMSIRR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSFCR_EL1), SYS_PMSFCR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSEVFR_EL1), SYS_PMSEVFR_EL1);
+	write_sysreg_s(ctxt_sys_reg(ctxt, PMSLATFR_EL1), SYS_PMSLATFR_EL1);
+	if (kvm_spe_has_feat_spe_fds(kvm))
+		write_sysreg_s(ctxt_sys_reg(ctxt, PMSDSFR_EL1), SYS_PMSDSFR_EL1);
+}
+
+void kvm_vcpu_spe_load(struct kvm_vcpu *vcpu)
+{
+	u64 host_pmblimitr_el1, host_pmscr_el2, host_pmbsr_el1;
+	struct kvm_cpu_context *host_ctxt;
+	struct kvm_cpu_context *guest_ctxt;
+
+	if (!vcpu_has_spe(vcpu) || unlikely(vcpu_on_unsupported_cpu(vcpu)))
+		return;
+
+	host_ctxt = host_data_ptr(host_ctxt);
+	guest_ctxt = &vcpu->arch.ctxt;
+
+	/* Disable interrupts to prevent races with the perf interrupt handler. */
+	local_irq_disable();
+
+	host_pmscr_el2 = read_sysreg_el2(SYS_PMSCR);
+	write_sysreg_el2(0, SYS_PMSCR);
+	/* Host was profiling, synchronize the write to PMSCR_EL2. */
+	if (FIELD_GET(PMSCR_EL2_E2SPE, host_pmscr_el2))
+		isb();
+
+	host_pmblimitr_el1 = read_sysreg_s(SYS_PMBLIMITR_EL1);
+	if (FIELD_GET(PMBLIMITR_EL1_E, host_pmblimitr_el1)) {
+		psb_csync();
+		dsb(nsh);
+		/*
+		 * Disable the buffer, to avoid the wrong translation table
+		 * entries being cached while KVM restores the guest context.
+		 */
+		write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+		/*
+		 * The ISB here has two uses: hardware updates to the host's
+		 * PMBPTR_EL1 register are made visible, and the write to
+		 * PMBLIMITR_EL1 is synchronized.
+		 */
+		isb();
+	}
+
+	host_pmbsr_el1 = read_sysreg_s(SYS_PMBSR_EL1);
+	if (FIELD_GET(PMBSR_EL1_S, host_pmbsr_el1)) {
+		/*
+		 * If the GIC asserts the interrupt after local_irq_enabled()
+		 * below, the perf interrupt handler will read PMBSR_EL1.S zero
+		 * and treat it as a spurious interrupt.
+		 */
+		write_sysreg_s(0, SYS_PMBSR_EL1);
+		isb();
+	}
+
+	local_irq_enable();
+
+	ctxt_sys_reg(host_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+	ctxt_sys_reg(host_ctxt, PMBLIMITR_EL1) = host_pmblimitr_el1;
+	ctxt_sys_reg(host_ctxt, PMBSR_EL1) = host_pmbsr_el1;
+	vcpu->arch.vcpu_spe.host_pmscr_el2 = host_pmscr_el2;
+
+	kvm_spe_save_sampling_regs(vcpu, host_ctxt);
+	kvm_spe_restore_sampling_regs(vcpu, guest_ctxt);
+}
+
+void kvm_vcpu_spe_put(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpu_context *host_ctxt;
+	struct kvm_cpu_context *guest_ctxt;
+	u64 host_pmblimitr_el1;
+	bool buffer_enabled;
+
+	if (!vcpu_has_spe(vcpu) || unlikely(vcpu_on_unsupported_cpu(vcpu)))
+		return;
+
+	guest_ctxt = &vcpu->arch.ctxt;
+	host_ctxt = host_data_ptr(host_ctxt);
+
+	kvm_spe_save_sampling_regs(vcpu, guest_ctxt);
+	kvm_spe_restore_sampling_regs(vcpu, host_ctxt);
+
+	write_sysreg_el2(vcpu->arch.vcpu_spe.host_pmscr_el2, SYS_PMSCR);
+	write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+	write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+
+	host_pmblimitr_el1 = ctxt_sys_reg(host_ctxt, PMBLIMITR_EL1);
+	buffer_enabled = FIELD_GET(PMBLIMITR_EL1_E, host_pmblimitr_el1);
+
+	/* Synchronise above writes before enabling the buffer. */
+	if (buffer_enabled)
+		isb();
+
+	write_sysreg_s(host_pmblimitr_el1, SYS_PMBLIMITR_EL1);
+	/* Everything is on the hardware, re-enable the host buffer. */
+	if (buffer_enabled)
+		isb();
+}
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
 {
 	u64 msb_idx, num_bits;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 23/35] KVM: arm64: Allow guest SPE physical timestamps only if perfmon_capable()
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (21 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 22/35] KVM: arm64: VHE: Context switch SPE state Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 24/35] KVM: arm64: Handle SPE hardware maintenance interrupts Alexandru Elisei
                   ` (12 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

The SPE driver allows userspace to use physical timestamps for records only
if the process if perfmon_capable(). Do the same for a virtual machine.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_spe.h |  1 +
 arch/arm64/kvm/spe.c             | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 077ca1e596b8..a61c1c1de76f 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -13,6 +13,7 @@
 struct kvm_spe {
 	struct arm_spe_pmu *arm_spu;
 	u64 max_buffer_size;	/* Maximum per VCPU buffer size */
+	u64 guest_pmscr_el2;
 };
 
 struct kvm_vcpu_spe {
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 32156e43f454..85a1ac8bb57f 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -4,6 +4,7 @@
  */
 
 #include <linux/bitops.h>
+#include <linux/capability.h>
 #include <linux/cpumask.h>
 #include <linux/kvm_host.h>
 #include <linux/perf/arm_spe_pmu.h>
@@ -64,6 +65,8 @@ void kvm_spe_init_vm(struct kvm *kvm)
 
 int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 {
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
 	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
 
 	if (!vcpu_has_spe(vcpu))
@@ -72,6 +75,12 @@ int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 	if (!vcpu_spe->initialized)
 		return -EINVAL;
 
+	if (kvm_vm_has_ran_once(kvm))
+		return 0;
+
+	if (perfmon_capable())
+		kvm_spe->guest_pmscr_el2 = PMSCR_EL2_PCT_PHYS;
+
 	return 0;
 }
 
@@ -217,8 +226,11 @@ void kvm_vcpu_spe_load(struct kvm_vcpu *vcpu)
 	local_irq_disable();
 
 	host_pmscr_el2 = read_sysreg_el2(SYS_PMSCR);
-	write_sysreg_el2(0, SYS_PMSCR);
-	/* Host was profiling, synchronize the write to PMSCR_EL2. */
+	write_sysreg_el2(vcpu->kvm->arch.kvm_spe.guest_pmscr_el2, SYS_PMSCR);
+	/*
+	 * Host was profiling, synchronize the write to PMSCR_EL2 with the guest
+	 * value which disables profiling at EL2.
+	 */
 	if (FIELD_GET(PMSCR_EL2_E2SPE, host_pmscr_el2))
 		isb();
 
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 24/35] KVM: arm64: Handle SPE hardware maintenance interrupts
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (22 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 23/35] KVM: arm64: Allow guest SPE physical timestamps only if perfmon_capable() Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 25/35] KVM: arm64: Add basic handling of SPE buffer control registers writes Alexandru Elisei
                   ` (11 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Re-inject all maintenance interrupts raised by SPE while the guest was
running.

Save the value of the hardware PMBSR_EL1 register in a separate variable,
instead of updating the VCPU sysreg directly, to detect when the SPU
asserted the interrupt, as opposed to the guest writing 1 to the
PMBSR_EL1.S bit.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_spe.h |  7 +++++++
 arch/arm64/kvm/arm.c             |  2 ++
 arch/arm64/kvm/hyp/vhe/spe-sr.c  |  2 +-
 arch/arm64/kvm/spe.c             | 27 +++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index a61c1c1de76f..7d8becf76314 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -17,9 +17,11 @@ struct kvm_spe {
 };
 
 struct kvm_vcpu_spe {
+	u64 hw_pmbsr_el1;	/* Updated on hardware management event */
 	u64 host_pmscr_el2;	/* Host PMSCR_EL2 register, context switched. */
 	int irq_num;		/* Buffer management interrupt number */
 	bool initialized;	/* SPE initialized for the VCPU */
+	bool irq_level;		/* Virtual buffer management interrupt level */
 };
 
 DECLARE_STATIC_KEY_FALSE(kvm_spe_available);
@@ -56,6 +58,8 @@ bool kvm_spe_has_feat_spe_fds(struct kvm *kvm);
 
 void kvm_vcpu_spe_load(struct kvm_vcpu *vcpu);
 void kvm_vcpu_spe_put(struct kvm_vcpu *vcpu);
+
+void kvm_spe_sync_hwstate(struct kvm_vcpu *vcpu);
 #else
 struct kvm_spe {
 };
@@ -111,6 +115,9 @@ static inline void kvm_vcpu_spe_load(struct kvm_vcpu *vcpu)
 static inline void kvm_vcpu_spe_put(struct kvm_vcpu *vcpu)
 {
 }
+static inline void kvm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
+{
+}
 #endif /* CONFIG_KVM_ARM_SPE */
 
 #endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index c5f5d5dbd695..a2c97daece24 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1272,6 +1272,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (kvm_vcpu_has_pmu(vcpu))
 			kvm_pmu_sync_hwstate(vcpu);
 
+		kvm_spe_sync_hwstate(vcpu);
+
 		/*
 		 * Sync the vgic state before syncing the timer state because
 		 * the timer code needs to know if the virtual timer
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
index fb8614435069..0fb625eadfa1 100644
--- a/arch/arm64/kvm/hyp/vhe/spe-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -75,6 +75,6 @@ void __kvm_spe_save_guest_buffer(struct kvm_vcpu *vcpu, struct kvm_cpu_context *
 	write_sysreg_s(0, SYS_PMBSR_EL1);
 	isb();
 	/* PMBSR_EL1 changed while the VCPU was running, save it */
-	ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = pmbsr_el1;
+	vcpu->arch.vcpu_spe.hw_pmbsr_el1 = pmbsr_el1;
 }
 NOKPROBE_SYMBOL(__kvm_spe_save_guest_buffer);
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 85a1ac8bb57f..d163ddfdd8e2 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -178,6 +178,33 @@ u64 kvm_spe_read_sysreg(struct kvm_vcpu *vcpu, int reg, u32 encoding)
 	return val;
 }
 
+static void kvm_spe_update_irq_level(struct kvm_vcpu *vcpu, bool level)
+{
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+	int ret;
+
+	if (vcpu_spe->irq_level == level)
+		return;
+
+	vcpu_spe->irq_level = level;
+	ret = kvm_vgic_inject_irq(vcpu->kvm, vcpu, vcpu_spe->irq_num, level, vcpu_spe);
+	WARN_ONCE(ret, "kvm_vgic_inject_irq");
+}
+
+void kvm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+
+	if (!vcpu_has_spe(vcpu))
+		return;
+
+	if (FIELD_GET(PMBSR_EL1_S, vcpu_spe->hw_pmbsr_el1)) {
+		__vcpu_assign_sys_reg(vcpu, PMBSR_EL1, vcpu_spe->hw_pmbsr_el1);
+		vcpu_spe->hw_pmbsr_el1 = 0;
+		kvm_spe_update_irq_level(vcpu, true);
+	}
+}
+
 static void kvm_spe_save_sampling_regs(struct kvm_vcpu *vcpu, struct kvm_cpu_context *ctxt)
 {
 	struct kvm *kvm = vcpu->kvm;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 25/35] KVM: arm64: Add basic handling of SPE buffer control registers writes
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (23 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 24/35] KVM: arm64: Handle SPE hardware maintenance interrupts Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 26/35] KVM: arm64: Add comment to explain how trapped SPE registers are handled Alexandru Elisei
                   ` (10 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

The buffer is controlled with three registers: PMBLIMITR_EL1, PMBPTR_EL1
and PMBSR_EL1.

PMBSR_EL1 is the most straightforward one to handle: update the status of
the virtual buffer management interrupt following a change in the
PMBSR_EL1.S value.

For the other two, at the moment KVM only cares about detecting eroneous
programming: either the buffer is larger than the maximum advertised in
PMBIDR_EL1.MaxBuffSize, or it has been misprogrammed.

Making sure that the stage 2 mappings for the buffer don't disappear while
ProfilingBufferEnabled() = true will be handled separately.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/sysreg.h |   1 +
 arch/arm64/kvm/spe.c            | 139 +++++++++++++++++++++++++++++++-
 2 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index c231d2a3e515..28388e12a251 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -350,6 +350,7 @@
 #define PMBSR_EL1_BUF_BSC_MASK		PMBSR_EL1_MSS_MASK
 
 #define PMBSR_EL1_BUF_BSC_FULL		0x1UL
+#define PMBSR_EL1_BUF_BSC_SIZE		0x4UL
 
 /*** End of Statistical Profiling Extension ***/
 
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index d163ddfdd8e2..6e8e0068e7e4 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -24,6 +24,9 @@ struct arm_spu_entry {
 };
 
 static u64 max_buffer_size_to_pmbidr_el1(u64 size);
+static void kvm_spe_update_irq_level(struct kvm_vcpu *vcpu, bool level);
+
+static u64 pmblimitr_el1_res0_mask = GENMASK_ULL(11, 8) | GENMASK_ULL(6, 3);
 
 void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 {
@@ -63,6 +66,33 @@ void kvm_spe_init_vm(struct kvm *kvm)
 	kvm->arch.kvm_spe.max_buffer_size = KVM_SPE_MAX_BUFFER_SIZE_UNSET;
 }
 
+static bool kvm_spe_has_physical_addrmode(struct kvm *kvm)
+{
+	return kvm_has_feat(kvm, ID_AA64DFR2_EL1, SPE_nVM, IMP);
+}
+
+static bool kvm_spe_has_discard_mode(struct kvm *kvm)
+{
+	return kvm_has_feat(kvm, ID_AA64DFR0_EL1, PMSVer, V1P2);
+}
+
+static void kvm_spe_compute_pmblimitr_el1_res0_mask(struct kvm *kvm)
+{
+	if (!kvm_spe_has_discard_mode(kvm))
+		pmblimitr_el1_res0_mask |= PMBLIMITR_EL1_FM;
+
+	if (!kvm_spe_has_physical_addrmode(kvm))
+		pmblimitr_el1_res0_mask |= PMBLIMITR_EL1_nVM;
+
+	if (kvm_has_feat(kvm, ID_AA64MMFR0_EL1, TGRAN4, IMP))
+		return;
+
+	if (kvm_has_feat(kvm, ID_AA64MMFR0_EL1, TGRAN16, IMP))
+		pmblimitr_el1_res0_mask |= GENMASK_ULL(13, 12);
+	else
+		pmblimitr_el1_res0_mask |= GENMASK_ULL(15, 12);
+}
+
 int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -81,6 +111,8 @@ int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 	if (perfmon_capable())
 		kvm_spe->guest_pmscr_el2 = PMSCR_EL2_PCT_PHYS;
 
+	kvm_spe_compute_pmblimitr_el1_res0_mask(kvm);
+
 	return 0;
 }
 
@@ -94,10 +126,113 @@ u8 kvm_spe_get_pmsver_limit(void)
 	return min(pmsver, ID_AA64DFR0_EL1_PMSVer_V1P5);
 }
 
+/* Implements OtherSPEManagementEvent() from ARM DDI0487L.b */
+static void kvm_spe_inject_other_event(struct kvm_vcpu *vcpu, u8 bsc)
+{
+	u64 pmbsr_el1 = __vcpu_sys_reg(vcpu, PMBSR_EL1);
+
+	pmbsr_el1 &= ~(PMBSR_EL1_MSS2 | PMBSR_EL1_EC | PMBSR_EL1_MSS);
+	pmbsr_el1 |= PMBSR_EL1_S;
+	pmbsr_el1 |= FIELD_PREP(PMBSR_EL1_MSS, bsc);
+
+	__vcpu_assign_sys_reg(vcpu, PMBSR_EL1, pmbsr_el1);
+
+	kvm_spe_update_irq_level(vcpu, true);
+}
+
+static u64 kvm_spe_max_buffer_size(struct kvm *kvm)
+{
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+
+	return kvm_spe->max_buffer_size;
+}
+
+static u16 kvm_spe_max_record_size(struct kvm *kvm)
+{
+	struct arm_spe_pmu *spu = kvm->arch.kvm_spe.arm_spu;
+
+	return spu->max_record_sz;
+}
+
+static u64 kvm_spe_buffer_limit(u64 pmblimitr_el1)
+{
+	return FIELD_GET(PMBLIMITR_EL1_LIMIT, pmblimitr_el1) << 12;
+}
+
+static u64 kvm_spe_buffer_ptr(u64 pmbptr_el1)
+{
+	return FIELD_GET(PMBPTR_EL1_PTR, pmbptr_el1);
+}
+
+static bool kvm_spe_profiling_buffer_enabled_vcpu(struct kvm_vcpu *vcpu)
+{
+	return kvm_spe_profiling_buffer_enabled(__vcpu_sys_reg(vcpu, PMBLIMITR_EL1),
+						__vcpu_sys_reg(vcpu, PMBSR_EL1));
+}
+
+static bool kvm_spe_in_discard_mode(u64 pmblimitr_el1)
+{
+	return FIELD_GET(PMBLIMITR_EL1_FM, pmblimitr_el1);
+}
+
+static bool kvm_spe_in_discard_mode_vcpu(struct kvm_vcpu *vcpu)
+{
+	return kvm_spe_in_discard_mode(__vcpu_sys_reg(vcpu, PMBLIMITR_EL1));
+}
+
+static u16 kvm_spe_min_align(struct kvm *kvm)
+{
+	struct arm_spe_pmu *spu = kvm->arch.kvm_spe.arm_spu;
+
+	return spu->align;
+}
+
 bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
-	__vcpu_assign_sys_reg(vcpu, val, reg);
+	struct kvm *kvm = vcpu->kvm;
+	u64 ptr, limit, max_buffer_size;
+
+	switch (reg) {
+	case PMBLIMITR_EL1:
+		val &= ~pmblimitr_el1_res0_mask;
+		break;
+	case PMBSR_EL1:
+		break;
+	case PMBPTR_EL1:
+		/* Treat bits PMBIDR_EL1.Align-1:0 as RES0. */
+		val = ALIGN_DOWN(val, kvm_spe_min_align(kvm));
+		break;
+	default:
+		WARN_ON_ONCE(true);
+	}
+
+	__vcpu_assign_sys_reg(vcpu, reg, val);
+	if (reg == PMBSR_EL1) {
+		kvm_spe_update_irq_level(vcpu,
+					 FIELD_GET(PMBSR_EL1_S, __vcpu_sys_reg(vcpu, PMBSR_EL1)));
+	}
+
+	if (!kvm_spe_profiling_buffer_enabled_vcpu(vcpu) || kvm_spe_in_discard_mode_vcpu(vcpu))
+		goto out;
 
+	ptr = kvm_spe_buffer_ptr(__vcpu_sys_reg(vcpu, PMBPTR_EL1));
+	limit = kvm_spe_buffer_limit(__vcpu_sys_reg(vcpu, PMBLIMITR_EL1));
+
+	/*
+	 * In the Arm ARM, Uint() performs a *signed* integer conversion.
+	 * Convert all members to signed to avoid C promotion to unsigned.
+	 */
+	if (!limit || (s64)ptr > (s64)limit - (s64)kvm_spe_max_record_size(kvm) ||
+	    FIELD_GET(GENMASK_ULL(63, 56), ptr) != FIELD_GET(GENMASK_ULL(63, 56), limit)) {
+		kvm_spe_inject_other_event(vcpu, PMBSR_EL1_BUF_BSC_FULL);
+		goto out;
+	}
+
+	max_buffer_size = kvm_spe_max_buffer_size(kvm);
+	if (max_buffer_size && limit - ptr > max_buffer_size)
+		kvm_spe_inject_other_event(vcpu, PMBSR_EL1_BUF_BSC_SIZE);
+
+out:
 	return true;
 }
 
@@ -144,7 +279,7 @@ static u64 kvm_spe_get_pmbidr_el1(struct kvm_vcpu *vcpu)
 	/* Filter out known RES0 bits. */
 	pmbidr_el1 &= ~(GENMASK_ULL(63, 48) | GENMASK_ULL(31, 12));
 
-	if (!kvm_has_feat(kvm, ID_AA64DFR2_EL1, SPE_nVM, IMP)) {
+	if (!kvm_spe_has_physical_addrmode(kvm)) {
 		pmbidr_el1 &= ~PMBIDR_EL1_AddrMode;
 		pmbidr_el1 |= PMBIDR_EL1_AddrMode_VM_ONLY;
 	}
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 26/35] KVM: arm64: Add comment to explain how trapped SPE registers are handled
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (24 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 25/35] KVM: arm64: Add basic handling of SPE buffer control registers writes Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 27/35] KVM: arm64: Make MTE functions public Alexandru Elisei
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

The SPE buffer registers are trapped, while the sampling control registers
are directly accessed by the guest. The in-memory value of PMBSR_EL1 can be
modified by both the guest, following a direct write, and the hardware,
following a hardware maintenence interrupt. The in-memory value is never
written to the hardware.

The rest of the buffer register are written to the hardware at different
times in the VCPU run loop.

Add a comment explaining all of this.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/kvm/spe.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 6e8e0068e7e4..b138b564413b 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -126,6 +126,46 @@ u8 kvm_spe_get_pmsver_limit(void)
 	return min(pmsver, ID_AA64DFR0_EL1_PMSVer_V1P5);
 }
 
+/*
+ * Note on register handling:
+ *
+ * - Only the buffer registers (this includes PMBIDR_EL1) and PMSIDR_EL1 are
+ *   trapped, the rest of the registers the guest can access directly.
+ *
+ *  - PMBIDR_EL1 is trapped so KVM can advertise to the guest the maximum buffer
+ *  size set by userspace.
+ *
+ *  - PMSIDR_EL1 is trapped to hide the presence of features which the VM does
+ *  not have, but the hardware implements.
+ *
+ * - PMBLIMITR_EL1:
+ *   * Guest value is written to hardware only when
+ *   kvm_spe_profiling_buffer_enabled() is true. This is done after KVM enables
+ *   stage 2.
+ *   * KVM always disables the buffer (PMBLIMITR_EL1.E=0) when exiting the
+ *   guest. This is done before stage 2 is disabled.
+ *   * In-memory value of the register is updated following a direct write to
+ *   the register by the guest.
+ *
+ * - PMBSR_EL1:
+ *   * In-memory value of the register is never written to hardware.
+ *   * The hardware value of the register is cleared on guest exit if KVM
+ *   detects that the service bit is set.
+ *   * In-memory value of the register is updated in the following situations:
+ *     - Following a direct write to the register by the guest.
+ *     - When the buffer has been misprogrammed.
+ *     - When the hardware asserts the management event interrupt.
+ *
+ * - PMBPTR_EL1:
+ *   * Guest value is written to hardware:
+ *     - Before entering the guest, if kvm_spe_profiling_buffer_enabled() is
+ *     true.
+ *   * In-memory value of the register is updated:
+ *     - Following a direct write to the register by the guest.
+ *     - On each exit from the guest, if kvm_spe_profiling_buffer_enabled() was
+ *     true when the guest was entered.
+ */
+
 /* Implements OtherSPEManagementEvent() from ARM DDI0487L.b */
 static void kvm_spe_inject_other_event(struct kvm_vcpu *vcpu, u8 bsc)
 {
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 27/35] KVM: arm64: Make MTE functions public
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (25 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 26/35] KVM: arm64: Add comment to explain how trapped SPE registers are handled Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 28/35] KVM: arm64: at: Use callback for reading descriptor Alexandru Elisei
                   ` (8 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Make sanitise_mte_tags() (renamed to kvm_sanitise_mte_tags()) and
kvm_vma_mte_allowed() public, to be used by the SPE virtualization code.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_mmu.h | 3 +++
 arch/arm64/kvm/mmu.c             | 9 +++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e4069f2ce642..37b84e9d4337 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -183,6 +183,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 int kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 
+bool kvm_vma_mte_allowed(struct vm_area_struct *vma);
+void kvm_sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, unsigned long size);
+
 phys_addr_t kvm_mmu_get_httbr(void);
 phys_addr_t kvm_get_idmap_vector(void);
 int __init kvm_mmu_init(u32 *hyp_va_bits);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7cc964af8d30..8abba9619c58 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1434,13 +1434,14 @@ static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
  * Must be called with kvm->mmu_lock held to ensure the memory remains mapped
  * while the tags are zeroed.
  */
-static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
-			      unsigned long size)
+void kvm_sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, unsigned long size)
 {
 	unsigned long i, nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(pfn);
 	struct folio *folio = page_folio(page);
 
+	lockdep_assert_held(&kvm->mmu_lock);
+
 	if (!kvm_has_mte(kvm))
 		return;
 
@@ -1462,7 +1463,7 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 	}
 }
 
-static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
+bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
@@ -1833,7 +1834,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (!fault_is_perm && !s2_force_noncacheable && kvm_has_mte(kvm)) {
 		/* Check the VMM hasn't introduced a new disallowed VMA */
 		if (mte_allowed) {
-			sanitise_mte_tags(kvm, pfn, vma_pagesize);
+			kvm_sanitise_mte_tags(kvm, pfn, vma_pagesize);
 		} else {
 			ret = -EFAULT;
 			goto out_unlock;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 28/35] KVM: arm64: at: Use callback for reading descriptor
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (26 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 27/35] KVM: arm64: Make MTE functions public Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2 Alexandru Elisei
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Allow callers of __kvm_translate_va() to use a custom defined function for
reading the translation table descriptor from guest memory. This will be
useful for SPE virtualization, where the translation tables must also be
mapped at stage 2 for the entire duration of the guest profiling session.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_nested.h |  6 ++++++
 arch/arm64/kvm/at.c                 | 17 +++++++++++++----
 arch/arm64/kvm/nested.c             |  7 ++++---
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index f7c06a840963..e9a73ea74f24 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -284,6 +284,11 @@ static inline unsigned int ps_to_output_size(unsigned int ps, bool pa52bit)
 	}
 }
 
+typedef int (*read_desc_fn)(struct kvm_vcpu *, gpa_t, void *, unsigned long);
+
+int kvm_vcpu_read_desc(struct kvm_vcpu *vcpu, gpa_t gpa, void *desc,
+		       unsigned long len);
+
 enum trans_regime {
 	TR_EL10,
 	TR_EL20,
@@ -305,6 +310,7 @@ struct s1_walk_filter {
 
 struct s1_walk_info {
 	struct s1_walk_filter	*filter;
+	read_desc_fn		read_desc;
 	u64	     		baddr;
 	enum trans_regime	regime;
 	unsigned int		max_oa_bits;
diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index be26d5aa668c..e1df4278d5d5 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -110,6 +110,12 @@ static u64 effective_tcr2(struct kvm_vcpu *vcpu, enum trans_regime regime)
 	return vcpu_read_sys_reg(vcpu, TCR2_EL2);
 }
 
+int kvm_vcpu_read_desc(struct kvm_vcpu *vcpu, gpa_t gpa, void *desc,
+		       unsigned long len)
+{
+	return kvm_read_guest(vcpu->kvm, gpa, desc, len);
+}
+
 static bool s1pie_enabled(struct kvm_vcpu *vcpu, enum trans_regime regime)
 {
 	if (!kvm_has_s1pie(vcpu->kvm))
@@ -142,6 +148,9 @@ static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 	unsigned int stride, x;
 	bool va55, tbi, lva;
 
+	if (!wi->read_desc)
+		return -EINVAL;
+
 	va55 = va & BIT(55);
 
 	if (vcpu_has_nv(vcpu)) {
@@ -414,7 +423,7 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 				return ret;
 		}
 
-		ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
+		ret = wi->read_desc(vcpu, ipa, &desc, sizeof(desc));
 		if (ret) {
 			fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level), false);
 			return ret;
@@ -1542,9 +1551,9 @@ void __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 }
 
 /*
- * Translate a VA for a given EL in a given translation regime, with
- * or without PAN. This requires wi->{regime, as_el0, pan} to be
- * set. The rest of the wi and wr should be 0-initialised.
+ * Translate a VA for a given EL in a given translation regime, with or without
+ * PAN. This requires wi->{read_desc, regime, as_el0, pan} to be set. The rest
+ * of the wi and wr should be 0-initialised.
  */
 int __kvm_translate_va(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 		       struct s1_walk_result *wr, u64 va)
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index f04cda40545b..92e94bb96bcc 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1196,9 +1196,10 @@ static int kvm_translate_vncr(struct kvm_vcpu *vcpu, bool *is_gmem)
 		invalidate_vncr(vt);
 
 		vt->wi = (struct s1_walk_info) {
-			.regime	= TR_EL20,
-			.as_el0	= false,
-			.pan	= false,
+			.read_desc	= kvm_vcpu_read_desc,
+			.regime		= TR_EL20,
+			.as_el0		= false,
+			.pan		= false,
 		};
 		vt->wr = (struct s1_walk_result){};
 	}
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (27 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 28/35] KVM: arm64: at: Use callback for reading descriptor Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 30/35] KVM: Propagate MMU event to the MMU notifier handlers Alexandru Elisei
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

If the SPU encounters a translation fault when it attempts to write a
profiling record to memory, it stops profiling and asserts the PMBIRQ
interrupt.  Interrupts are not delivered instantaneously to the CPU, and
this creates a profiling blackout window where the profiled CPU executes
instructions, but no samples are collected.

This is not desirable, and the SPE driver avoids it by keeping the buffer
mapped for the entire the profiling session.

KVM maps memory at stage 2 when the guest accesses it, following a fault on
a missing stage 2 translation, which means that the problem is present in a
SPE enabled virtual machine. Worse yet, the blackout windows are
unpredictable: the guest profiling the same process can during one
profiling session, not trigger any stage 2 faults (the entire buffer memory
is already mapped at stage 2), but worst case scenario, during another
profiling session, trigger stage 2 faults for every record it attempts to
write (if KVM keeps removing the buffer pages from stage 2), or something
in between - some records trigger a stage 2 fault, some don't.

The solution is for KVM to follow what the SPE driver does: keep the buffer
mapped at stage 2 while ProfilingBufferEnabled() is true. To accomplish
this, pin the host pages that correspond to the guest buffer, **and** the
pages that correspond to the stage 1 entries used for translating the
buffer guest virtual addresses.

When a guest enables profiling, KVM walks stage 1, finds the IPA mapping
for guest VA, pins the corresponding host page and installs the IPA to PA
mapping at stage 2.

Transparent huge pages can be split by split_huge_pmd() regardless of the
reference count (as per Documentation/mm/transhuge.rst). On arm64, going
from a block mapping to a PTE mapping requires break-before-make, and the
SPU will trigger a stage 2 fault if it happens to write to memory exactly
during the break part.  Avoid this by pre-splitting the THP using the
FOLL_SPLIT_PMD flag for gup.

Note that this is not enough to guarantee that the buffer remains mapped at
stage 2, as a page pinned longterm in the host can still be unmapped from
stage 2. Hugetlb-backed guest memory is also not working, because the flag
FOLL_SPLIT_PMD is incompatible with hugetlb.

And yet another glaring ommission is dirty page tracking: KVM will happily
mark a buffer page as read-only, even though that will likely cause a
buffer management event, with the associated blackout window.

All of these shortcomings will be handled in later patches.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_emulate.h |   9 +-
 arch/arm64/include/asm/kvm_spe.h     |   9 +
 arch/arm64/include/asm/sysreg.h      |   2 +
 arch/arm64/kvm/arm.c                 |   3 +
 arch/arm64/kvm/spe.c                 | 625 ++++++++++++++++++++++++++-
 5 files changed, 634 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index c9eab316398e..50174ab86a58 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -472,9 +472,9 @@ u64 kvm_vcpu_trap_get_perm_fault_granule(const struct kvm_vcpu *vcpu)
 	return BIT(ARM64_HW_PGTABLE_LEVEL_SHIFT(esr & ESR_ELx_FSC_LEVEL));
 }
 
-static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
+static __always_inline bool kvm_fsc_issea(const u8 fsc)
 {
-	switch (kvm_vcpu_trap_get_fault(vcpu)) {
+	switch (fsc) {
 	case ESR_ELx_FSC_EXTABT:
 	case ESR_ELx_FSC_SEA_TTW(-1) ... ESR_ELx_FSC_SEA_TTW(3):
 	case ESR_ELx_FSC_SECC:
@@ -485,6 +485,11 @@ static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
 	}
 }
 
+static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
+{
+	return kvm_fsc_issea(kvm_vcpu_trap_get_fault(vcpu));
+}
+
 static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
 {
 	u64 esr = kvm_vcpu_get_esr(vcpu);
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 7d8becf76314..6c091fbfc95d 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -11,6 +11,7 @@
 #ifdef CONFIG_KVM_ARM_SPE
 
 struct kvm_spe {
+	struct xarray pinned_pages;
 	struct arm_spe_pmu *arm_spu;
 	u64 max_buffer_size;	/* Maximum per VCPU buffer size */
 	u64 guest_pmscr_el2;
@@ -42,7 +43,9 @@ static inline bool kvm_spe_profiling_buffer_enabled(u64 pmblimitr_el1, u64 pmbsr
 }
 
 void kvm_spe_init_vm(struct kvm *kvm);
+void kvm_spe_destroy_vm(struct kvm *kvm);
 int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);
+void kvm_spe_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 u8 kvm_spe_get_pmsver_limit(void);
 
@@ -73,10 +76,16 @@ struct kvm_vcpu_spe {
 static inline void kvm_spe_init_vm(struct kvm *kvm)
 {
 }
+static inline void kvm_spe_destroy_vm(struct kvm *kvm)
+{
+}
 static inline int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 {
 	return 0;
 }
+static inline void kvm_spe_vcpu_destroy(struct kvm_vcpu *vcpu)
+{
+}
 static inline u8 kvm_spe_get_pmsver_limit(void)
 {
 	return 0;
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 28388e12a251..87bc46a68d51 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -345,6 +345,8 @@
 /* Buffer error reporting */
 #define PMBSR_EL1_FAULT_FSC_SHIFT	PMBSR_EL1_MSS_SHIFT
 #define PMBSR_EL1_FAULT_FSC_MASK	PMBSR_EL1_MSS_MASK
+#define PMBSR_EL1_FAULT_FSC_ALIGN	0x21
+#define PMBSR_EL1_FAULT_FSC_TTW0	0x4
 
 #define PMBSR_EL1_BUF_BSC_SHIFT		PMBSR_EL1_MSS_SHIFT
 #define PMBSR_EL1_BUF_BSC_MASK		PMBSR_EL1_MSS_MASK
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a2c97daece24..8da772690173 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -259,6 +259,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kfree(kvm->arch.sysreg_masks);
 	kvm_destroy_vcpus(kvm);
 
+	kvm_spe_destroy_vm(kvm);
+
 	kvm_unshare_hyp(kvm, kvm + 1);
 
 	kvm_arm_teardown_hypercalls(kvm);
@@ -517,6 +519,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 		free_hyp_memcache(&vcpu->arch.pkvm_memcache);
 	kvm_timer_vcpu_terminate(vcpu);
 	kvm_pmu_vcpu_destroy(vcpu);
+	kvm_spe_vcpu_destroy(vcpu);
 	kvm_vgic_vcpu_destroy(vcpu);
 	kvm_arm_vcpu_destroy(vcpu);
 }
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index b138b564413b..35848e4ff68b 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -8,9 +8,12 @@
 #include <linux/cpumask.h>
 #include <linux/kvm_host.h>
 #include <linux/perf/arm_spe_pmu.h>
+#include <linux/swap.h>
 
 #include <asm/kvm_emulate.h>
+#include <asm/kvm_mmu.h>
 #include <asm/kvm_spe.h>
+#include <asm/pgtable-hwdef.h>
 #include <asm/sysreg.h>
 
 DEFINE_STATIC_KEY_FALSE(kvm_spe_available);
@@ -23,8 +26,16 @@ struct arm_spu_entry {
 	struct arm_spe_pmu *arm_spu;
 };
 
+struct pinned_page {
+	DECLARE_BITMAP(vcpus, KVM_MAX_VCPUS);	/* The page is pinned on these VCPUs */
+	struct page *page;
+	gfn_t gfn;
+	bool writable;				/* Is the page mapped as writable? */
+};
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size);
 static void kvm_spe_update_irq_level(struct kvm_vcpu *vcpu, bool level);
+static void kvm_spe_unpin_buffer(struct kvm_vcpu *vcpu);
 
 static u64 pmblimitr_el1_res0_mask = GENMASK_ULL(11, 8) | GENMASK_ULL(6, 3);
 
@@ -63,7 +74,18 @@ void kvm_host_spe_init(struct arm_spe_pmu *arm_spu)
 
 void kvm_spe_init_vm(struct kvm *kvm)
 {
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+
 	kvm->arch.kvm_spe.max_buffer_size = KVM_SPE_MAX_BUFFER_SIZE_UNSET;
+	xa_init(pinned_pages);
+}
+
+void kvm_spe_destroy_vm(struct kvm *kvm)
+{
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+
+	WARN_ON_ONCE(!xa_empty(pinned_pages));
+	xa_destroy(pinned_pages);
 }
 
 static bool kvm_spe_has_physical_addrmode(struct kvm *kvm)
@@ -116,6 +138,14 @@ int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+void kvm_spe_vcpu_destroy(struct kvm_vcpu *vcpu)
+{
+	if (!vcpu_has_spe(vcpu))
+		return;
+
+	kvm_spe_unpin_buffer(vcpu);
+}
+
 u8 kvm_spe_get_pmsver_limit(void)
 {
 	unsigned int pmsver;
@@ -180,6 +210,534 @@ static void kvm_spe_inject_other_event(struct kvm_vcpu *vcpu, u8 bsc)
 	kvm_spe_update_irq_level(vcpu, true);
 }
 
+/* Implements DebugWriteFault() from ARM DDI0487L.a. */
+static void kvm_spe_inject_data_abort(struct kvm_vcpu *vcpu, u8 fst, bool s2)
+{
+	u64 pmbsr_el1 = __vcpu_sys_reg(vcpu, PMBSR_EL1);
+	u64 mss2 = 0, ec = 0, mss = 0;
+
+	pmbsr_el1 &= ~(PMBSR_EL1_MSS2 | PMBSR_EL1_EC | PMBSR_EL1_MSS);
+
+	ec = s2 ? PMBSR_EL1_EC_FAULT_S2 : PMBSR_EL1_EC_FAULT_S1;
+	mss = fst & GENMASK_ULL(5, 0);
+
+	pmbsr_el1 |= FIELD_PREP(PMBSR_EL1_MSS2, mss2);
+	pmbsr_el1 |= FIELD_PREP(PMBSR_EL1_EC, ec);
+	pmbsr_el1 |= FIELD_PREP(PMBSR_EL1_S, 1);
+	pmbsr_el1 |= FIELD_PREP(PMBSR_EL1_MSS, mss);
+
+	__vcpu_assign_sys_reg(vcpu, PMBSR_EL1, pmbsr_el1);
+
+	kvm_spe_update_irq_level(vcpu, true);
+}
+
+static void kvm_spe_unpin_buffer(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	struct pinned_page *pinned_page;
+	unsigned long gfn;
+	int idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	xa_lock(pinned_pages);
+
+	xa_for_each(pinned_pages, gfn, pinned_page) {
+		if (!test_bit(vcpu->vcpu_idx, pinned_page->vcpus))
+			continue;
+
+		clear_bit(vcpu->vcpu_idx, pinned_page->vcpus);
+		if (bitmap_empty(pinned_page->vcpus, KVM_MAX_VCPUS)) {
+			__xa_erase(pinned_pages, pinned_page->gfn);
+			unpin_user_pages_dirty_lock(&pinned_page->page, 1, pinned_page->writable);
+		}
+	}
+
+	xa_unlock(pinned_pages);
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+#define MAP_GPA_RET_NOTIFIER_RETRY	1
+#define MAP_GPA_RET_PAGE_EXIST		2
+
+#define PGTABLE_ACTION_NONE	0
+#define PGTABLE_RELAX_PERMS	(1 << 0)
+#define PGTABLE_MAP_GPA		(1 << 1)
+#define PGTABLE_MAKE_YOUNG	(1 << 31)
+
+/* Calls release_faultin_page(), regardless of the return value */
+static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, struct page *page,
+			   bool make_writable, bool mte_allowed, unsigned long mmu_seq,
+			   struct pinned_page *pinned_page)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	struct pinned_page *pp = NULL;
+	phys_addr_t hpa = page_to_phys(page);
+	gfn_t gfn = PHYS_PFN(gpa);
+	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
+	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
+	int action = PGTABLE_ACTION_NONE;
+	s8 level = S8_MAX;
+	kvm_pte_t pte = 0;
+	int ret;
+
+	read_lock(&kvm->mmu_lock);
+	if (mmu_invalidate_retry(kvm, mmu_seq)) {
+		ret = MAP_GPA_RET_NOTIFIER_RETRY;
+		goto mmu_unlock;
+	}
+
+	if (make_writable)
+		prot |= KVM_PGTABLE_PROT_W;
+
+	ret = kvm_pgtable_get_leaf(pgt, gpa, &pte, &level);
+	if (ret)
+		goto mmu_unlock;
+
+	if (kvm_pte_valid(pte)) {
+		enum kvm_pgtable_prot existing_prot;
+		phys_addr_t stage2_hpa;
+
+		/* Final sanity check. */
+		stage2_hpa = kvm_pte_to_phys(pte) + gpa % kvm_granule_size(level);
+		if (WARN_ON_ONCE(PHYS_PFN(stage2_hpa) != hfn)) {
+			ret = -EFAULT;
+			goto mmu_unlock;
+		}
+
+		existing_prot = kvm_pgtable_stage2_pte_prot(pte);
+		if (kvm_granule_size(level) != PAGE_SIZE) {
+			/* Break block mapping */
+			action = PGTABLE_MAP_GPA;
+		} else {
+			if (make_writable && !(existing_prot & KVM_PGTABLE_PROT_W))
+				action = PGTABLE_RELAX_PERMS;
+			if (!(pte & PTE_AF))
+				action |= PGTABLE_MAKE_YOUNG;
+		}
+	} else {
+		action = PGTABLE_MAP_GPA;
+	}
+
+	if (action == PGTABLE_MAP_GPA) {
+		read_unlock(&kvm->mmu_lock);
+		ret = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
+				kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
+		if (ret) {
+			kvm_release_faultin_page(kvm, page, false, make_writable);
+			goto out;
+		}
+		read_lock(&kvm->mmu_lock);
+		if (mmu_invalidate_retry(kvm, mmu_seq)) {
+			ret = MAP_GPA_RET_NOTIFIER_RETRY;
+			goto mmu_unlock;
+		}
+	}
+
+	/*
+	 * Serialize changes to stage 2 made by pinning the buffer - if multiple
+	 * VCPUs enable the buffer at the same time, they will race when pinning
+	 * the guest's stage 1 tables.
+	 */
+	xa_lock(pinned_pages);
+	pp = xa_load(pinned_pages, gfn);
+	if (pp) {
+		if (make_writable && !pp->writable) {
+			/*
+			 * GPA was made young when it was mapped, only need to
+			 * make it writable.
+			 */
+			action = PGTABLE_RELAX_PERMS;
+		} else {
+			/*
+			 * Another VCPU snuck in before we took the lock and
+			 * mapped the GPA, don't modify stage 2 twice.
+			 */
+			action = PGTABLE_ACTION_NONE;
+		}
+	}
+
+	if (!pp && !kvm_pte_valid(pte) && kvm_has_mte(kvm)) {
+		if (mte_allowed) {
+			kvm_sanitise_mte_tags(kvm, hfn, PAGE_SIZE);
+		} else {
+			ret = -EFAULT;
+			goto mmu_unlock;
+		}
+	}
+
+	ret = 0;
+	if (action & PGTABLE_RELAX_PERMS) {
+		ret = kvm_pgtable_stage2_relax_perms(pgt, gpa, prot, flags);
+	} else if (action & PGTABLE_MAP_GPA) {
+		ret = kvm_pgtable_stage2_map(pgt, gpa, PAGE_SIZE, hpa, prot,
+					     &vcpu->arch.mmu_page_cache, flags);
+	}
+	if (ret)
+		goto pages_unlock;
+
+	if (action & PGTABLE_MAKE_YOUNG)
+		kvm_pgtable_stage2_mkyoung(pgt, gpa, flags);
+
+	if (pp) {
+		pp->writable = make_writable;
+		set_bit(vcpu->vcpu_idx, pp->vcpus);
+
+		ret = MAP_GPA_RET_PAGE_EXIST;
+	} else {
+		pinned_page->page = page;
+		pinned_page->gfn = gfn;
+		pinned_page->writable = make_writable;
+		set_bit(vcpu->vcpu_idx, pinned_page->vcpus);
+
+		pp = __xa_store(pinned_pages, gfn, pinned_page, GFP_ATOMIC);
+		if (xa_is_err(pp)) {
+			ret = xa_err(pp);
+			goto pages_unlock;
+		}
+
+		ret = 0;
+	}
+
+pages_unlock:
+	xa_unlock(pinned_pages);
+mmu_unlock:
+	kvm_release_faultin_page(kvm, page, ret < 0, make_writable);
+	if (!ret && make_writable)
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);
+
+	read_unlock(&kvm->mmu_lock);
+out:
+	return ret;
+}
+
+static int kvm_spe_pin_hva_locked(hva_t hva, bool make_writable, struct page **page)
+{
+	unsigned int gup_flags;
+	long nr_pages;
+
+	/*
+	 * FOLL_SPLIT_PMD is what allows us to ignore the order of the folio and
+	 * how the page is mapped in the host and operate on a single page
+	 * instead of a higher order folio.
+	 *
+	 * Let's assume that we don't use FOLL_SPLIT_PMD and the pinned page is
+	 * mapped with a block mapping in the host's stage 1.  kvm_spe_map_gpa()
+	 * will map the pinned page at the PTE level, but any number of pages
+	 * from the block mapping in the host might not be mapped at stage 2.
+	 *
+	 * When KVM takes a stage 2 fault on an IPA that corresponds to an
+	 * unmapped page that is part of the block mapping at host's stage 1,
+	 * KVM will walk the host's stage 1 and conclude it can also map the IPA
+	 * with a block mapping at stage 2. This requires break-before-make at
+	 * stage 2, during which the SPU might observe the short lived invalid
+	 * entry and report a stage 2 fault.
+	 *
+	 * Note that a higher order pinned folio, mapped at the PTE level,
+	 * cannot be collapsed into a block mapping, but the reverse is not
+	 * true: a higher order folio can be split into PTEs regardless of its
+	 * elevated reference count (see split_huge_pmd()).
+	 */
+	gup_flags = FOLL_LONGTERM | FOLL_SPLIT_PMD | FOLL_HONOR_NUMA_FAULT | FOLL_HWPOISON;
+	if (make_writable)
+		gup_flags |= FOLL_WRITE;
+
+	nr_pages = pin_user_pages(hva, 1, gup_flags, page);
+
+	if (nr_pages < 0)
+		return nr_pages;
+	if (nr_pages == 0)
+		return -ENOMEM;
+	return 0;
+}
+
+static int kvm_spe_find_hva(struct kvm *kvm, gfn_t gfn, bool make_writable, hva_t *hva)
+{
+	struct kvm_memory_slot *memslot;
+	bool writable;
+
+	memslot = gfn_to_memslot(kvm, gfn);
+	/* Confidential things not yet supported */
+	if (kvm_slot_has_gmem(memslot))
+		return -EFAULT;
+	*hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
+	if (kvm_is_error_hva(*hva))
+		return -EFAULT;
+	if (make_writable && !writable)
+		return -EPERM;
+
+	return 0;
+}
+
+static bool kvm_spe_test_gpa_pinned(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	struct pinned_page *pp;
+
+	xa_lock(pinned_pages);
+
+	pp = xa_load(pinned_pages, PHYS_PFN(gpa));
+	if (!pp)
+		goto out_unlock;
+
+	/*
+	 * Only happens if the buffer overlaps with a translation table, which
+	 * is almost certainly a guest bug and hopefully exceedingly rare. To
+	 * avoid unnecessary complexity, pretend that the gpa is not pinned, and
+	 * kvm_spe_map_gpa() will fix things up. Sure, it means doing a lot of
+	 * unnecessary work, but it's all on the guest for programming the
+	 * buffer with the wrong translations.
+	 */
+	if (make_writable && !pp->writable)
+		goto out_unlock;
+
+	set_bit(vcpu->vcpu_idx, pp->vcpus);
+
+	xa_unlock(pinned_pages);
+	return true;
+
+out_unlock:
+	xa_unlock(pinned_pages);
+	return false;
+}
+
+static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	struct pinned_page *pinned_page;
+	unsigned long mmu_seq, tries;
+	struct vm_area_struct *vma;
+	gfn_t gfn = PHYS_PFN(gpa);
+	bool writable = false, mte_allowed = false;
+	struct page *page;
+	kvm_pfn_t hfn;
+	hva_t hva;
+	int ret;
+
+	WARN_ON_ONCE(!srcu_read_lock_held(&vcpu->kvm->srcu));
+
+	/*
+	 * For each buffer page, KVM needs to pin up to four pages, one for each
+	 * level of the guest's stage 1 translation tables. The first level
+	 * table is shared between each page of the buffer, and likely some of
+	 * the next levels too, so it's worth checking if a gpa is already
+	 * pinned.
+	 */
+	if (kvm_spe_test_gpa_pinned(vcpu, gpa, make_writable))
+		return 0;
+
+	ret = kvm_spe_find_hva(kvm, gfn, make_writable, &hva);
+	if (ret)
+		return ret;
+
+	scoped_guard(mmap_read_lock, current->mm) {
+		if (kvm_has_mte(kvm)) {
+			vma = vma_lookup(current->mm, hva);
+			if (!vma) {
+				kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+				return -EFAULT;
+			}
+			mte_allowed = kvm_vma_mte_allowed(vma);
+		}
+		ret = kvm_spe_pin_hva_locked(hva, make_writable, &page);
+		if (ret)
+			return ret;
+	}
+
+	pinned_page = kzalloc(sizeof(*pinned_page), GFP_KERNEL_ACCOUNT);
+	if (!pinned_page) {
+		ret = -ENOMEM;
+		goto out_unpin_page;
+	}
+	ret = xa_reserve(pinned_pages, gfn, GFP_KERNEL_ACCOUNT);
+	if (ret)
+		goto out_free;
+
+	mmu_seq = kvm->mmu_invalidate_seq;
+	smp_rmb();
+
+	hfn = page_to_pfn(page);
+
+	get_page(page);
+	ret = kvm_spe_map_gpa(vcpu, gpa, hfn, page, make_writable, mte_allowed, mmu_seq,
+			      pinned_page);
+	tries = 1;
+
+	while (ret == MAP_GPA_RET_NOTIFIER_RETRY) {
+		struct page *retry_page;
+
+		/*
+		 * mmu_seq has likely changed for benign reasons (a memory
+		 * allocation triggered reclaim/compaction, for example), but it
+		 * could have also changed because userspace did something that
+		 * KVM must handle, like changing the protection for the VMA
+		 * that backs the memslot. So walk stage 1 again instead of
+		 * failing prematurely.
+		 */
+		mmu_seq = kvm->mmu_invalidate_seq;
+		smp_rmb();
+
+		hfn = kvm_faultin_pfn(vcpu, gfn, make_writable, &writable, &retry_page);
+		if (hfn == KVM_PFN_ERR_HWPOISON) {
+			send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SIZE, current);
+			ret = 0;
+			goto out_release;
+		}
+		if (is_error_noslot_pfn(hfn)) {
+			ret = -EFAULT;
+			break;
+		}
+		if (WARN_ON_ONCE(retry_page != page)) {
+			kvm_release_page_unused(retry_page);
+			ret = -EFAULT;
+			break;
+		}
+		if (make_writable && !writable) {
+			kvm_release_page_unused(page);
+			ret = -EPERM;
+			break;
+		}
+
+		ret = kvm_spe_map_gpa(vcpu, gpa, hfn, page, make_writable, mte_allowed, mmu_seq,
+				      pinned_page);
+		/*
+		 * Choose the number of VCPUs as the limit on retrying because
+		 * the guest can enable SPE on all VCPUs at the same, and
+		 * pinning the buffer can lead to memory allocation or
+		 * migration, which increment the MMU notification count.
+		 */
+		tries++;
+		if (ret == MAP_GPA_RET_NOTIFIER_RETRY && tries == kvm->created_vcpus + 1)
+			ret = -EAGAIN;
+	}
+
+	if (ret < 0)
+		goto out_release;
+
+	switch (ret) {
+	case 0:
+		break;
+	case MAP_GPA_RET_PAGE_EXIST:
+		kfree(pinned_page);
+		pinned_page = NULL;
+		/* Unpin the page we pinned twice. */
+		unpin_user_pages_dirty_lock(&page, 1, make_writable);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+	}
+
+	/* Treat all non-negative return codes as success. */
+	return 0;
+
+out_release:
+	xa_release(pinned_pages, gfn);
+out_free:
+	kfree(pinned_page);
+out_unpin_page:
+	unpin_user_pages_dirty_lock(&page, 1, make_writable);
+	return ret;
+}
+
+/*
+ * Read the address of the next level translation table and pin the table at the
+ * current translation level.
+ *
+ * Called with KVM's SRCU lock held.
+ */
+static int kvm_spe_pin_buffer_read_desc(struct kvm_vcpu *vcpu, gpa_t gpa, void *data,
+					      unsigned long len)
+{
+	int ret;
+
+	/* Page descriptors are always 64 bits. */
+	if (WARN_ON_ONCE(len != 8))
+		return -EINVAL;
+
+	ret = kvm_read_guest(vcpu->kvm, gpa, data, len);
+	if (ret)
+		return ret;
+
+	return kvm_spe_pin_gpa(vcpu, gpa, false);
+}
+
+static bool kvm_spe_pin_buffer(struct kvm_vcpu *vcpu, u64 ptr, u64 limit)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct s1_walk_result wr = {};
+	struct s1_walk_info wi = {
+		.read_desc 	= kvm_spe_pin_buffer_read_desc,
+		.regime		= TR_EL10,
+		.as_el0		= false,
+		.pan		= false,
+	};
+	bool commit_write, s2_err;
+	int idx, ret;
+	u8 fst = 0;
+
+	/* KVM can only pin memory at the host's PAGE_SIZE granularity. */
+	ptr = PAGE_ALIGN_DOWN(ptr);
+	limit = PAGE_ALIGN(limit);
+
+	idx = srcu_read_lock(&kvm->srcu);
+	for (; ptr < limit; ptr += PAGE_SIZE) {
+		ret = __kvm_translate_va(vcpu, &wi, &wr, ptr);
+		if (ret) {
+			fst = wr.fst;
+			s2_err = wr.s2;
+			break;
+		}
+		if (!wr.pw) {
+			/* I_GQYCH */
+			fst = ESR_ELx_FSC_PERM_L(wr.level);
+			s2_err = false;
+			ret = -EPERM;
+			break;
+		}
+
+		ret = kvm_spe_pin_gpa(vcpu, wr.pa, true);
+		if (ret) {
+			if (ret == -EPERM)
+				fst = ESR_ELx_FSC_PERM_L(wr.level);
+			s2_err = true;
+			break;
+		}
+	}
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	if (!ret)
+		return true;
+
+	switch (ret) {
+	case -EAGAIN:
+		commit_write = false;
+		break;
+	case -EPERM:
+		if (!fst)
+			fst = ESR_ELx_FSC_PERM_L(1);
+		kvm_spe_inject_data_abort(vcpu, fst, s2_err);
+		commit_write = true;
+		break;
+	case -ENOMEM:
+		kvm_spe_inject_other_event(vcpu, PMBSR_EL1_BUF_BSC_SIZE);
+		commit_write = true;
+		break;
+	default:
+		if (!fst)
+			fst = ESR_ELx_FSC_FAULT_L(0);
+		kvm_spe_inject_data_abort(vcpu, fst, s2_err);
+		commit_write = true;
+	}
+
+	kvm_spe_unpin_buffer(vcpu);
+
+	return commit_write;
+}
+
 static u64 kvm_spe_max_buffer_size(struct kvm *kvm)
 {
 	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
@@ -229,8 +787,14 @@ static u16 kvm_spe_min_align(struct kvm *kvm)
 
 bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 {
+	u64 pmbptr_el1, pmblimitr_el1, pmbsr_el1;
+	bool was_enabled, now_enabled;
 	struct kvm *kvm = vcpu->kvm;
 	u64 ptr, limit, max_buffer_size;
+	bool commit_write;
+
+	was_enabled = kvm_spe_profiling_buffer_enabled_vcpu(vcpu) &&
+		      !kvm_spe_in_discard_mode_vcpu(vcpu);
 
 	switch (reg) {
 	case PMBLIMITR_EL1:
@@ -244,19 +808,32 @@ bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 		break;
 	default:
 		WARN_ON_ONCE(true);
+		goto commit_write;
 	}
 
-	__vcpu_assign_sys_reg(vcpu, reg, val);
-	if (reg == PMBSR_EL1) {
-		kvm_spe_update_irq_level(vcpu,
-					 FIELD_GET(PMBSR_EL1_S, __vcpu_sys_reg(vcpu, PMBSR_EL1)));
-	}
+	/*
+	 * Don't update the VCPU register just yet, we might be required to
+	 * replay the access to retry pinning the buffer.
+	 */
 
-	if (!kvm_spe_profiling_buffer_enabled_vcpu(vcpu) || kvm_spe_in_discard_mode_vcpu(vcpu))
-		goto out;
+	pmbptr_el1 = reg == PMBPTR_EL1 ? val : __vcpu_sys_reg(vcpu, PMBPTR_EL1);
+	pmblimitr_el1 = reg == PMBLIMITR_EL1 ? val : __vcpu_sys_reg(vcpu, PMBLIMITR_EL1);
+	pmbsr_el1 = reg == PMBSR_EL1 ? val : __vcpu_sys_reg(vcpu, PMBSR_EL1);
 
-	ptr = kvm_spe_buffer_ptr(__vcpu_sys_reg(vcpu, PMBPTR_EL1));
-	limit = kvm_spe_buffer_limit(__vcpu_sys_reg(vcpu, PMBLIMITR_EL1));
+	now_enabled = kvm_spe_profiling_buffer_enabled(pmblimitr_el1, pmbsr_el1) &&
+		      !kvm_spe_in_discard_mode(pmblimitr_el1);
+
+	if (!was_enabled && !now_enabled)
+		goto commit_write;
+
+	if (was_enabled)
+		kvm_spe_unpin_buffer(vcpu);
+
+	if (!now_enabled)
+		goto commit_write;
+
+	ptr = kvm_spe_buffer_ptr(pmbptr_el1);
+	limit = kvm_spe_buffer_limit(pmblimitr_el1);
 
 	/*
 	 * In the Arm ARM, Uint() performs a *signed* integer conversion.
@@ -265,14 +842,34 @@ bool kvm_spe_write_sysreg(struct kvm_vcpu *vcpu, int reg, u64 val)
 	if (!limit || (s64)ptr > (s64)limit - (s64)kvm_spe_max_record_size(kvm) ||
 	    FIELD_GET(GENMASK_ULL(63, 56), ptr) != FIELD_GET(GENMASK_ULL(63, 56), limit)) {
 		kvm_spe_inject_other_event(vcpu, PMBSR_EL1_BUF_BSC_FULL);
-		goto out;
+		goto buffer_management_event;
 	}
 
 	max_buffer_size = kvm_spe_max_buffer_size(kvm);
-	if (max_buffer_size && limit - ptr > max_buffer_size)
+	if (max_buffer_size && limit - ptr > max_buffer_size) {
 		kvm_spe_inject_other_event(vcpu, PMBSR_EL1_BUF_BSC_SIZE);
+		goto buffer_management_event;
+	}
 
-out:
+	commit_write = kvm_spe_pin_buffer(vcpu, ptr, limit);
+	if (!commit_write)
+		return false;
+
+commit_write:
+	__vcpu_assign_sys_reg(vcpu, reg, val);
+	if (reg == PMBSR_EL1) {
+		kvm_spe_update_irq_level(vcpu,
+					 FIELD_GET(PMBSR_EL1_S, __vcpu_sys_reg(vcpu, PMBSR_EL1)));
+	}
+	return true;
+
+buffer_management_event:
+	/*
+	 * Injecting an event modifies PMBSR_EL1, make sure the write doesn't
+	 * overwrite it.
+	 */
+	if (reg != PMBSR_EL1)
+		__vcpu_assign_sys_reg(vcpu, reg, val);
 	return true;
 }
 
@@ -327,6 +924,9 @@ static u64 kvm_spe_get_pmbidr_el1(struct kvm_vcpu *vcpu)
 	pmbidr_el1 &= ~PMBIDR_EL1_MaxBuffSize;
 	pmbidr_el1 |= max_buffer_size_to_pmbidr_el1(max_buffer_size);
 
+	/* TODO: Implement support for FEAT_HAFDBS in the table walker. */
+	pmbidr_el1 &= ~PMBIDR_EL1_F;
+
 	return pmbidr_el1;
 }
 
@@ -375,6 +975,7 @@ void kvm_spe_sync_hwstate(struct kvm_vcpu *vcpu)
 
 	if (FIELD_GET(PMBSR_EL1_S, vcpu_spe->hw_pmbsr_el1)) {
 		__vcpu_assign_sys_reg(vcpu, PMBSR_EL1, vcpu_spe->hw_pmbsr_el1);
+		kvm_spe_unpin_buffer(vcpu);
 		vcpu_spe->hw_pmbsr_el1 = 0;
 		kvm_spe_update_irq_level(vcpu, true);
 	}
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 30/35] KVM: Propagate MMU event to the MMU notifier handlers
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (28 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2 Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 31/35] KVM: arm64: Handle MMU notifiers for the SPE buffer Alexandru Elisei
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

arm64 will want to perform a different action for MMU notifiers based on
the reason for the notifier. Propagate the reason for the MMU notifier down
to the arch code. Where there is no primary MMU event associated with a
notifier callback, add our own.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 include/linux/kvm_host.h | 17 +++++++++++++++++
 virt/kvm/kvm_main.c      |  8 ++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5bd76cf394fa..772e75d13af1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,12 +263,29 @@ enum kvm_gfn_range_filter {
 	KVM_FILTER_PRIVATE		= BIT(1),
 };
 
+enum kvm_mmu_notifier_event {
+	KVM_MMU_NOTIFY_UNMAP            = MMU_NOTIFY_UNMAP,
+	KVM_MMU_NOTIFY_CLEAR            = MMU_NOTIFY_CLEAR,
+	KVM_MMU_NOTIFY_PROTECTION_VMA   = MMU_NOTIFY_PROTECTION_VMA,
+	KVM_MMU_NOTIFY_PROTECTION_PAGE  = MMU_NOTIFY_PROTECTION_PAGE,
+	KVM_MMU_NOTIFY_SOFT_DIRTY       = MMU_NOTIFY_SOFT_DIRTY,
+	KVM_MMU_NOTIFY_RELEASE          = MMU_NOTIFY_RELEASE,
+	KVM_MMU_NOTIFY_MIGRATE          = MMU_NOTIFY_MIGRATE,
+	KVM_MMU_NOTIFY_EXCLUSIVE        = MMU_NOTIFY_EXCLUSIVE,
+	KVM_MMU_NOTIFY_AGE		= 32,
+	KVM_MMU_NOTIFY_MEMORY_ATTRIBUTES,
+	KVM_MMU_NOTIFY_ARCH1,
+	KVM_MMU_NOTIFY_ARCH2,
+	KVM_MMU_NOTIFY_ARCH3,
+};
+
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
 	gfn_t end;
 	union kvm_mmu_notifier_arg arg;
 	enum kvm_gfn_range_filter attr_filter;
+	enum kvm_mmu_notifier_event event;
 	bool may_block;
 	bool lockless;
 };
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7a0ae2a7b20..2dce50bcb181 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -521,6 +521,7 @@ struct kvm_mmu_notifier_range {
 	union kvm_mmu_notifier_arg arg;
 	gfn_handler_t handler;
 	on_lock_fn_t on_lock;
+	enum kvm_mmu_notifier_event event;
 	bool flush_on_ret;
 	bool may_block;
 	bool lockless;
@@ -618,6 +619,7 @@ static __always_inline kvm_mn_ret_t kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.event = range->event;
 			gfn_range.lockless = range->lockless;
 
 			if (!r.found_memslot) {
@@ -660,6 +662,7 @@ static __always_inline int kvm_age_hva_range(struct mmu_notifier *mn,
 		.handler	= handler,
 		.on_lock	= (void *)kvm_null_fn,
 		.flush_on_ret	= flush_on_ret,
+		.event		= KVM_MMU_NOTIFY_AGE,
 		.may_block	= false,
 		.lockless	= IS_ENABLED(CONFIG_KVM_MMU_LOCKLESS_AGING),
 	};
@@ -732,6 +735,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
 		.on_lock	= kvm_mmu_invalidate_begin,
+		.event		= (enum kvm_mmu_notifier_event)range->event,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -808,6 +812,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 		.end		= range->end,
 		.handler	= (void *)kvm_null_fn,
 		.on_lock	= kvm_mmu_invalidate_end,
+		.event		= (enum kvm_mmu_notifier_event)range->event,
 		.flush_on_ret	= false,
 		.may_block	= mmu_notifier_range_blockable(range),
 	};
@@ -2482,6 +2487,7 @@ static __always_inline void kvm_handle_gfn_range(struct kvm *kvm,
 
 	gfn_range.arg = range->arg;
 	gfn_range.may_block = range->may_block;
+	gfn_range.event = range->event;
 
 	/*
 	 * If/when KVM supports more attributes beyond private .vs shared, this
@@ -2550,6 +2556,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.arg.attributes = attributes,
 		.handler = kvm_pre_set_memory_attributes,
 		.on_lock = kvm_mmu_invalidate_begin,
+		.event = KVM_MMU_NOTIFY_MEMORY_ATTRIBUTES,
 		.flush_on_ret = true,
 		.may_block = true,
 	};
@@ -2559,6 +2566,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.arg.attributes = attributes,
 		.handler = kvm_arch_post_set_memory_attributes,
 		.on_lock = kvm_mmu_invalidate_end,
+		.event = KVM_MMU_NOTIFY_MEMORY_ATTRIBUTES,
 		.may_block = true,
 	};
 	unsigned long i;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 31/35] KVM: arm64: Handle MMU notifiers for the SPE buffer
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (29 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 30/35] KVM: Propagate MMU event to the MMU notifier handlers Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 32/35] KVM: Add KVM_EXIT_RLIMIT exit_reason Alexandru Elisei
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

KVM makes changes to the stage 2 for two reasons: to mirror the changes
that happen to the host's stage 1, and when userspace makes a direct
change to the memory of the virtual machine.

Explicit changes made to the VM by userspace - like changing a memslot or
clearing the list of dirty pages, are immediately honored when it comes to
memory pinned and mapped at stage 2 for the SPE buffer. The only caveat is
when it comes to making the buffer pages read-only: to avoid a blackout
window, KVM skips making the affected buffer pages readonly and instead
immediately redirties them.

Changes to the host's stage 1 that affect the stage 2 entries for the
buffer broadly fall in two categories: changes that are attempted, but
never executed because the memory is pinned, and changes that are
committed.

The first type of changes share a similar cause: the reference count for a
page or folio is incremented with page table spinlock held, but the MMU
notifiers must be invoked from preemptible contexts. As a result, features
like THP collapse (khugepaged),  automatic NUMA balancing, KSM, etc. use
the following pattern for modifying the host's stage 1:

	mmu_notifier_invalidate_range_start(&range)
	pte = pte_offset_map_lock(.., &ptl)
	if (page_maybe_dma_pinned(page))
		goto out_unlock
	/* do stuff */

out_unlock:
	spin_unlock(ptl)
	mmu_notifier_invalidate_range_end(&range)

It is safe for KVM to ignore these type of changes, because the host's page
table won't be modified.

Changes to the host's stage 1 that are committed will be reflected in the
buffer stage 2 entries. The only exception is the access flag, for the same
reasoning as making the entries read-only.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |   2 +
 arch/arm64/include/asm/kvm_mmu.h  |   7 +-
 arch/arm64/include/asm/kvm_spe.h  |  19 +++
 arch/arm64/kvm/arm.c              |  14 +-
 arch/arm64/kvm/mmu.c              | 125 ++++++++++++----
 arch/arm64/kvm/nested.c           |   9 +-
 arch/arm64/kvm/spe.c              | 232 +++++++++++++++++++++++++++++-
 arch/arm64/kvm/sys_regs.c         |   5 +-
 arch/arm64/kvm/vgic/vgic-its.c    |   4 +-
 include/kvm/arm_vgic.h            |   2 +
 include/linux/kvm_host.h          |   2 +
 11 files changed, 380 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 876957320672..e79ec480d1d1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -351,6 +351,8 @@ struct kvm_arch {
 #define KVM_ARCH_FLAG_GUEST_HAS_SVE			9
 	/* MIDR_EL1, REVIDR_EL1, and AIDR_EL1 are writable from userspace */
 #define KVM_ARCH_FLAG_WRITABLE_IMP_ID_REGS		10
+	/* Statistical Profiling Extension enabled for the guest */
+#define KVM_ARCH_FLAG_SPE_ENABLED			11
 	unsigned long flags;
 
 	/* VM-wide vCPU feature set */
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 37b84e9d4337..a4a0e00d1bbb 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -100,6 +100,10 @@ alternative_cb_end
 #include <asm/kvm_host.h>
 #include <asm/kvm_nested.h>
 
+#define KVM_MMU_NOTIFY_CMO		KVM_MMU_NOTIFY_ARCH1
+#define KVM_MMU_NOTIFY_SHADOW_S2	KVM_MMU_NOTIFY_ARCH2
+#define KVM_MMU_NOTIFY_SPLIT_HUGE_PAGE	KVM_MMU_NOTIFY_ARCH3
+
 void kvm_update_va_mask(struct alt_instr *alt,
 			__le32 *origptr, __le32 *updptr, int nr_inst);
 void kvm_compute_layout(void);
@@ -168,8 +172,9 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 int create_hyp_stack(phys_addr_t phys_addr, unsigned long *haddr);
 void __init free_hyp_pgds(void);
 
+enum kvm_mmu_notifier_event;
 void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
-			    u64 size, bool may_block);
+			    u64 size, bool may_block, enum kvm_mmu_notifier_event event);
 void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
 void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
 
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 6c091fbfc95d..59a0e825a226 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -8,6 +8,8 @@
 
 #include <linux/kvm.h>
 
+#include <asm/stage2_pgtable.h>
+
 #ifdef CONFIG_KVM_ARM_SPE
 
 struct kvm_spe {
@@ -15,6 +17,7 @@ struct kvm_spe {
 	struct arm_spe_pmu *arm_spu;
 	u64 max_buffer_size;	/* Maximum per VCPU buffer size */
 	u64 guest_pmscr_el2;
+	bool dirtying_pages;
 };
 
 struct kvm_vcpu_spe {
@@ -35,6 +38,9 @@ static __always_inline bool kvm_supports_spe(void)
 #define vcpu_has_spe(vcpu)					\
 	(vcpu_has_feature(vcpu, KVM_ARM_VCPU_SPE))
 
+#define kvm_has_spe(kvm)					\
+	(test_bit(KVM_ARCH_FLAG_SPE_ENABLED, &(kvm)->arch.flags))
+
 /* Implements the function ProfilingBufferEnabled() from ARM DDI0487K.a */
 static inline bool kvm_spe_profiling_buffer_enabled(u64 pmblimitr_el1, u64 pmbsr_el1)
 {
@@ -47,6 +53,14 @@ void kvm_spe_destroy_vm(struct kvm *kvm);
 int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);
 void kvm_spe_vcpu_destroy(struct kvm_vcpu *vcpu);
 
+bool kvm_spe_allow_write_without_running_vcpu(struct kvm *kvm);
+
+enum kvm_mmu_notifier_event;
+phys_addr_t kvm_spe_adjust_range_start(struct kvm *kvm, phys_addr_t start, phys_addr_t end,
+				       enum kvm_mmu_notifier_event event);
+phys_addr_t kvm_spe_adjust_range_end(struct kvm *kvm, phys_addr_t start, phys_addr_t end,
+				     enum kvm_mmu_notifier_event event);
+
 u8 kvm_spe_get_pmsver_limit(void);
 
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -72,6 +86,7 @@ struct kvm_vcpu_spe {
 
 #define kvm_supports_spe()	false
 #define vcpu_has_spe(vcpu)	false
+#define kvm_has_spe(kvm)	false
 
 static inline void kvm_spe_init_vm(struct kvm *kvm)
 {
@@ -86,6 +101,10 @@ static inline int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 static inline void kvm_spe_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
 }
+static inline bool kvm_spe_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+	return false;
+}
 static inline u8 kvm_spe_get_pmsver_limit(void)
 {
 	return 0;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 8da772690173..d05dbb6d2d7a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -777,6 +777,16 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
 		&& !kvm_arm_vcpu_stopped(v) && !v->arch.pause);
 }
 
+/*
+ * kvm_arch_allow_write_without_running_vcpu - allow writing guest memory
+ * without a running VCPU when dirty ring is enabled.
+ */
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+       return kvm_vgic_allow_write_without_running_vcpu(kvm) ||
+	      kvm_spe_allow_write_without_running_vcpu(kvm);
+}
+
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
 {
 	return vcpu_mode_priv(vcpu);
@@ -1275,8 +1285,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (kvm_vcpu_has_pmu(vcpu))
 			kvm_pmu_sync_hwstate(vcpu);
 
-		kvm_spe_sync_hwstate(vcpu);
-
 		/*
 		 * Sync the vgic state before syncing the timer state because
 		 * the timer code needs to know if the virtual timer
@@ -1326,6 +1334,8 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 
 		preempt_enable();
 
+		kvm_spe_sync_hwstate(vcpu);
+
 		/*
 		 * The ARMv8 architecture doesn't give the hypervisor
 		 * a mechanism to prevent a guest from dropping to AArch32 EL0
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 8abba9619c58..de48fb7c0fff 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -37,6 +37,22 @@ static unsigned long __ro_after_init io_map_base;
 
 #define KVM_PGT_FN(fn)		(!is_protected_kvm_enabled() ? fn : p ## fn)
 
+#ifndef CONFIG_KVM_ARM_SPE
+static inline phys_addr_t
+kvm_spe_adjust_range_start(struct kvm *kvm, phys_addr_t start, phys_addr_t end,
+			   enum kvm_mmu_notifier_event event)
+{
+	return start;
+}
+
+static inline phys_addr_t
+kvm_spe_adjust_range_end(struct kvm *kvm, phys_addr_t start, phys_addr_t end,
+			 enum kvm_mmu_notifier_event event)
+{
+	return end;
+}
+#endif
+
 static phys_addr_t __stage2_range_addr_end(phys_addr_t addr, phys_addr_t end,
 					   phys_addr_t size)
 {
@@ -62,10 +78,10 @@ static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
 static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
 			      phys_addr_t end,
 			      int (*fn)(struct kvm_pgtable *, u64, u64),
-			      bool resched)
+			      bool resched, enum kvm_mmu_notifier_event event)
 {
 	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
-	int ret;
+	int ret = 0;
 	u64 next;
 
 	do {
@@ -73,7 +89,15 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
 		if (!pgt)
 			return -EINVAL;
 
+		if (kvm_has_spe(kvm)) {
+			addr = kvm_spe_adjust_range_start(kvm, addr, end, event);
+			if (addr == end)
+				break;
+		}
+
 		next = stage2_range_addr_end(addr, end);
+		if (kvm_has_spe(kvm))
+			next = kvm_spe_adjust_range_end(kvm, addr, next, event);
 		ret = fn(pgt, addr, next - addr);
 		if (ret)
 			break;
@@ -85,8 +109,8 @@ static int stage2_apply_range(struct kvm_s2_mmu *mmu, phys_addr_t addr,
 	return ret;
 }
 
-#define stage2_apply_range_resched(mmu, addr, end, fn)			\
-	stage2_apply_range(mmu, addr, end, fn, true)
+#define stage2_apply_range_resched(mmu, addr, end, fn, event)		\
+	stage2_apply_range(mmu, addr, end, fn, true, event)
 
 /*
  * Get the maximum number of page-tables pages needed to split a range
@@ -122,8 +146,9 @@ static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
 {
 	struct kvm_mmu_memory_cache *cache;
 	struct kvm_pgtable *pgt;
-	int ret, cache_capacity;
 	u64 next, chunk_size;
+	int cache_capacity;
+	int ret = 0;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
@@ -152,7 +177,18 @@ static int kvm_mmu_split_huge_pages(struct kvm *kvm, phys_addr_t addr,
 		if (!pgt)
 			return -EINVAL;
 
+		if (kvm_has_spe(kvm)) {
+			addr = kvm_spe_adjust_range_start(kvm, addr, end,
+					KVM_MMU_NOTIFY_SPLIT_HUGE_PAGE);
+			if (addr == end)
+				break;
+		}
+
 		next = __stage2_range_addr_end(addr, end, chunk_size);
+		if (kvm_has_spe(kvm)) {
+			next = kvm_spe_adjust_range_end(kvm, addr, next,
+							KVM_MMU_NOTIFY_SPLIT_HUGE_PAGE);
+		}
 		ret = KVM_PGT_FN(kvm_pgtable_stage2_split)(pgt, addr, next - addr, cache);
 		if (ret)
 			break;
@@ -319,6 +355,7 @@ static void invalidate_icache_guest_page(void *va, size_t size)
  * @start: The intermediate physical base address of the range to unmap
  * @size:  The size of the area to unmap
  * @may_block: Whether or not we are permitted to block
+ * @event: MMU notifier event
  *
  * Clear a range of stage-2 mappings, lowering the various ref-counts.  Must
  * be called while holding mmu_lock (unless for freeing the stage2 pgd before
@@ -326,7 +363,7 @@ static void invalidate_icache_guest_page(void *va, size_t size)
  * with things behind our backs.
  */
 static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size,
-				 bool may_block)
+				 bool may_block, enum kvm_mmu_notifier_event event)
 {
 	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
 	phys_addr_t end = start + size;
@@ -334,18 +371,19 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	WARN_ON(size & ~PAGE_MASK);
 	WARN_ON(stage2_apply_range(mmu, start, end, KVM_PGT_FN(kvm_pgtable_stage2_unmap),
-				   may_block));
+				   may_block, event));
 }
 
 void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
-			    u64 size, bool may_block)
+			    u64 size, bool may_block, enum kvm_mmu_notifier_event event)
 {
-	__unmap_stage2_range(mmu, start, size, may_block);
+	__unmap_stage2_range(mmu, start, size, may_block, event);
 }
 
 void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
 {
-	stage2_apply_range_resched(mmu, addr, end, KVM_PGT_FN(kvm_pgtable_stage2_flush));
+	stage2_apply_range_resched(mmu, addr, end, KVM_PGT_FN(kvm_pgtable_stage2_flush),
+				   KVM_MMU_NOTIFY_CMO);
 }
 
 static void stage2_flush_memslot(struct kvm *kvm,
@@ -1028,7 +1066,8 @@ static void stage2_unmap_memslot(struct kvm *kvm,
 
 		if (!(vma->vm_flags & VM_PFNMAP)) {
 			gpa_t gpa = addr + (vm_start - memslot->userspace_addr);
-			kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start, true);
+			kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start, true,
+					       KVM_MMU_NOTIFY_MEMSLOT);
 		}
 		hva = vm_end;
 	} while (hva < reg_end);
@@ -1187,7 +1226,8 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
  */
 void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
 {
-	stage2_apply_range_resched(mmu, addr, end, KVM_PGT_FN(kvm_pgtable_stage2_wrprotect));
+	stage2_apply_range_resched(mmu, addr, end, KVM_PGT_FN(kvm_pgtable_stage2_wrprotect),
+				   KVM_MMU_NOTIFY_WP);
 }
 
 /**
@@ -2100,22 +2140,63 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 
 	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
 			     (range->end - range->start) << PAGE_SHIFT,
-			     range->may_block);
+			     range->may_block, range->event);
 
 	kvm_nested_s2_unmap(kvm, range->may_block);
 	return false;
 }
 
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+static bool kvm_test_age_range(struct kvm *kvm, struct kvm_gfn_range *range,
+			       bool mkold)
 {
-	u64 size = (range->end - range->start) << PAGE_SHIFT;
+	phys_addr_t range_start = range->start << PAGE_SHIFT;
+	phys_addr_t range_end = range->end << PAGE_SHIFT;
+	enum kvm_mmu_notifier_event event = range->event;
+	phys_addr_t start, end;
+	bool was_young = false;
+
+	if (!kvm_has_spe(kvm)) {
+		return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
+							   range->start,
+							   range->end - range->start,
+							   mkold);
+	}
 
+       /* Prime the first iteration */
+       start = end = range_start;
+       do {
+	       start = kvm_spe_adjust_range_start(kvm, start, range_end, event);
+	       /*
+		* 'start' is initialised to 'end' at the beginning of each
+		* iteration.  They can only be different because
+		* kvm_spe_adjust_range_start() detected at least one page in use
+		* for SPE.
+		*/
+	       if (start != end)
+		       was_young = true;
+	       if (start == range_end)
+		       break;
+
+               end = kvm_spe_adjust_range_end(kvm, start, range_end, event);
+               if (end != range_end)
+                      was_young = true;
+
+               was_young |= KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
+									    start, end - start,
+									    mkold);
+	       start = end;
+       } while (end != range_end);
+
+       return was_young;
+}
+
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
 	if (!kvm->arch.mmu.pgt)
 		return false;
 
-	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, true);
+	return kvm_test_age_range(kvm, range, true);
+
 	/*
 	 * TODO: Handle nested_mmu structures here using the reverse mapping in
 	 * a later version of patch series.
@@ -2124,14 +2205,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	u64 size = (range->end - range->start) << PAGE_SHIFT;
-
 	if (!kvm->arch.mmu.pgt)
 		return false;
 
-	return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
-						   range->start << PAGE_SHIFT,
-						   size, false);
+	return kvm_test_age_range(kvm, range, false);
 }
 
 phys_addr_t kvm_mmu_get_httbr(void)
@@ -2386,7 +2463,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 	phys_addr_t size = slot->npages << PAGE_SHIFT;
 
 	write_lock(&kvm->mmu_lock);
-	kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
+	kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true, KVM_MMU_NOTIFY_MEMSLOT);
 	kvm_nested_s2_unmap(kvm, true);
 	write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 92e94bb96bcc..73e09dcef3ca 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1076,8 +1076,10 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 	for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
 		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
 
-		if (kvm_s2_mmu_valid(mmu))
-			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+		if (kvm_s2_mmu_valid(mmu)) {
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block,
+					       KVM_MMU_NOTIFY_SHADOW_S2);
+		}
 	}
 
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
@@ -1787,7 +1789,8 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
-			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true,
+					       KVM_MMU_NOTIFY_SHADOW_S2);
 			mmu->pending_unmap = false;
 		}
 		write_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 35848e4ff68b..f80ef8cdb1d8 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -30,11 +30,13 @@ struct pinned_page {
 	DECLARE_BITMAP(vcpus, KVM_MAX_VCPUS);	/* The page is pinned on these VCPUs */
 	struct page *page;
 	gfn_t gfn;
+	bool unmap_after_unpin;			/* Unmap the page after the buffer is unpinned */
 	bool writable;				/* Is the page mapped as writable? */
 };
 
 static u64 max_buffer_size_to_pmbidr_el1(u64 size);
 static void kvm_spe_update_irq_level(struct kvm_vcpu *vcpu, bool level);
+static void kvm_spe_unpin_page(struct kvm *kvm, struct pinned_page *pinned_page);
 static void kvm_spe_unpin_buffer(struct kvm_vcpu *vcpu);
 
 static u64 pmblimitr_el1_res0_mask = GENMASK_ULL(11, 8) | GENMASK_ULL(6, 3);
@@ -146,6 +148,172 @@ void kvm_spe_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvm_spe_unpin_buffer(vcpu);
 }
 
+bool kvm_spe_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+	return kvm->arch.kvm_spe.dirtying_pages;
+}
+
+static bool kvm_spe_allow_stage2_change(enum kvm_mmu_notifier_event event)
+{
+	switch (event) {
+	/* Host table entry will be reverted because the page is pinned. */
+	case KVM_MMU_NOTIFY_CLEAR:
+	/*
+	 * MMU_NOTIFY_PROTECTION_VMA is generated for the mprotect() call, but
+	 * also for benign reasons, like automatic NUMA balancing. In the latter
+	 * case, the changes to the host's stage 1 will be reverted when it is
+	 * observed that the page is pinned.
+	 *
+	 * In the mprotect() case, it is userspace that is explicitely changing
+	 * the protection for the VMA. Because KVM cannot distinguish between
+	 * mprotect() and the other cases, the buffer pages will be marked for
+	 * unmapping from the host's stage 1 when the guest disables the buffer.
+	 */
+	case KVM_MMU_NOTIFY_PROTECTION_VMA:
+	/* Don't allow buffer pages to be made read-only at stage 2. */
+	case KVM_MMU_NOTIFY_SOFT_DIRTY:
+	/* Host page migration will fail because the page is pinned. */
+	case KVM_MMU_NOTIFY_MIGRATE:
+	/*
+	 * SPE can write to the buffer at any time, treat the pinned pages as
+	 * young.
+	 */
+	case KVM_MMU_NOTIFY_AGE:
+	/*
+	 * This event is generated when a memslot is marked for dirty page
+	 * logging. The buffer pages will be kept mapped at stage 2 and they
+	 * will be immediately marked as dirty because KVM, without SPE
+	 * reporting a fault, has no means of detecting when a record is written
+	 * to memory.
+	 */
+	case KVM_MMU_NOTIFY_WP:
+	/*
+	 * All buffer pages are mapped with PAGE_SIZE granularity at stage 2,
+	 * it's safe to skip them.
+	 */
+	case KVM_MMU_NOTIFY_SPLIT_HUGE_PAGE:
+		return false;
+
+	/* Userspace munmap'ed the VMA. */
+	case KVM_MMU_NOTIFY_UNMAP:
+	/*
+	 * pin_user_pages() does not return a PFN without an associated struct
+	 * page, so the event shouldn't apply to a buffer page. Be conservative
+	 * and allow the stage 2 changes.
+	 */
+	case KVM_MMU_NOTIFY_PROTECTION_PAGE:
+	/*
+	 * KVM doesn't propagate this event to the architecture code because the
+	 * MMU notifier is unregistered when the VM is being destroyed and no
+	 * VCPUs should be running. Also, after the notifier is released, the
+	 * stage 2 will be destroyed. It makes little difference if we allow or
+	 * don't allow the buffer to be unmapped here, but put the event in the
+	 * allow group anyway in case anything changes.
+	 *
+	 * The buffer for each VCPU will be unpinned in the next stage of the VM
+	 * cleanup process, when the VCPUs are destroyed.
+	 */
+	case KVM_MMU_NOTIFY_RELEASE:
+	/* Same as KVM_MMU_NOTIFY_PROTECTION_PAGE. */
+	case KVM_MMU_NOTIFY_EXCLUSIVE:
+	/* x86-specific, but be conservative. */
+	case KVM_MMU_NOTIFY_MEMORY_ATTRIBUTES:
+	/* Userspace is changing a memslot while the buffer is enabled. */
+	case KVM_MMU_NOTIFY_MEMSLOT:
+	/* CMOs don't change stage 2 entries. */
+	case KVM_MMU_NOTIFY_CMO:
+	/* SPE is not yet compatible with nested virt, but be conservative. */
+	case KVM_MMU_NOTIFY_SHADOW_S2:
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	return true;
+}
+
+phys_addr_t kvm_spe_adjust_range_start(struct kvm *kvm, phys_addr_t start, phys_addr_t end,
+				       enum kvm_mmu_notifier_event event)
+{
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+	struct xarray *pinned_pages = &kvm_spe->pinned_pages;
+	struct pinned_page *pinned_page;
+	kvm_pfn_t gfn;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	if (kvm_spe_allow_stage2_change(event))
+		return start;
+
+	xa_lock(pinned_pages);
+	for (gfn = PHYS_PFN(start); gfn < PHYS_PFN(end); gfn++) {
+		pinned_page = xa_load(pinned_pages, gfn);
+		if (!pinned_page)
+			break;
+
+		pinned_page->unmap_after_unpin = true;
+		if (event == KVM_MMU_NOTIFY_WP && pinned_page->writable) {
+			kvm_spe->dirtying_pages = true;
+			mark_page_dirty(kvm, gfn);
+			kvm_spe->dirtying_pages = false;
+		}
+	}
+	xa_unlock(pinned_pages);
+
+	return PFN_PHYS(gfn);
+}
+
+/*
+ * Ignores pinned_page->unmap_after_unpin() because it is called only from the
+ * MMU notifiers, before changes are allowed to be made to stage 2.
+ */
+static void kvm_spe_unpin_page_range(struct kvm *kvm, phys_addr_t start, phys_addr_t end)
+{
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	struct pinned_page *pinned_page;
+	kvm_pfn_t gfn;
+
+	xa_lock(pinned_pages);
+	for (gfn = PHYS_PFN(start); gfn < PHYS_PFN(end); gfn++) {
+		pinned_page = xa_load(pinned_pages, gfn);
+		if (!pinned_page)
+			continue;
+
+		kvm_spe_unpin_page(kvm, pinned_page);
+		kfree(pinned_page);
+	}
+	xa_unlock(pinned_pages);
+}
+
+phys_addr_t kvm_spe_adjust_range_end(struct kvm *kvm, phys_addr_t start, phys_addr_t end,
+				     enum kvm_mmu_notifier_event event)
+{
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	kvm_pfn_t gfn;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	if (kvm_spe_allow_stage2_change(event)) {
+		if (event != KVM_MMU_NOTIFY_CMO)
+			kvm_spe_unpin_page_range(kvm, start, end);
+		return end;
+	}
+
+	xa_lock(pinned_pages);
+	/*
+	 * We know that @start is not a buffer page. Stop at the first buffer
+	 * page in the range [@start + PAGE_SIZE, @end) - this page will be
+	 * handled in the following call to kvm_spe_adjust_range_start().
+	 */
+	for (gfn = PHYS_PFN(start + PAGE_SIZE); gfn < PHYS_PFN(end); gfn++) {
+		if (xa_load(pinned_pages, gfn))
+			break;
+	}
+	xa_unlock(pinned_pages);
+
+	return PFN_PHYS(gfn);
+}
+
 u8 kvm_spe_get_pmsver_limit(void)
 {
 	unsigned int pmsver;
@@ -231,29 +399,78 @@ static void kvm_spe_inject_data_abort(struct kvm_vcpu *vcpu, u8 fst, bool s2)
 	kvm_spe_update_irq_level(vcpu, true);
 }
 
+static void kvm_spe_unpin_page(struct kvm *kvm, struct pinned_page *pinned_page)
+{
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+
+	__xa_erase(pinned_pages, pinned_page->gfn);
+	unpin_user_pages_dirty_lock(&pinned_page->page, 1, pinned_page->writable);
+}
+
 static void kvm_spe_unpin_buffer(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
 	struct pinned_page *pinned_page;
-	unsigned long gfn;
+	int unmap_count, unmap_resched;
+	bool write_locked = false;
+	struct kvm_pgtable *pgt;
 	int idx;
 
+	XA_STATE(xas, pinned_pages, 0);
+
+	might_sleep();
+
+	/* Copy what stage2_apply_range() does */
+	unmap_resched = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL) >> PAGE_SHIFT;
+	unmap_count = 0;
+
 	idx = srcu_read_lock(&kvm->srcu);
-	xa_lock(pinned_pages);
+	xas_lock(&xas);
+
+	xas_for_each(&xas, pinned_page, ULONG_MAX) {
+		if (xas_retry(&xas, pinned_page))
+			continue;
 
-	xa_for_each(pinned_pages, gfn, pinned_page) {
 		if (!test_bit(vcpu->vcpu_idx, pinned_page->vcpus))
 			continue;
 
 		clear_bit(vcpu->vcpu_idx, pinned_page->vcpus);
-		if (bitmap_empty(pinned_page->vcpus, KVM_MAX_VCPUS)) {
-			__xa_erase(pinned_pages, pinned_page->gfn);
-			unpin_user_pages_dirty_lock(&pinned_page->page, 1, pinned_page->writable);
+		if (!bitmap_empty(pinned_page->vcpus, KVM_MAX_VCPUS))
+			continue;
+
+		kvm_spe_unpin_page(kvm, pinned_page);
+		if (!pinned_page->unmap_after_unpin)
+			goto free_continue;
+
+		if (!write_locked) {
+			xas_pause(&xas);
+			xas_unlock(&xas);
+			write_lock(&kvm->mmu_lock);
+			xas_lock(&xas);
+			write_locked = true;
+			pgt = vcpu->arch.hw_mmu->pgt;
+		}
+
+		if (!pgt)
+			goto free_continue;
+
+		kvm_pgtable_stage2_unmap(pgt, PFN_PHYS(pinned_page->gfn), PAGE_SIZE);
+		unmap_count++;
+		if (unmap_count == unmap_resched) {
+			xas_pause(&xas);
+			xas_unlock(&xas);
+			cond_resched_rwlock_write(&kvm->mmu_lock);
+			xas_lock(&xas);
+			unmap_count = 0;
 		}
+free_continue:
+		kfree(pinned_page);
 	}
 
-	xa_unlock(pinned_pages);
+	xas_unlock(&xas);
+	if (write_locked)
+		write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -1314,6 +1531,7 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 			return -ENXIO;
 
 		vcpu_spe->initialized = true;
+		set_bit(KVM_ARCH_FLAG_SPE_ENABLED, &kvm->arch.flags);
 		return 0;
 	}
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index db86d1dcd148..e8fd1688abba 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -3947,7 +3947,8 @@ static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
 	 * the L1 needs to put its stage-2 in a consistent state before doing
 	 * the TLBI.
 	 */
-	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true);
+	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true,
+			       KVM_MMU_NOTIFY_SHADOW_S2);
 }
 
 static bool handle_vmalls12e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
@@ -4026,7 +4027,7 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
 	 * See comment in s2_mmu_unmap_range() for why this is allowed to
 	 * reschedule.
 	 */
-	kvm_stage2_unmap_range(mmu, base_addr, max_size, true);
+	kvm_stage2_unmap_range(mmu, base_addr, max_size, true, KVM_MMU_NOTIFY_SHADOW_S2);
 }
 
 static bool handle_ipas2e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index ce3e3ed3f29f..fb36f1b4fdae 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -2706,7 +2706,7 @@ static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its *its, u64 attr)
 }
 
 /*
- * kvm_arch_allow_write_without_running_vcpu - allow writing guest memory
+ * kvm_vgic_allow_write_without_running_vcpu - allow writing guest memory
  * without the running VCPU when dirty ring is enabled.
  *
  * The running VCPU is required to track dirty guest pages when dirty ring
@@ -2715,7 +2715,7 @@ static int vgic_its_ctrl(struct kvm *kvm, struct vgic_its *its, u64 attr)
  * bitmap is used to track the dirty guest pages due to the missed running
  * VCPU in the period.
  */
-bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+bool kvm_vgic_allow_write_without_running_vcpu(struct kvm *kvm)
 {
 	struct vgic_dist *dist = &kvm->arch.vgic;
 
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 7a0b972eb1b1..4c0f4f80e8ef 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -446,6 +446,8 @@ int kvm_vgic_v4_set_forwarding(struct kvm *kvm, int irq,
 
 void kvm_vgic_v4_unset_forwarding(struct kvm *kvm, int host_irq);
 
+bool kvm_vgic_allow_write_without_running_vcpu(struct kvm *kvm);
+
 int vgic_v4_load(struct kvm_vcpu *vcpu);
 void vgic_v4_commit(struct kvm_vcpu *vcpu);
 int vgic_v4_put(struct kvm_vcpu *vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 772e75d13af1..273ee3339468 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -274,6 +274,8 @@ enum kvm_mmu_notifier_event {
 	KVM_MMU_NOTIFY_EXCLUSIVE        = MMU_NOTIFY_EXCLUSIVE,
 	KVM_MMU_NOTIFY_AGE		= 32,
 	KVM_MMU_NOTIFY_MEMORY_ATTRIBUTES,
+	KVM_MMU_NOTIFY_MEMSLOT,
+	KVM_MMU_NOTIFY_WP,
 	KVM_MMU_NOTIFY_ARCH1,
 	KVM_MMU_NOTIFY_ARCH2,
 	KVM_MMU_NOTIFY_ARCH3,
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 32/35] KVM: Add KVM_EXIT_RLIMIT exit_reason
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (30 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 31/35] KVM: arm64: Handle MMU notifiers for the SPE buffer Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 33/35] KVM: arm64: Implement locked memory accounting for the SPE buffer Alexandru Elisei
                   ` (3 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Arm CPUs can optionally implement a feature called Statistical Profiling
Extension (SPE). When this feature is in use, a record is created at
certain intervals, with information about the operation that the CPU was
executing when the record was created. This record is then written to a
buffer in memory.

The buffer where records are written is defined by virtual addresses (a
base and a limit). The translation from a buffer virtual address to a
physical address is performed using the CPU's translation tables. If the
Statistical Profiling Unit (SPU) encounters a fault on the CPU's stage 2
during the translation process, profiling stops and the fault is reported
to the CPU asynchronously, via an **interrupt**, not a (synchronous)
exception, like with CPU MMU faults.

The interrupt is delivered to the CPU asynchronously, and operations
executed by the CPU after the SPU asserts the interrupt and before the
CPU receives the interrupt are not sampled by the SPU. This leads to
different sampling profiles between baremetal and a virtual machine when
the same code is executed.

The solution is to pre-fault the memory representing the buffer and pin
it in the host so it doesn't get unmapped from stage 2. Furthermore, the
host memory representing the guest's translation tables for the buffer
virtual addresses must also be pinned, to avoid faults on translation
table walks. The stage 1 tables that map the buffer are programmed by
the guest, and this makes it impossible for KVM to know beforehand how
many levels and how many pages it needs to pin; KVM has this information
only after walking the guest's stage 1 tables, when the running guest
enables the buffer.

Memory pinned by KVM for the buffer must be subject to the RLIMIT_MEMLOCK
limit.  Add a new KVM_RUN exit code for KVM to let userspace know when the
limit has been exceeded, and by how much. Userspace can then decide if they
want to (or can) further increase the limit.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/api.rst | 13 +++++++++++++
 include/uapi/linux/kvm.h       |  6 ++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 10e0733297ac..2276c4590948 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7318,6 +7318,19 @@ Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.

+::
+
+		/* KVM_EXIT_RLIMIT */
+		struct {
+			__u64 excess;
+			__u8 rlimit_id;
+		} rlimit;
+
+If the exit_reason is KVM_EXIT_RLIMIT, the VCPU has exceeded a system resource
+limit.  The 'rlimit_id' is set to the resource limit ID (see man 2 getrlimit),
+and the 'excess' field is set to the amount by which the limit was exceeded.
+The unit of measurement is the unit of measurement associated with the resource
+limit.

 .. _cap_enable:

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 11e5dbde331b..f27679266197 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -179,6 +179,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
 #define KVM_EXIT_TDX              40
+#define KVM_EXIT_RLIMIT           41

 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -473,6 +474,11 @@ struct kvm_run {
 				} setup_event_notify;
 			};
 		} tdx;
+		/* KVM_EXIT_RLIMIT */
+		struct {
+			__u64 excess;
+			__u8 rlimit_id;
+		} rlimit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.51.2

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 33/35] KVM: arm64: Implement locked memory accounting for the SPE buffer
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (31 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 32/35] KVM: Add KVM_EXIT_RLIMIT exit_reason Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 34/35] KVM: arm64: Add hugetlb support for SPE Alexandru Elisei
                   ` (2 subsequent siblings)
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Teach KVM to count the memory pinned for the SPE buffer towards the
process' RLIMIT_MEMLOCK. It is up to userspace to make sure RLIMIT_MEMLOCK
is large enough to accommodate this memory.

The pinned memory is tracked in two places: when the maximum buffer size
for a VCPU is set by userspace during the configuration phase of a virtual
machine - in which case the memory accounted for is the maximum size, and
when the SPE buffer is enabled by the guest.

Doing locked memory accounting when the VCPU is running is necessary
because the memory that KVM pins for a buffer can exceed the maximum buffer
size set by userspace. This happens because KVM must also pin the
translation tables for the buffer.

KVM keeps track of the historical maximum for locked memory and the current
amount of pinned memory. The historical maximum is reflected in the VmLck
status field for the process and KVM will never decrease it, except when
the VM is being destroyed.

If the RLIMIT_MEMLOCK limit is exceeded when userspace sets the maximum
buffer size, the ioctl KVM_ARM_VCPU_SPE_CTRL(KVM_AR_VCPU_MAX_BUFFER_SIZE)
returns to userspace with the error -ENOMEM.

If the limit is exceeded when KVM attempts to pin the buffer, KVM_RUN will
return to userspace with the return value 0, run->exit_reason set to
KVM_EXIT_RLIMIT, and run->rlimit populated accordingly.

The expectation in both cases is that userspace will increase
RLIMIT_MEMLOCK, and the ioctl that failed will be retried.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  14 +++-
 arch/arm64/include/asm/kvm_host.h       |   1 +
 arch/arm64/include/asm/kvm_spe.h        |   8 ++
 arch/arm64/kvm/arm.c                    |   5 ++
 arch/arm64/kvm/spe.c                    | 106 +++++++++++++++++++++++-
 5 files changed, 131 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 29dd1f087d4a..b02ff6d6a9d2 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -365,6 +365,7 @@ have the specified SPU.
 	 -EFAULT  Error accessing the max buffer size identifier
 	 -EINVAL  A different maximum buffer size already set or the size is
                   not aligned to the host's page size
+	 -ENOMEM  RLIMIT_MEMLOCK exceeded
 	 -ENXIO   SPE not supported or not properly configured
 	 -ENODEV  KVM_ARM_VCPU_HAS_SPE VCPU feature or SPU instance not set
 	 -ERANGE  Buffer size larger than maximum supported by the SPU
@@ -397,8 +398,17 @@ slightly larger that the maximum buffer set with this ioctl.
 
 This memory that is pinned will count towards the process RLIMIT_MEMLOCK. To
 avoid the limit being exceeded, userspace must increase the RLIMIT_MEMLOCK limit
-prior to running the VCPU, otherwise KVM_RUN will return to userspace with an
-error.
+prior to running the VCPU. If the limit is exceeded when KVM pins the buffer,
+KVM_RUN will return to userspace with exit_reason set to KVM_EXIT_RLIMIT and
+struct run->rlimit populated: 'rlimit_id' set to RLIMIT_MEMLOCK and 'excess'
+equal to the amount of memory over RLIMIT_MEMLOCK.  Userspace then must increase
+RLIMIT_MEMLOCK by at least 'excess' amount and resume the VCPU. Userspace can
+increase RLIMIT_MEMLOCK with more than the 'excess' amount, to avoid repeated
+exits.
+
+Note that the process status field VmLck includes the historical maximum, not
+the amount of memory that is current consumed by KVM for pinning the SPE
+buffers, if any.
 
 5.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
 -----------------------------------
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index e79ec480d1d1..b730401717b5 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -55,6 +55,7 @@
 #define KVM_REQ_NESTED_S2_UNMAP		KVM_ARCH_REQ(8)
 #define KVM_REQ_GUEST_HYP_IRQ_PENDING	KVM_ARCH_REQ(9)
 #define KVM_REQ_MAP_L1_VNCR_EL2		KVM_ARCH_REQ(10)
+#define KVM_REQ_SPE_MEMLOCK		KVM_ARCH_REQ(11)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 59a0e825a226..7dcf03980019 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -17,11 +17,14 @@ struct kvm_spe {
 	struct arm_spe_pmu *arm_spu;
 	u64 max_buffer_size;	/* Maximum per VCPU buffer size */
 	u64 guest_pmscr_el2;
+	u64 locked_mem_watermark;
+	u64 locked_mem;
 	bool dirtying_pages;
 };
 
 struct kvm_vcpu_spe {
 	u64 hw_pmbsr_el1;	/* Updated on hardware management event */
+	u64 locked_mem_excess;
 	u64 host_pmscr_el2;	/* Host PMSCR_EL2 register, context switched. */
 	int irq_num;		/* Buffer management interrupt number */
 	bool initialized;	/* SPE initialized for the VCPU */
@@ -63,6 +66,8 @@ phys_addr_t kvm_spe_adjust_range_end(struct kvm *kvm, phys_addr_t start, phys_ad
 
 u8 kvm_spe_get_pmsver_limit(void);
 
+void kvm_spe_handle_req_memlock(struct kvm_vcpu *vcpu);
+
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -109,6 +114,9 @@ static inline u8 kvm_spe_get_pmsver_limit(void)
 {
 	return 0;
 }
+static inline void kvm_spe_handle_req_memlock(struct kvm_vcpu *vcpu)
+{
+}
 static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	return -ENXIO;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d05dbb6d2d7a..039401c2d0b4 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1097,6 +1097,11 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 		if (kvm_dirty_ring_check_request(vcpu))
 			return 0;
 
+		if (kvm_check_request(KVM_REQ_SPE_MEMLOCK, vcpu)) {
+			kvm_spe_handle_req_memlock(vcpu);
+			return 0;
+		}
+
 		check_nested_vcpu_requests(vcpu);
 	}
 
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index f80ef8cdb1d8..2e2b97c3b861 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -84,7 +84,16 @@ void kvm_spe_init_vm(struct kvm *kvm)
 
 void kvm_spe_destroy_vm(struct kvm *kvm)
 {
-	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+	struct xarray *pinned_pages = &kvm_spe->pinned_pages;
+
+	/*
+	 * All VCPUs destroyed, the MMU notifiers unregistered - locking not
+	 * needed.
+	 */
+	WARN_ON_ONCE(kvm_spe->locked_mem);
+	account_locked_vm(current->mm, PHYS_PFN(kvm_spe->locked_mem_watermark), false);
+	kvm_spe->locked_mem_watermark = 0;
 
 	WARN_ON_ONCE(!xa_empty(pinned_pages));
 	xa_destroy(pinned_pages);
@@ -399,10 +408,22 @@ static void kvm_spe_inject_data_abort(struct kvm_vcpu *vcpu, u8 fst, bool s2)
 	kvm_spe_update_irq_level(vcpu, true);
 }
 
+static void kvm_spe_remove_locked_mem(struct kvm *kvm, unsigned long size)
+{
+	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+
+	lockdep_assert_held(&kvm_spe->pinned_pages.xa_lock);
+
+	WARN_ON_ONCE(kvm_spe->locked_mem < size);
+	kvm_spe->locked_mem -= size;
+}
+
 static void kvm_spe_unpin_page(struct kvm *kvm, struct pinned_page *pinned_page)
 {
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
 
+	kvm_spe_remove_locked_mem(kvm, PAGE_SIZE);
+
 	__xa_erase(pinned_pages, pinned_page->gfn);
 	unpin_user_pages_dirty_lock(&pinned_page->page, 1, pinned_page->writable);
 }
@@ -474,6 +495,49 @@ static void kvm_spe_unpin_buffer(struct kvm_vcpu *vcpu)
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+static int kvm_spe_account_locked_mem(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+	struct kvm_spe *kvm_spe = &vcpu->kvm->arch.kvm_spe;
+	struct xarray *pinned_pages = &kvm_spe->pinned_pages;
+	u64 excess = vcpu_spe->locked_mem_excess;
+	int ret;
+
+	if (!excess)
+		return 0;
+
+	ret = account_locked_vm(current->mm, PHYS_PFN(excess), true);
+	if (ret)
+		return ret;
+
+	xa_lock(pinned_pages);
+	kvm_spe->locked_mem_watermark += excess;
+	vcpu_spe->locked_mem_excess = 0;
+	xa_unlock(pinned_pages);
+
+	return 0;
+}
+
+static void kvm_spe_add_locked_mem(struct kvm_vcpu *vcpu, unsigned long size)
+{
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+	struct kvm_spe *kvm_spe = &vcpu->kvm->arch.kvm_spe;
+	struct xarray *pinned_pages = &kvm_spe->pinned_pages;
+
+	lockdep_assert_held(&pinned_pages->xa_lock);
+
+	/* Another VCPU is already over the watermark. */
+	if (kvm_spe->locked_mem >= kvm_spe->locked_mem_watermark) {
+		kvm_spe->locked_mem += size;
+		vcpu_spe->locked_mem_excess = size;
+		return;
+	}
+
+	kvm_spe->locked_mem += size;
+	if (kvm_spe->locked_mem > kvm_spe->locked_mem_watermark)
+		vcpu_spe->locked_mem_excess = kvm_spe->locked_mem - kvm_spe->locked_mem_watermark;
+}
+
 #define MAP_GPA_RET_NOTIFIER_RETRY	1
 #define MAP_GPA_RET_PAGE_EXIST		2
 
@@ -615,6 +679,7 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 			goto pages_unlock;
 		}
 
+		kvm_spe_add_locked_mem(vcpu, PAGE_SIZE);
 		ret = 0;
 	}
 
@@ -721,6 +786,8 @@ static bool kvm_spe_test_gpa_pinned(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_
 	return false;
 }
 
+#define PIN_GPA_RET_MEMLOCK	1
+
 static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -837,6 +904,14 @@ static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 
 	switch (ret) {
 	case 0:
+		if (kvm_spe_account_locked_mem(vcpu)) {
+			/*
+			 * Do not go through the error handling path, the page
+			 * is at this point stored in pinned_pages and it will
+			 * be properly removed when the buffer is unpinned.
+			 */
+			return PIN_GPA_RET_MEMLOCK;
+		}
 		break;
 	case MAP_GPA_RET_PAGE_EXIST:
 		kfree(pinned_page);
@@ -930,6 +1005,10 @@ static bool kvm_spe_pin_buffer(struct kvm_vcpu *vcpu, u64 ptr, u64 limit)
 		return true;
 
 	switch (ret) {
+	case PIN_GPA_RET_MEMLOCK:
+		kvm_make_request(KVM_REQ_SPE_MEMLOCK, vcpu);
+		commit_write = false;
+		break;
 	case -EAGAIN:
 		commit_write = false;
 		break;
@@ -1326,6 +1405,21 @@ void kvm_vcpu_spe_put(struct kvm_vcpu *vcpu)
 		isb();
 }
 
+static void kvm_spe_set_exit_rlimit(struct kvm_run *run, u64 excess)
+{
+	run->exit_reason = KVM_EXIT_RLIMIT;
+	run->rlimit.excess = excess;
+	run->rlimit.rlimit_id = RLIMIT_MEMLOCK;
+}
+
+void kvm_spe_handle_req_memlock(struct kvm_vcpu *vcpu)
+{
+	struct kvm_vcpu_spe *vcpu_spe = &vcpu->arch.vcpu_spe;
+
+	kvm_spe_set_exit_rlimit(vcpu->run, vcpu_spe->locked_mem_excess);
+	vcpu_spe->locked_mem_excess = 0;
+}
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
 {
 	u64 msb_idx, num_bits;
@@ -1379,7 +1473,9 @@ static int kvm_spe_set_max_buffer_size(struct kvm_vcpu *vcpu, u64 size)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
+	struct xarray *pinned_pages = &kvm_spe->pinned_pages;
 	u64 decoded_size, spu_size;
+	int ret;
 
 	if (kvm_vm_has_ran_once(kvm))
 		return -EBUSY;
@@ -1401,6 +1497,14 @@ static int kvm_spe_set_max_buffer_size(struct kvm_vcpu *vcpu, u64 size)
 	if (spu_size != 0 && (size == 0 || size > spu_size))
 		return -ERANGE;
 
+	ret = account_locked_vm(current->mm, PHYS_PFN(size), true);
+	if (ret)
+		return -ENOMEM;
+
+	xa_lock(pinned_pages);
+	kvm_spe->locked_mem_watermark += size;
+	xa_unlock(pinned_pages);
+
 	kvm_spe->max_buffer_size = size;
 
 	return 0;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 34/35] KVM: arm64: Add hugetlb support for SPE
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (32 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 33/35] KVM: arm64: Implement locked memory accounting for the SPE buffer Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-11-14 16:07 ` [RFC PATCH v6 35/35] KVM: arm64: Allow the creation of a SPE enabled VM Alexandru Elisei
  2025-12-11 16:34 ` [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Leo Yan
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Hugetlb pages are different from transparent huge pages, as they cannot be
split by GUP with the FOLL_SPLIT_PMD flag.

Mapping hugetlb pages at stage 2 with a page mapping would be a mistake:
CPU stage 2 faults will make KVM map the entire hugetlb page with a block
mapping at stage 2. This process requires a break-before-make sequence, and
the SPU will trigger a stage 2 fault if it attempts to write a record to
memory during the break part of the sequence.

Map hugetlb pages with a block mapping at stage 2, to make sure this won't
happen.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_mmu.h |   3 +
 arch/arm64/include/asm/kvm_spe.h |   6 +
 arch/arm64/kvm/mmu.c             |  23 ++-
 arch/arm64/kvm/spe.c             | 292 ++++++++++++++++++++++++-------
 4 files changed, 250 insertions(+), 74 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index a4a0e00d1bbb..4d57c6d62f4a 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -191,6 +191,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu);
 bool kvm_vma_mte_allowed(struct vm_area_struct *vma);
 void kvm_sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, unsigned long size);
 
+bool kvm_stage2_supports_map_size(struct kvm_memory_slot *memslot,
+				  unsigned long hva, unsigned long map_size);
+
 phys_addr_t kvm_mmu_get_httbr(void);
 phys_addr_t kvm_get_idmap_vector(void);
 int __init kvm_mmu_init(u32 *hyp_va_bits);
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 7dcf03980019..a22764719ecc 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -68,6 +68,8 @@ u8 kvm_spe_get_pmsver_limit(void);
 
 void kvm_spe_handle_req_memlock(struct kvm_vcpu *vcpu);
 
+bool kvm_spe_gfn_is_pinned(struct kvm_vcpu *vcpu, gfn_t gfn);
+
 int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
 int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -117,6 +119,10 @@ static inline u8 kvm_spe_get_pmsver_limit(void)
 static inline void kvm_spe_handle_req_memlock(struct kvm_vcpu *vcpu)
 {
 }
+static inline bool kvm_spe_gfn_is_pinned(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	return false;
+}
 static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
 {
 	return -ENXIO;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index de48fb7c0fff..cc2993f10269 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1332,9 +1332,8 @@ static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
 	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
 }
 
-static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
-					       unsigned long hva,
-					       unsigned long map_size)
+bool kvm_stage2_supports_map_size(struct kvm_memory_slot *memslot,
+				  unsigned long hva, unsigned long map_size)
 {
 	gpa_t gpa_start;
 	hva_t uaddr_start, uaddr_end;
@@ -1417,7 +1416,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	 * sure that the HVA and IPA are sufficiently aligned and that the
 	 * block map is contained within the memslot.
 	 */
-	if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
+	if (kvm_stage2_supports_map_size(memslot, hva, PMD_SIZE)) {
 		int sz = get_user_mapping_size(kvm, hva);
 
 		if (sz < 0)
@@ -1664,6 +1663,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct page *page;
 	vm_flags_t vm_flags;
 	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
+	bool is_vma_hugetlbfs;
 
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
@@ -1694,6 +1694,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
+	is_vma_hugetlbfs = is_vm_hugetlb_page(vma);
+
 	if (force_pte)
 		vma_shift = PAGE_SHIFT;
 	else
@@ -1702,7 +1704,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
 	case PUD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PUD_SIZE))
+		if (kvm_stage2_supports_map_size(memslot, hva, PUD_SIZE))
 			break;
 		fallthrough;
 #endif
@@ -1710,7 +1712,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		vma_shift = PMD_SHIFT;
 		fallthrough;
 	case PMD_SHIFT:
-		if (fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE))
+		if (kvm_stage2_supports_map_size(memslot, hva, PMD_SIZE))
 			break;
 		fallthrough;
 	case CONT_PTE_SHIFT:
@@ -1853,6 +1855,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		goto out_unlock;
 	}
 
+	if (vcpu_has_spe(vcpu) && logging_active && is_vma_hugetlbfs) {
+		gfn_t pmd_gfn = PHYS_PFN(fault_ipa & PMD_MASK);
+
+		if (kvm_spe_gfn_is_pinned(vcpu, pmd_gfn)) {
+			/* SPE won the race, don't break the block mapping. */
+			goto out_unlock;
+		}
+	}
+
 	/*
 	 * If we are not forced to use page mapping, check if we are
 	 * backed by a THP and thus use block mapping if possible.
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 2e2b97c3b861..81e07bc08ba6 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -6,6 +6,7 @@
 #include <linux/bitops.h>
 #include <linux/capability.h>
 #include <linux/cpumask.h>
+#include <linux/hugetlb.h>
 #include <linux/kvm_host.h>
 #include <linux/perf/arm_spe_pmu.h>
 #include <linux/swap.h>
@@ -30,6 +31,7 @@ struct pinned_page {
 	DECLARE_BITMAP(vcpus, KVM_MAX_VCPUS);	/* The page is pinned on these VCPUs */
 	struct page *page;
 	gfn_t gfn;
+	unsigned long s2_map_size;
 	bool unmap_after_unpin;			/* Unmap the page after the buffer is unpinned */
 	bool writable;				/* Is the page mapped as writable? */
 };
@@ -196,10 +198,7 @@ static bool kvm_spe_allow_stage2_change(enum kvm_mmu_notifier_event event)
 	 * to memory.
 	 */
 	case KVM_MMU_NOTIFY_WP:
-	/*
-	 * All buffer pages are mapped with PAGE_SIZE granularity at stage 2,
-	 * it's safe to skip them.
-	 */
+	/* Buffer pages will be unmapped after they are unpinned. */
 	case KVM_MMU_NOTIFY_SPLIT_HUGE_PAGE:
 		return false;
 
@@ -246,8 +245,8 @@ phys_addr_t kvm_spe_adjust_range_start(struct kvm *kvm, phys_addr_t start, phys_
 {
 	struct kvm_spe *kvm_spe = &kvm->arch.kvm_spe;
 	struct xarray *pinned_pages = &kvm_spe->pinned_pages;
+	kvm_pfn_t limit_gfn, gfn = PHYS_PFN(start);
 	struct pinned_page *pinned_page;
-	kvm_pfn_t gfn;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
@@ -255,21 +254,55 @@ phys_addr_t kvm_spe_adjust_range_start(struct kvm *kvm, phys_addr_t start, phys_
 		return start;
 
 	xa_lock(pinned_pages);
-	for (gfn = PHYS_PFN(start); gfn < PHYS_PFN(end); gfn++) {
+	while (gfn < PHYS_PFN(end)) {
 		pinned_page = xa_load(pinned_pages, gfn);
 		if (!pinned_page)
 			break;
 
 		pinned_page->unmap_after_unpin = true;
-		if (event == KVM_MMU_NOTIFY_WP && pinned_page->writable) {
+
+		if (event == KVM_MMU_NOTIFY_WP) {
+			if (!pinned_page->writable && pinned_page->s2_map_size == PAGE_SIZE)
+				goto next_gfn;
+
+                       /*
+			* Stage 2 block mappings are special, because, while
+			* dirty page locking is enabled, any fault will break
+			* the block mapping. Map it with all permissions to
+			* avoid the fault.
+                        */
+			if (pinned_page->s2_map_size > PAGE_SIZE) {
+				phys_addr_t gpa = PFN_PHYS(pinned_page->gfn);
+				struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
+				enum kvm_pgtable_walk_flags flags;
+				enum kvm_pgtable_prot prot;
+				int ret;
+
+				prot = KVM_PGTABLE_PROT_X |
+				       KVM_PGTABLE_PROT_R |
+				       KVM_PGTABLE_PROT_W;
+				flags = KVM_PGTABLE_WALK_HANDLE_FAULT;
+				ret = kvm_pgtable_stage2_relax_perms(mmu->pgt, gpa, prot, flags);
+				if (WARN_ON_ONCE(ret))
+					goto next_gfn;
+				kvm_call_hyp(__kvm_tlb_flush_vmid_range,
+					     mmu, gpa, PHYS_PFN(pinned_page->s2_map_size));
+			}
+
+			limit_gfn = min(PHYS_PFN(end), gfn + PHYS_PFN(pinned_page->s2_map_size));
 			kvm_spe->dirtying_pages = true;
-			mark_page_dirty(kvm, gfn);
+			for (; gfn < limit_gfn; gfn++)
+				mark_page_dirty(kvm, gfn);
 			kvm_spe->dirtying_pages = false;
+
+			continue;
 		}
+next_gfn:
+		gfn += PHYS_PFN(pinned_page->s2_map_size);
 	}
 	xa_unlock(pinned_pages);
 
-	return PFN_PHYS(gfn);
+	return min_t(phys_addr_t, PFN_PHYS(gfn), end);
 }
 
 /*
@@ -280,16 +313,19 @@ static void kvm_spe_unpin_page_range(struct kvm *kvm, phys_addr_t start, phys_ad
 {
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
 	struct pinned_page *pinned_page;
-	kvm_pfn_t gfn;
+	kvm_pfn_t gfn = PHYS_PFN(start);
 
 	xa_lock(pinned_pages);
-	for (gfn = PHYS_PFN(start); gfn < PHYS_PFN(end); gfn++) {
+	while (gfn < PHYS_PFN(end)) {
 		pinned_page = xa_load(pinned_pages, gfn);
-		if (!pinned_page)
+		if (!pinned_page) {
+			gfn++;
 			continue;
+		}
 
 		kvm_spe_unpin_page(kvm, pinned_page);
 		kfree(pinned_page);
+		gfn += PHYS_PFN(pinned_page->s2_map_size);
 	}
 	xa_unlock(pinned_pages);
 }
@@ -422,7 +458,7 @@ static void kvm_spe_unpin_page(struct kvm *kvm, struct pinned_page *pinned_page)
 {
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
 
-	kvm_spe_remove_locked_mem(kvm, PAGE_SIZE);
+	kvm_spe_remove_locked_mem(kvm, pinned_page->s2_map_size);
 
 	__xa_erase(pinned_pages, pinned_page->gfn);
 	unpin_user_pages_dirty_lock(&pinned_page->page, 1, pinned_page->writable);
@@ -476,7 +512,7 @@ static void kvm_spe_unpin_buffer(struct kvm_vcpu *vcpu)
 		if (!pgt)
 			goto free_continue;
 
-		kvm_pgtable_stage2_unmap(pgt, PFN_PHYS(pinned_page->gfn), PAGE_SIZE);
+		kvm_pgtable_stage2_unmap(pgt, PFN_PHYS(pinned_page->gfn), pinned_page->s2_map_size);
 		unmap_count++;
 		if (unmap_count == unmap_resched) {
 			xas_pause(&xas);
@@ -538,6 +574,30 @@ static void kvm_spe_add_locked_mem(struct kvm_vcpu *vcpu, unsigned long size)
 		vcpu_spe->locked_mem_excess = kvm_spe->locked_mem - kvm_spe->locked_mem_watermark;
 }
 
+static void kvm_spe_unlock_mmu(struct kvm *kvm, bool exclusive_access)
+{
+	if (exclusive_access)
+		write_unlock(&kvm->mmu_lock);
+	else
+		read_unlock(&kvm->mmu_lock);
+}
+
+static bool kvm_spe_lock_mmu(struct kvm *kvm, unsigned long s2_map_size, bool logging_active)
+{
+	bool exclusive_access;
+
+	if (s2_map_size > PAGE_SIZE && logging_active) {
+		/* Prevent concurrent CPU faults breaking the block mapping. */
+		write_lock(&kvm->mmu_lock);
+		exclusive_access = true;
+	} else {
+		read_lock(&kvm->mmu_lock);
+		exclusive_access = false;
+	}
+
+	return exclusive_access;
+}
+
 #define MAP_GPA_RET_NOTIFIER_RETRY	1
 #define MAP_GPA_RET_PAGE_EXIST		2
 
@@ -549,7 +609,7 @@ static void kvm_spe_add_locked_mem(struct kvm_vcpu *vcpu, unsigned long size)
 /* Calls release_faultin_page(), regardless of the return value */
 static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, struct page *page,
 			   bool make_writable, bool mte_allowed, unsigned long mmu_seq,
-			   struct pinned_page *pinned_page)
+			   struct pinned_page *pinned_page, unsigned long s2_map_size)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
@@ -558,21 +618,31 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 	gfn_t gfn = PHYS_PFN(gpa);
 	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
-	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
+	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT;
 	int action = PGTABLE_ACTION_NONE;
+	struct kvm_memory_slot *memslot = gfn_to_memslot(kvm, gfn);
+	bool logging_active = kvm_slot_dirty_track_enabled(memslot);
+	bool exclusive_access;
 	s8 level = S8_MAX;
 	kvm_pte_t pte = 0;
 	int ret;
 
-	read_lock(&kvm->mmu_lock);
+	if (make_writable)
+		prot |= KVM_PGTABLE_PROT_W;
+
+	/* Avoid all faults. */
+	if (s2_map_size > PAGE_SIZE && logging_active)
+		prot |= KVM_PGTABLE_PROT_X;
+
+	exclusive_access = kvm_spe_lock_mmu(kvm, s2_map_size, logging_active);
+	if (!exclusive_access)
+		flags |= KVM_PGTABLE_WALK_SHARED;
+
 	if (mmu_invalidate_retry(kvm, mmu_seq)) {
 		ret = MAP_GPA_RET_NOTIFIER_RETRY;
 		goto mmu_unlock;
 	}
 
-	if (make_writable)
-		prot |= KVM_PGTABLE_PROT_W;
-
 	ret = kvm_pgtable_get_leaf(pgt, gpa, &pte, &level);
 	if (ret)
 		goto mmu_unlock;
@@ -589,7 +659,7 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 		}
 
 		existing_prot = kvm_pgtable_stage2_pte_prot(pte);
-		if (kvm_granule_size(level) != PAGE_SIZE) {
+		if (kvm_granule_size(level) != s2_map_size) {
 			/* Break block mapping */
 			action = PGTABLE_MAP_GPA;
 		} else {
@@ -603,14 +673,14 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 	}
 
 	if (action == PGTABLE_MAP_GPA) {
-		read_unlock(&kvm->mmu_lock);
+		kvm_spe_unlock_mmu(kvm, exclusive_access);
 		ret = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_cache,
 				kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
 		if (ret) {
 			kvm_release_faultin_page(kvm, page, false, make_writable);
 			goto out;
 		}
-		read_lock(&kvm->mmu_lock);
+		exclusive_access = kvm_spe_lock_mmu(kvm, s2_map_size, logging_active);
 		if (mmu_invalidate_retry(kvm, mmu_seq)) {
 			ret = MAP_GPA_RET_NOTIFIER_RETRY;
 			goto mmu_unlock;
@@ -642,7 +712,7 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 
 	if (!pp && !kvm_pte_valid(pte) && kvm_has_mte(kvm)) {
 		if (mte_allowed) {
-			kvm_sanitise_mte_tags(kvm, hfn, PAGE_SIZE);
+			kvm_sanitise_mte_tags(kvm, hfn, s2_map_size);
 		} else {
 			ret = -EFAULT;
 			goto mmu_unlock;
@@ -653,7 +723,7 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 	if (action & PGTABLE_RELAX_PERMS) {
 		ret = kvm_pgtable_stage2_relax_perms(pgt, gpa, prot, flags);
 	} else if (action & PGTABLE_MAP_GPA) {
-		ret = kvm_pgtable_stage2_map(pgt, gpa, PAGE_SIZE, hpa, prot,
+		ret = kvm_pgtable_stage2_map(pgt, gpa, s2_map_size, hpa, prot,
 					     &vcpu->arch.mmu_page_cache, flags);
 	}
 	if (ret)
@@ -662,14 +732,22 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 	if (action & PGTABLE_MAKE_YOUNG)
 		kvm_pgtable_stage2_mkyoung(pgt, gpa, flags);
 
+	if (exclusive_access &&
+	    (action & (PGTABLE_MAKE_YOUNG | PGTABLE_RELAX_PERMS)) == PGTABLE_RELAX_PERMS) {
+		kvm_call_hyp(__kvm_tlb_flush_vmid_range, &kvm->arch.mmu, gpa,
+			     PHYS_PFN(s2_map_size));
+	}
+
 	if (pp) {
 		pp->writable = make_writable;
-		set_bit(vcpu->vcpu_idx, pp->vcpus);
+		if (!test_bit(vcpu->vcpu_idx, pp->vcpus))
+			set_bit(vcpu->vcpu_idx, pp->vcpus);
 
 		ret = MAP_GPA_RET_PAGE_EXIST;
 	} else {
 		pinned_page->page = page;
 		pinned_page->gfn = gfn;
+		pinned_page->s2_map_size = s2_map_size;
 		pinned_page->writable = make_writable;
 		set_bit(vcpu->vcpu_idx, pinned_page->vcpus);
 
@@ -679,31 +757,65 @@ static int kvm_spe_map_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t hfn, stru
 			goto pages_unlock;
 		}
 
-		kvm_spe_add_locked_mem(vcpu, PAGE_SIZE);
+		kvm_spe_add_locked_mem(vcpu, s2_map_size);
 		ret = 0;
 	}
 
+	if (!ret && make_writable && s2_map_size > PAGE_SIZE &&
+	    kvm_slot_dirty_track_enabled(memslot)) {
+		/*
+		 * Unmap the huge page from stage 2 after unpinning to
+		 * resume normal dirty page logging.
+		 */
+		pinned_page->unmap_after_unpin = true;
+	}
+
 pages_unlock:
 	xa_unlock(pinned_pages);
 mmu_unlock:
 	kvm_release_faultin_page(kvm, page, ret < 0, make_writable);
-	if (!ret && make_writable)
-		kvm_vcpu_mark_page_dirty(vcpu, gfn);
-
-	read_unlock(&kvm->mmu_lock);
+	if (!ret && make_writable) {
+		for (int i = 0; i < PHYS_PFN(s2_map_size); i++, gfn++)
+			mark_page_dirty_in_slot(kvm, memslot, gfn);
+	}
+	kvm_spe_unlock_mmu(kvm, exclusive_access);
 out:
 	return ret;
 }
 
-static int kvm_spe_pin_hva_locked(hva_t hva, bool make_writable, struct page **page)
+static unsigned long kvm_spe_compute_stage2_map_size(struct kvm *kvm, gfn_t gfn, hva_t hva,
+						     struct vm_area_struct *vma)
+{
+	unsigned long map_size = PAGE_SIZE;
+
+	if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP)) {
+		map_size = huge_page_size(hstate_vma(vma));
+		/* Stage 2 supports only PMD_SIZE huge mappings. */
+		if (map_size > PMD_SIZE)
+			map_size = PMD_SIZE;
+		if (!kvm_stage2_supports_map_size(gfn_to_memslot(kvm, gfn), hva, map_size))
+			map_size = PAGE_SIZE;
+	}
+
+	return map_size;
+}
+
+static int kvm_spe_pin_hva_locked(hva_t hva, bool make_writable, struct vm_area_struct *vma,
+				  struct page **page)
 {
 	unsigned int gup_flags;
 	long nr_pages;
 
+	gup_flags = FOLL_LONGTERM | FOLL_HONOR_NUMA_FAULT | FOLL_HWPOISON;
+	if (make_writable)
+		gup_flags |= FOLL_WRITE;
 	/*
-	 * FOLL_SPLIT_PMD is what allows us to ignore the order of the folio and
-	 * how the page is mapped in the host and operate on a single page
-	 * instead of a higher order folio.
+	 * When the VMA is backed by hugetlb, the memory will be mapped with a
+	 * block mapping at stage 2.
+	 *
+	 * In the non-hugetlb case, FOLL_SPLIT_PMD is what allows us to ignore
+	 * the order of the folio and how the page is mapped in the host and
+	 * operate on a single page instead of a higher order folio.
 	 *
 	 * Let's assume that we don't use FOLL_SPLIT_PMD and the pinned page is
 	 * mapped with a block mapping in the host's stage 1.  kvm_spe_map_gpa()
@@ -722,9 +834,8 @@ static int kvm_spe_pin_hva_locked(hva_t hva, bool make_writable, struct page **p
 	 * true: a higher order folio can be split into PTEs regardless of its
 	 * elevated reference count (see split_huge_pmd()).
 	 */
-	gup_flags = FOLL_LONGTERM | FOLL_SPLIT_PMD | FOLL_HONOR_NUMA_FAULT | FOLL_HWPOISON;
-	if (make_writable)
-		gup_flags |= FOLL_WRITE;
+	if (!is_vm_hugetlb_page(vma))
+		gup_flags |= FOLL_SPLIT_PMD;
 
 	nr_pages = pin_user_pages(hva, 1, gup_flags, page);
 
@@ -753,7 +864,8 @@ static int kvm_spe_find_hva(struct kvm *kvm, gfn_t gfn, bool make_writable, hva_
 	return 0;
 }
 
-static bool kvm_spe_test_gpa_pinned(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
+static bool kvm_spe_test_gpa_pinned(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned long s2_map_size,
+				    bool make_writable)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
@@ -766,17 +878,21 @@ static bool kvm_spe_test_gpa_pinned(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_
 		goto out_unlock;
 
 	/*
-	 * Only happens if the buffer overlaps with a translation table, which
-	 * is almost certainly a guest bug and hopefully exceedingly rare. To
-	 * avoid unnecessary complexity, pretend that the gpa is not pinned, and
-	 * kvm_spe_map_gpa() will fix things up. Sure, it means doing a lot of
-	 * unnecessary work, but it's all on the guest for programming the
-	 * buffer with the wrong translations.
+	 * Should never happen. The buffer is mapped at stage 2 with a block
+	 * mapping if it's backed by a hugetlb page in the host, otherwise it's
+	 * mapped with PAGE_SIZE granularity. On the host side, changing a
+	 * mapping from a PAGE_SIZE page to a hugetlb page, or the other way
+	 * around, is performed only after userspace explictely unmaps the
+	 * memory.  kvm_spe_adjust_range_end() will unpin the affected buffer
+	 * page(s) when memory is unmapped by userspace.
 	 */
+	WARN_ON_ONCE(pp->s2_map_size != s2_map_size);
+
 	if (make_writable && !pp->writable)
 		goto out_unlock;
 
-	set_bit(vcpu->vcpu_idx, pp->vcpus);
+	if (!test_bit(vcpu->vcpu_idx, pp->vcpus))
+		set_bit(vcpu->vcpu_idx, pp->vcpus);
 
 	xa_unlock(pinned_pages);
 	return true;
@@ -793,7 +909,7 @@ static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 	struct kvm *kvm = vcpu->kvm;
 	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
 	struct pinned_page *pinned_page;
-	unsigned long mmu_seq, tries;
+	unsigned long mmu_seq, tries, s2_map_size;
 	struct vm_area_struct *vma;
 	gfn_t gfn = PHYS_PFN(gpa);
 	bool writable = false, mte_allowed = false;
@@ -804,32 +920,50 @@ static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 
 	WARN_ON_ONCE(!srcu_read_lock_held(&vcpu->kvm->srcu));
 
-	/*
-	 * For each buffer page, KVM needs to pin up to four pages, one for each
-	 * level of the guest's stage 1 translation tables. The first level
-	 * table is shared between each page of the buffer, and likely some of
-	 * the next levels too, so it's worth checking if a gpa is already
-	 * pinned.
-	 */
-	if (kvm_spe_test_gpa_pinned(vcpu, gpa, make_writable))
-		return 0;
-
 	ret = kvm_spe_find_hva(kvm, gfn, make_writable, &hva);
 	if (ret)
 		return ret;
 
 	scoped_guard(mmap_read_lock, current->mm) {
-		if (kvm_has_mte(kvm)) {
-			vma = vma_lookup(current->mm, hva);
-			if (!vma) {
-				kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
-				return -EFAULT;
-			}
-			mte_allowed = kvm_vma_mte_allowed(vma);
+		vma = vma_lookup(current->mm, hva);
+		if (unlikely(!vma)) {
+			kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
+			return -EFAULT;
+		}
+
+		s2_map_size = kvm_spe_compute_stage2_map_size(kvm, gfn, hva, vma);
+		if (s2_map_size == PMD_SIZE) {
+			/*
+			 * Make the adjustements before searching for the gpa in
+			 * pinned_pages.
+			 */
+			hva &= PMD_MASK;
+			gpa &= PMD_MASK;
+			gfn = PHYS_PFN(gpa);
+			/*
+			 * Dirty page tracking cannot be enabled on read-only
+			 * memslots.
+			 */
+			if (kvm_slot_dirty_track_enabled(gfn_to_memslot(kvm, gfn)))
+				make_writable = true;
 		}
-		ret = kvm_spe_pin_hva_locked(hva, make_writable, &page);
+
+		/*
+		 * For each buffer page, KVM needs to pin up to four pages, one
+		 * for each level of the guest's stage 1 translation tables. The
+		 * first level table is shared between each page of the buffer,
+		 * and likely some of the next levels too, so it's worth
+		 * checking if a gpa is already pinned.
+		 */
+		if (kvm_spe_test_gpa_pinned(vcpu, gpa, s2_map_size, make_writable))
+			return 0;
+
+		ret = kvm_spe_pin_hva_locked(hva, make_writable, vma, &page);
 		if (ret)
 			return ret;
+
+		if (kvm_has_mte(kvm))
+			mte_allowed = kvm_vma_mte_allowed(vma);
 	}
 
 	pinned_page = kzalloc(sizeof(*pinned_page), GFP_KERNEL_ACCOUNT);
@@ -848,7 +982,7 @@ static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 
 	get_page(page);
 	ret = kvm_spe_map_gpa(vcpu, gpa, hfn, page, make_writable, mte_allowed, mmu_seq,
-			      pinned_page);
+			      pinned_page, s2_map_size);
 	tries = 1;
 
 	while (ret == MAP_GPA_RET_NOTIFIER_RETRY) {
@@ -867,7 +1001,7 @@ static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 
 		hfn = kvm_faultin_pfn(vcpu, gfn, make_writable, &writable, &retry_page);
 		if (hfn == KVM_PFN_ERR_HWPOISON) {
-			send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SIZE, current);
+			send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, s2_map_size, current);
 			ret = 0;
 			goto out_release;
 		}
@@ -887,7 +1021,7 @@ static int kvm_spe_pin_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, bool make_writable)
 		}
 
 		ret = kvm_spe_map_gpa(vcpu, gpa, hfn, page, make_writable, mte_allowed, mmu_seq,
-				      pinned_page);
+				      pinned_page, s2_map_size);
 		/*
 		 * Choose the number of VCPUs as the limit on retrying because
 		 * the guest can enable SPE on all VCPUs at the same, and
@@ -1420,6 +1554,28 @@ void kvm_spe_handle_req_memlock(struct kvm_vcpu *vcpu)
 	vcpu_spe->locked_mem_excess = 0;
 }
 
+bool kvm_spe_gfn_is_pinned(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct xarray *pinned_pages = &kvm->arch.kvm_spe.pinned_pages;
+	bool is_pinned = false;
+	struct pinned_page *pinned_page;
+
+	xa_lock(pinned_pages);
+	if (xa_empty(pinned_pages))
+		goto unlock;
+
+	pinned_page = xa_load(pinned_pages, gfn);
+	if (pinned_page) {
+		is_pinned = true;
+		WARN_ON_ONCE(!pinned_page->writable);
+	}
+
+unlock:
+	xa_unlock(pinned_pages);
+	return is_pinned;
+}
+
 static u64 max_buffer_size_to_pmbidr_el1(u64 size)
 {
 	u64 msb_idx, num_bits;
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH v6 35/35] KVM: arm64: Allow the creation of a SPE enabled VM
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (33 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 34/35] KVM: arm64: Add hugetlb support for SPE Alexandru Elisei
@ 2025-11-14 16:07 ` Alexandru Elisei
  2025-12-11 16:34 ` [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Leo Yan
  35 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-11-14 16:07 UTC (permalink / raw)
  To: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm
  Cc: james.clark, mark.rutland, james.morse

Everything is in place, allow userspace to enable SPE for a virtual
machine.

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/include/asm/kvm_host.h | 2 +-
 arch/arm64/kvm/Kconfig            | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index b730401717b5..1c987274556e 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -40,7 +40,7 @@
 
 #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
 
-#define KVM_VCPU_MAX_FEATURES 9
+#define KVM_VCPU_MAX_FEATURES 10
 #define KVM_VCPU_VALID_FEATURES	(BIT(KVM_VCPU_MAX_FEATURES) - 1)
 
 #define KVM_REQ_SLEEP \
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 31388b5b2655..f746df3c2c28 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -84,9 +84,9 @@ config PTDUMP_STAGE2_DEBUGFS
 	  If in doubt, say N.
 
 config KVM_ARM_SPE
-	bool
+	bool "Support SPE in guest"
 	depends on KVM && ARM_SPE_PMU=y
-	default n
+	default y
 	help
 	  Adds support for Statistical Profiling Extension (SPE) in virtual
 	  machines.
-- 
2.51.2



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support
  2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
                   ` (34 preceding siblings ...)
  2025-11-14 16:07 ` [RFC PATCH v6 35/35] KVM: arm64: Allow the creation of a SPE enabled VM Alexandru Elisei
@ 2025-12-11 16:34 ` Leo Yan
  2025-12-12 10:18   ` Alexandru Elisei
  35 siblings, 1 reply; 49+ messages in thread
From: Leo Yan @ 2025-12-11 16:34 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

Hi Alexandru,

Just couples general questions for myself to easier understand the series
(sorry if I asked duplicated questions).

> I wanted the focus to be on pinning memory at stage 2 (that's patches #29, 'KVM:
> arm64: Pin the SPE buffer in the host and map it at stage 2', to #3, 'KVM:
> arm64: Add hugetlb support for SPE') and I would very much like to start a
> discussion around that.

I am confused for "pinning memory at stage 2" and then I read "Pin the
SPE buffer in the host".  I read Chapter 2 Specification, ARM DEN 0154,
my conclusion is:

1) You set PMBLIMITR_EL1.nVM == 0 (virtual address mode) so that the
   driver uses the same mode whether it is running in a host or in a
   guest.

2) The KVM hypervisor needs to parse the VA -> IPA -> PA with:

   Guest stage-1 table (managed in guest OS);
   Guest stage-2 table (managed in KVM hypervisor);

3) In the end, the KVM hypervisor pins physical pages on the host
   stage-1 page table for:

   The physical pages are pinned for Guest stage-1 table;
   The physical pages are pinned for Guest stage-2 table;
   The physical pages are pinned for used for TRBE buffer in guest.

Due the host might migrate or swap pages, so all the pin operations
happen on the host's page table.  The pin operations never to be set up
in guest's stage-2 table, right?

> The problem
> ===========
> 
> When the Statistical Profiling Unit (SPU from now on) encounter a fault when
> it attempts to write a record to memory, two things happen: profiling is
> stopped, and the fault is reported to the CPU via an interrupt, not an
> exception. This creates a blackout window during which the CPU executes
> instructions which aren't profiled. The SPE driver avoid this by keeping the
> buffer mapped while ProfilingBufferEnabled() = true. But when running as a
> guest under KVM, the SPU will trigger stage 2 faults, with the associated
> blackout windows.

My understanding is that there are two prominent challenges for SPE
virtualization:

1) Allocation: we need to allocate trace buffer with mapping both
   guest's stage-1 and stage-2 before enabling SPU.  (For me, the free
   buffer is never an issue as we always disable the SPU before
   releasing the resource).

2) Pin: the physical pages used by trace buffer and the relevant stage-1
   and stage-2 tables must be pinned during the session.

I will read (and learn) the patches in next days.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support
  2025-12-11 16:34 ` [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Leo Yan
@ 2025-12-12 10:18   ` Alexandru Elisei
  2025-12-12 11:15     ` Leo Yan
  0 siblings, 1 reply; 49+ messages in thread
From: Alexandru Elisei @ 2025-12-12 10:18 UTC (permalink / raw)
  To: Leo Yan
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

Hi Leo,

On Thu, Dec 11, 2025 at 04:34:25PM +0000, Leo Yan wrote:
> Hi Alexandru,
> 
> Just couples general questions for myself to easier understand the series
> (sorry if I asked duplicated questions).
> 
> > I wanted the focus to be on pinning memory at stage 2 (that's patches #29, 'KVM:
> > arm64: Pin the SPE buffer in the host and map it at stage 2', to #3, 'KVM:
> > arm64: Add hugetlb support for SPE') and I would very much like to start a
> > discussion around that.
> 
> I am confused for "pinning memory at stage 2" and then I read "Pin the
> SPE buffer in the host".  I read Chapter 2 Specification, ARM DEN 0154,
> my conclusion is:
> 
> 1) You set PMBLIMITR_EL1.nVM == 0 (virtual address mode) so that the
>    driver uses the same mode whether it is running in a host or in a
>    guest.

KVM does not advertise FEAT_SPE_nVM and treats PMBLIMITR_EL1.nVM as RES0 on a
guest access. The value of PMSCR_EL2.EnVM is always zero while a guest is
running.

So yes, and the Linux driver is not aware of physical addressing mode and
that's what I used for testing.

> 
> 2) The KVM hypervisor needs to parse the VA -> IPA -> PA with:
> 
>    Guest stage-1 table (managed in guest OS);

Yes.

>    Guest stage-2 table (managed in KVM hypervisor);

Yes.

> 
> 3) In the end, the KVM hypervisor pins physical pages on the host
>    stage-1 page table for:

By 'pin' meaning using pin_user_pages(), yes.

> 
>    The physical pages are pinned for Guest stage-1 table;

Yes.

>    The physical pages are pinned for Guest stage-2 table;

Yes and no. The pages allocated for the stage 2 translation tables are not
mapped in the host's userspace, they are mapped in the kernel linear address
space. This means that they are not subject to migration/swap/compaction/etc,
they will only be reused after KVM frees them.

But that's how KVM manages stage 2 for all VMs, so maybe I misunderstood what
you were saying.

>    The physical pages are pinned for used for TRBE buffer in guest.

SPE, but yes, the same principle.

> 
> Due the host might migrate or swap pages, so all the pin operations
> happen on the host's page table.  The pin operations never to be set up
> in guest's stage-2 table, right?

I'm not sure what you mean.

> 
> > The problem
> > ===========
> > 
> > When the Statistical Profiling Unit (SPU from now on) encounter a fault when
> > it attempts to write a record to memory, two things happen: profiling is
> > stopped, and the fault is reported to the CPU via an interrupt, not an
> > exception. This creates a blackout window during which the CPU executes
> > instructions which aren't profiled. The SPE driver avoid this by keeping the
> > buffer mapped while ProfilingBufferEnabled() = true. But when running as a
> > guest under KVM, the SPU will trigger stage 2 faults, with the associated
> > blackout windows.
> 
> My understanding is that there are two prominent challenges for SPE
> virtualization:
> 
> 1) Allocation: we need to allocate trace buffer with mapping both
>    guest's stage-1 and stage-2 before enabling SPU.  (For me, the free

It's the guest responsibility to map the buffer in the guest stage 1 before
enabling it. When the guest enables the buffer, KVM walks the guest's stage 1
and if it doesn't find a translation for a buffer guest VA, it will inject a
profiling buffer management event to the guest, with EC stage 1 data abort.

If the buffer was mapped in the guest stage 1 when the guest enabled the buffer,
but at same point in the future the guest unmaps the buffer from stage 1, the
statistical profiling unit might encounter a stage 1 data abort when attempting
to write to memory. If that's the case, the interrupt is taken by the host, and
KVM will inject the buffer management event back to the guest.

>    buffer is never an issue as we always disable the SPU before
>    releasing the resource).
> 
> 2) Pin: the physical pages used by trace buffer and the relevant stage-1
>    and stage-2 tables must be pinned during the session.

If by pinning you mean pin_user_pages() and friends, then KVM does not need to
do that for the stage 2 tables, pin_user_pages() makes sense only for userspace
addresses.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support
  2025-12-12 10:18   ` Alexandru Elisei
@ 2025-12-12 11:15     ` Leo Yan
  2025-12-12 11:54       ` Alexandru Elisei
  0 siblings, 1 reply; 49+ messages in thread
From: Leo Yan @ 2025-12-12 11:15 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

On Fri, Dec 12, 2025 at 10:18:27AM +0000, Alexandru Elisei wrote:

[...]


> > 3) In the end, the KVM hypervisor pins physical pages on the host
> >    stage-1 page table for:
> 
> By 'pin' meaning using pin_user_pages(), yes.
> 
> > 
> >    The physical pages are pinned for Guest stage-1 table;
> 
> Yes.
> 
> >    The physical pages are pinned for Guest stage-2 table;
> 
> Yes and no. The pages allocated for the stage 2 translation tables are not
> mapped in the host's userspace, they are mapped in the kernel linear address
> space. This means that they are not subject to migration/swap/compaction/etc,
> they will only be reused after KVM frees them.
> 
> But that's how KVM manages stage 2 for all VMs, so maybe I misunderstood what
> you were saying.

No, you did not misunderstand.  I did not understand stage-2 table
allocation before — it is allocated by KVM, not from user memory via
the VMM.

[...]

> > Due the host might migrate or swap pages, so all the pin operations
> > happen on the host's page table.  The pin operations never to be set up
> > in guest's stage-2 table, right?
> 
> I'm not sure what you mean.

Never mind.  I think you have answered this below (pin user memory via
pin_user_pages() and no matter with stage-2 tables).

> > My understanding is that there are two prominent challenges for SPE
> > virtualization:
> > 
> > 1) Allocation: we need to allocate trace buffer with mapping both
> >    guest's stage-1 and stage-2 before enabling SPU.  (For me, the free
> 
> It's the guest responsibility to map the buffer in the guest stage 1 before
> enabling it. When the guest enables the buffer, KVM walks the guest's stage 1
> and if it doesn't find a translation for a buffer guest VA, it will inject a
> profiling buffer management event to the guest, with EC stage 1 data abort.

IIUC, KVM will inject a buffer management interrupt to guest and then
guest driver can detect EC="stage 1 data abort".  KVM does not raise a
data abort exception in this case.

> If the buffer was mapped in the guest stage 1 when the guest enabled the buffer,
> but at same point in the future the guest unmaps the buffer from stage 1, the
> statistical profiling unit might encounter a stage 1 data abort when attempting
> to write to memory. If that's the case, the interrupt is taken by the host, and
> KVM will inject the buffer management event back to the guest.

Hmm... just a note, it would be straightforward for guest to directly
respond IRQ for "stage-1 data abort" (TBH, I don't know how to inject
IRQ vs fast-forward IRQ, you could ignore this note until I dig a bit).

Thanks for quick response.  The info is quite helpful for me.

Leo


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support
  2025-12-12 11:15     ` Leo Yan
@ 2025-12-12 11:54       ` Alexandru Elisei
  0 siblings, 0 replies; 49+ messages in thread
From: Alexandru Elisei @ 2025-12-12 11:54 UTC (permalink / raw)
  To: Leo Yan
  Cc: maz, oliver.upton, joey.gouly, suzuki.poulose, yuzenghui, will,
	catalin.marinas, linux-arm-kernel, kvmarm, james.clark,
	mark.rutland, james.morse

Hi Leo,

On Fri, Dec 12, 2025 at 11:15:41AM +0000, Leo Yan wrote:
> On Fri, Dec 12, 2025 at 10:18:27AM +0000, Alexandru Elisei wrote:
> 
> [...]
> 
> 
> > > 3) In the end, the KVM hypervisor pins physical pages on the host
> > >    stage-1 page table for:
> > 
> > By 'pin' meaning using pin_user_pages(), yes.
> > 
> > > 
> > >    The physical pages are pinned for Guest stage-1 table;
> > 
> > Yes.
> > 
> > >    The physical pages are pinned for Guest stage-2 table;
> > 
> > Yes and no. The pages allocated for the stage 2 translation tables are not
> > mapped in the host's userspace, they are mapped in the kernel linear address
> > space. This means that they are not subject to migration/swap/compaction/etc,
> > they will only be reused after KVM frees them.
> > 
> > But that's how KVM manages stage 2 for all VMs, so maybe I misunderstood what
> > you were saying.
> 
> No, you did not misunderstand.  I did not understand stage-2 table
> allocation before — it is allocated by KVM, not from user memory via
> the VMM.
> 
> [...]
> 
> > > Due the host might migrate or swap pages, so all the pin operations
> > > happen on the host's page table.  The pin operations never to be set up
> > > in guest's stage-2 table, right?
> > 
> > I'm not sure what you mean.
> 
> Never mind.  I think you have answered this below (pin user memory via
> pin_user_pages() and no matter with stage-2 tables).
> 
> > > My understanding is that there are two prominent challenges for SPE
> > > virtualization:
> > > 
> > > 1) Allocation: we need to allocate trace buffer with mapping both
> > >    guest's stage-1 and stage-2 before enabling SPU.  (For me, the free
> > 
> > It's the guest responsibility to map the buffer in the guest stage 1 before
> > enabling it. When the guest enables the buffer, KVM walks the guest's stage 1
> > and if it doesn't find a translation for a buffer guest VA, it will inject a
> > profiling buffer management event to the guest, with EC stage 1 data abort.
> 
> IIUC, KVM will inject a buffer management interrupt to guest and then
> guest driver can detect EC="stage 1 data abort".  KVM does not raise a
> data abort exception in this case.
> 
> > If the buffer was mapped in the guest stage 1 when the guest enabled the buffer,
> > but at same point in the future the guest unmaps the buffer from stage 1, the
> > statistical profiling unit might encounter a stage 1 data abort when attempting
> > to write to memory. If that's the case, the interrupt is taken by the host, and
> > KVM will inject the buffer management event back to the guest.
> 
> Hmm... just a note, it would be straightforward for guest to directly
> respond IRQ for "stage-1 data abort" (TBH, I don't know how to inject
> IRQ vs fast-forward IRQ, you could ignore this note until I dig a bit).

PMBIRQ is a purely virtual interrupt for KVM. Very early on guest exit, KVM
saves the hardware value for PMBSR_EL1 and clears PMBSR_EL1.S, which leads to
the SPU deasserting PMBIRQ to the GIC.

From my very limited testing, the GIC is always fast enough to deasserts the
interrupt to the CPU before interrupts are enabled much later in the VCPU run
loop. If that doesn't happen (GIC is still asserting PMBIRQ when interrupts on
enabled on the CPU), the SPE driver interrupt handler will treat it as a
spurious interrupt because the driver reads PMBSR_EL1.S = 0.

Thanks,
Alex


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2025-12-16 11:55 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-14 16:06 [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 01/35] arm64/sysreg: Add new SPE fields Alexandru Elisei
2025-12-10 18:38   ` Leo Yan
2025-12-12  9:39     ` Alexandru Elisei
2025-12-15 21:42   ` Suzuki K Poulose
2025-11-14 16:06 ` [RFC PATCH v6 02/35] arm64/sysreg: Define MDCR_EL2.E2PB values Alexandru Elisei
2025-12-15 21:33   ` Suzuki K Poulose
2025-11-14 16:06 ` [RFC PATCH v6 03/35] KVM: arm64: Add CONFIG_KVM_ARM_SPE Kconfig option Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 04/35] perf: arm_spe_pmu: Move struct arm_spe_pmu to a separate header file Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 05/35] KVM: arm64: Add KVM_CAP_ARM_SPE capability Alexandru Elisei
2025-12-14 12:18   ` Leo Yan
2025-12-15 11:46     ` Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 06/35] KVM: arm64: Add KVM_ARM_VCPU_SPE VCPU feature Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 07/35] HACK! KVM: arm64: Disable SPE virtualization if protected KVM is enabled Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 08/35] HACK! KVM: arm64: Enable SPE virtualization only in VHE mode Alexandru Elisei
2025-12-15 17:49   ` Leo Yan
2025-11-14 16:06 ` [RFC PATCH v6 09/35] HACK! KVM: arm64: Disable SPE virtualization if nested virt is enabled Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 10/35] KVM: arm64: Add a new VCPU device control group for SPE Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 11/35] KVM: arm64: Add SPE VCPU device attribute to set the interrupt number Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 12/35] KVM: arm64: Add SPE VCPU device attribute to set the SPU device Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 13/35] perf: arm_spe_pmu: Add PMBIDR_EL1 to struct arm_spe_pmu Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 14/35] KVM: arm64: Add SPE VCPU device attribute to set the max buffer size Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 15/35] KVM: arm64: Add SPE VCPU device attribute to initialize SPE Alexandru Elisei
2025-11-14 16:06 ` [RFC PATCH v6 16/35] KVM: arm64: Advertise SPE version in ID_AA64DFR0_EL1.PMSver Alexandru Elisei
2025-12-16 11:40   ` Suzuki K Poulose
2025-11-14 16:06 ` [RFC PATCH v6 17/35] KVM: arm64: Add writable SPE system registers to VCPU context Alexandru Elisei
2025-12-16 11:54   ` Suzuki K Poulose
2025-11-14 16:06 ` [RFC PATCH v6 18/35] perf: arm_spe_pmu: Add PMSIDR_EL1 to struct arm_spe_pmu Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 19/35] KVM: arm64: Trap PMBIDR_EL1 and PMSIDR_EL1 Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 20/35] KVM: arm64: config: Use functions from spe.c to test FEAT_SPE_{FnE,FDS} Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 21/35] KVM: arm64: Check for unsupported CPU early in kvm_arch_vcpu_load() Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 22/35] KVM: arm64: VHE: Context switch SPE state Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 23/35] KVM: arm64: Allow guest SPE physical timestamps only if perfmon_capable() Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 24/35] KVM: arm64: Handle SPE hardware maintenance interrupts Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 25/35] KVM: arm64: Add basic handling of SPE buffer control registers writes Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 26/35] KVM: arm64: Add comment to explain how trapped SPE registers are handled Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 27/35] KVM: arm64: Make MTE functions public Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 28/35] KVM: arm64: at: Use callback for reading descriptor Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 29/35] KVM: arm64: Pin the SPE buffer in the host and map it at stage 2 Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 30/35] KVM: Propagate MMU event to the MMU notifier handlers Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 31/35] KVM: arm64: Handle MMU notifiers for the SPE buffer Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 32/35] KVM: Add KVM_EXIT_RLIMIT exit_reason Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 33/35] KVM: arm64: Implement locked memory accounting for the SPE buffer Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 34/35] KVM: arm64: Add hugetlb support for SPE Alexandru Elisei
2025-11-14 16:07 ` [RFC PATCH v6 35/35] KVM: arm64: Allow the creation of a SPE enabled VM Alexandru Elisei
2025-12-11 16:34 ` [RFC PATCH v6 00/35] KVM: arm64: Add Statistical Profiling Extension (SPE) support Leo Yan
2025-12-12 10:18   ` Alexandru Elisei
2025-12-12 11:15     ` Leo Yan
2025-12-12 11:54       ` Alexandru Elisei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).