[PATCH 0/3] perf: arm_spe: Add support for SPE VM interface

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface
@ 2025-07-01 15:31 James Clark
  2025-07-01 15:31 ` [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer James Clark
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: James Clark @ 2025-07-01 15:31 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy
  Cc: linux-arm-kernel, linux-perf-users, linux-kernel, James Clark

SPE can be used from within a guest as long as the driver adheres to the
new VM interface spec [1]. Because the driver should behave correctly
whether it's running in a guest or not, the first patches are marked as
a fix. Furthermore, in future versions of the architecture the PE will
be allowed to behave in the same way.

The last patch adds new behavior to make it easier for guests to be
able to reserve large buffers. It's not strictly necessary, so it's not
marked as a fix.

[1]: https://developer.arm.com/documentation/den0154/latest/

Signed-off-by: James Clark <james.clark@linaro.org>
---
James Clark (3):
      perf: arm_spe: Add barrier before enabling profiling buffer
      perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
      perf: arm_spe: Add support for SPE VM interface

 arch/arm64/include/asm/sysreg.h |  1 +
 arch/arm64/tools/sysreg         |  6 ++++-
 drivers/perf/arm_spe_pmu.c      | 60 ++++++++++++++++++++++++++++++++---------
 3 files changed, 54 insertions(+), 13 deletions(-)
---
base-commit: d0b3b7b22dfa1f4b515fd3a295b3fd958f9e81af
change-id: 20250609-james-spe-vm-interface-2bb41e238072

Best regards,
-- 
James Clark <james.clark@linaro.org>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer
  2025-07-01 15:31 [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface James Clark
@ 2025-07-01 15:31 ` James Clark
  2025-07-04 14:04   ` Leo Yan
  2025-07-08 14:40   ` Alexandru Elisei
  2025-07-01 15:31 ` [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1 James Clark
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: James Clark @ 2025-07-01 15:31 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy
  Cc: linux-arm-kernel, linux-perf-users, linux-kernel, James Clark

DEN0154 states that PMBPTR_EL1 must not be modified while the profiling
buffer is enabled. Ensure that enabling the buffer comes after setting
PMBPTR_EL1 by inserting an isb().

This only applies to guests for now, but in future versions of the
architecture the PE will be allowed to behave in the same way.

Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
Signed-off-by: James Clark <james.clark@linaro.org>
---
 drivers/perf/arm_spe_pmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 3efed8839a4e..6235ca7ecd48 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -537,6 +537,7 @@ static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
 	limit += (u64)buf->base;
 	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
 	write_sysreg_s(base, SYS_PMBPTR_EL1);
+	isb();
 
 out_write_limit:
 	write_sysreg_s(limit, SYS_PMBLIMITR_EL1);

-- 
2.34.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-01 15:31 [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface James Clark
  2025-07-01 15:31 ` [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer James Clark
@ 2025-07-01 15:31 ` James Clark
  2025-07-04 15:50   ` Leo Yan
  2025-07-09 10:09   ` Alexandru Elisei
  2025-07-01 15:31 ` [PATCH 3/3] perf: arm_spe: Add support for SPE VM interface James Clark
  2025-08-01 13:28 ` [PATCH 0/3] " Alexandru Elisei
  3 siblings, 2 replies; 21+ messages in thread
From: James Clark @ 2025-07-01 15:31 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy
  Cc: linux-arm-kernel, linux-perf-users, linux-kernel, James Clark

DEN0154 states that writes to PMBPTR_EL1 or PMBSR_EL1 must be done while
the buffer is disabled (PMBLIMITR_EL1.E == 0). Re-arrange the interrupt
handler to always disable the buffer for non-spurious interrupts before
doing either.

Most of arm_spe_pmu_disable_and_drain_local() is now always done, so for
faults the only thing left to do is clear PMSCR_EL1.

Elaborate the comment in arm_spe_pmu_disable_and_drain_local() to
explain the ramifications of not doing it in the right order.

Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
Signed-off-by: James Clark <james.clark@linaro.org>
---
 drivers/perf/arm_spe_pmu.c | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 6235ca7ecd48..5829947c8871 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -559,7 +559,12 @@ static void arm_spe_perf_aux_output_end(struct perf_output_handle *handle)
 
 static void arm_spe_pmu_disable_and_drain_local(void)
 {
-	/* Disable profiling at EL0 and EL1 */
+	/*
+	 * To prevent the CONSTRAINED UNPREDICTABLE behavior of either writing
+	 * to memory after the buffer is disabled, or SPE reporting an access
+	 * not allowed event, we must disable sampling before draining the
+	 * buffer.
+	 */
 	write_sysreg_s(0, SYS_PMSCR_EL1);
 	isb();
 
@@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
 	 */
 	irq_work_run();
 
+	/*
+	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
+	 * means that StatisticalProfilingEnabled() == false. So now we can
+	 * safely disable the buffer.
+	 */
+	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
+	isb();
+
+	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
+	write_sysreg_s(0, SYS_PMBSR_EL1);
+
 	switch (act) {
 	case SPE_PMU_BUF_FAULT_ACT_FATAL:
 		/*
-		 * If a fatal exception occurred then leaving the profiling
-		 * buffer enabled is a recipe waiting to happen. Since
-		 * fatal faults don't always imply truncation, make sure
-		 * that the profiling buffer is disabled explicitly before
-		 * clearing the syndrome register.
+		 * To complete the full disable sequence, also disable profiling
+		 * at EL0 and EL1, we don't want to continue at all anymore.
 		 */
-		arm_spe_pmu_disable_and_drain_local();
+		write_sysreg_s(0, SYS_PMSCR_EL1);
 		break;
 	case SPE_PMU_BUF_FAULT_ACT_OK:
 		/*
@@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
 		 * PMBPTR might be misaligned, but we'll burn that bridge
 		 * when we get to it.
 		 */
-		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
+		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
 			arm_spe_perf_aux_output_begin(handle, event);
-			isb();
-		}
 		break;
 	case SPE_PMU_BUF_FAULT_ACT_SPURIOUS:
 		/* We've seen you before, but GCC has the memory of a sieve. */
 		break;
 	}
 
-	/* The buffer pointers are now sane, so resume profiling. */
-	write_sysreg_s(0, SYS_PMBSR_EL1);
 	return IRQ_HANDLED;
 }
 

-- 
2.34.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/3] perf: arm_spe: Add support for SPE VM interface
  2025-07-01 15:31 [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface James Clark
  2025-07-01 15:31 ` [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer James Clark
  2025-07-01 15:31 ` [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1 James Clark
@ 2025-07-01 15:31 ` James Clark
  2025-08-01 13:28 ` [PATCH 0/3] " Alexandru Elisei
  3 siblings, 0 replies; 21+ messages in thread
From: James Clark @ 2025-07-01 15:31 UTC (permalink / raw)
  To: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy
  Cc: linux-arm-kernel, linux-perf-users, linux-kernel, James Clark

DEN0154 describes the new SPE VM interface, which allows the hypervisor
to define a max buffer size hint and to raise a new buffer management
error for exceeding it.

Report the size as a capability to userspace, and prevent larger buffers
from being allocated in the driver. Although it's a only a hint and
smaller buffers may also be disallowed in some scenarios, a larger
buffer is never expected to work so we can fail early. Failures after
arm_spe_pmu_setup_aux() have to happen asynchronously through the buffer
management event.

Signed-off-by: James Clark <james.clark@linaro.org>
---
 arch/arm64/include/asm/sysreg.h |  1 +
 arch/arm64/tools/sysreg         |  6 +++++-
 drivers/perf/arm_spe_pmu.c      | 26 ++++++++++++++++++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index f1bb0d10c39a..9c48a7119aa7 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -367,6 +367,7 @@
 #define PMBSR_EL1_BUF_BSC_MASK		PMBSR_EL1_MSS_MASK
 
 #define PMBSR_EL1_BUF_BSC_FULL		0x1UL
+#define PMBSR_EL1_BUF_BSC_SIZE		0x4UL
 
 /*** End of Statistical Profiling Extension ***/
 
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 8a8cf6874298..d6bb1736f554 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -2976,7 +2976,11 @@ Field	7:0	Attr
 EndSysreg
 
 Sysreg	PMBIDR_EL1	3	0	9	10	7
-Res0	63:12
+Res0	63:48
+Field	47:46	MaxBuffSize_RES
+Field	45:41	MaxBuffSize_E
+Field	40:32	MaxBuffSize_M
+Res0	31:12
 Enum	11:8	EA
 	0b0000	NotDescribed
 	0b0001	Ignored
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 5829947c8871..23f18dc2890c 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -92,6 +92,7 @@ struct arm_spe_pmu {
 	u16					max_record_sz;
 	u16					align;
 	struct perf_output_handle __percpu	*handle;
+	u64					max_buff_size;
 };
 
 #define to_spe_pmu(p) (container_of(p, struct arm_spe_pmu, pmu))
@@ -115,6 +116,7 @@ enum arm_spe_pmu_capabilities {
 	SPE_PMU_CAP_FEAT_MAX,
 	SPE_PMU_CAP_CNT_SZ = SPE_PMU_CAP_FEAT_MAX,
 	SPE_PMU_CAP_MIN_IVAL,
+	SPE_PMU_CAP_MAX_BUFF_SIZE,
 };
 
 static int arm_spe_pmu_feat_caps[SPE_PMU_CAP_FEAT_MAX] = {
@@ -132,6 +134,8 @@ static u32 arm_spe_pmu_cap_get(struct arm_spe_pmu *spe_pmu, int cap)
 		return spe_pmu->counter_sz;
 	case SPE_PMU_CAP_MIN_IVAL:
 		return spe_pmu->min_period;
+	case SPE_PMU_CAP_MAX_BUFF_SIZE:
+		return spe_pmu->max_buff_size;
 	default:
 		WARN(1, "unknown cap %d\n", cap);
 	}
@@ -164,6 +168,7 @@ static struct attribute *arm_spe_pmu_cap_attr[] = {
 	SPE_CAP_EXT_ATTR_ENTRY(ernd, SPE_PMU_CAP_ERND),
 	SPE_CAP_EXT_ATTR_ENTRY(count_size, SPE_PMU_CAP_CNT_SZ),
 	SPE_CAP_EXT_ATTR_ENTRY(min_interval, SPE_PMU_CAP_MIN_IVAL),
+	SPE_CAP_EXT_ATTR_ENTRY(max_buff_size, SPE_PMU_CAP_MAX_BUFF_SIZE),
 	NULL,
 };
 
@@ -631,6 +636,9 @@ arm_spe_pmu_buf_get_fault_act(struct perf_output_handle *handle)
 	case PMBSR_EL1_BUF_BSC_FULL:
 		ret = SPE_PMU_BUF_FAULT_ACT_OK;
 		goto out_stop;
+	case PMBSR_EL1_BUF_BSC_SIZE:
+		err_str = "Buffer size too large";
+		goto out_err;
 	default:
 		err_str = "Unknown buffer status code";
 	}
@@ -896,6 +904,7 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
 	int i, cpu = event->cpu;
 	struct page **pglist;
 	struct arm_spe_pmu_buf *buf;
+	struct arm_spe_pmu *spe_pmu = to_spe_pmu(event->pmu);
 
 	/* We need at least two pages for this to work. */
 	if (nr_pages < 2)
@@ -910,6 +919,10 @@ static void *arm_spe_pmu_setup_aux(struct perf_event *event, void **pages,
 	if (snapshot && (nr_pages & 1))
 		return NULL;
 
+	if (spe_pmu->max_buff_size &&
+	    nr_pages * PAGE_SIZE > spe_pmu->max_buff_size)
+		return NULL;
+
 	if (cpu == -1)
 		cpu = raw_smp_processor_id();
 
@@ -999,6 +1012,17 @@ static void arm_spe_pmu_perf_destroy(struct arm_spe_pmu *spe_pmu)
 	perf_pmu_unregister(&spe_pmu->pmu);
 }
 
+static u64 arm_spe_decode_buf_size(u64 pmbidr)
+{
+	u64 mantissa = FIELD_GET(PMBIDR_EL1_MaxBuffSize_M, pmbidr);
+	u8 exp = FIELD_GET(PMBIDR_EL1_MaxBuffSize_E, pmbidr);
+
+	if (exp == 0)
+		return mantissa << 12;
+
+	return (mantissa | 0b1000000000) << (exp + 11);
+}
+
 static void __arm_spe_pmu_dev_probe(void *info)
 {
 	int fld;
@@ -1033,6 +1057,8 @@ static void __arm_spe_pmu_dev_probe(void *info)
 		return;
 	}
 
+	spe_pmu->max_buff_size = arm_spe_decode_buf_size(reg);
+
 	/* It's now safe to read PMSIDR and figure out what we've got */
 	reg = read_sysreg_s(SYS_PMSIDR_EL1);
 	if (FIELD_GET(PMSIDR_EL1_FE, reg))

-- 
2.34.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer
  2025-07-01 15:31 ` [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer James Clark
@ 2025-07-04 14:04   ` Leo Yan
  2025-07-07 11:22     ` James Clark
  2025-07-08 14:40   ` Alexandru Elisei
  1 sibling, 1 reply; 21+ messages in thread
From: Leo Yan @ 2025-07-04 14:04 UTC (permalink / raw)
  To: James Clark
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

On Tue, Jul 01, 2025 at 04:31:57PM +0100, James Clark wrote:
> DEN0154 states that PMBPTR_EL1 must not be modified while the profiling
> buffer is enabled. Ensure that enabling the buffer comes after setting
> PMBPTR_EL1 by inserting an isb().
> 
> This only applies to guests for now, but in future versions of the
> architecture the PE will be allowed to behave in the same way.
> 
> Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
>  drivers/perf/arm_spe_pmu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
> index 3efed8839a4e..6235ca7ecd48 100644
> --- a/drivers/perf/arm_spe_pmu.c
> +++ b/drivers/perf/arm_spe_pmu.c
> @@ -537,6 +537,7 @@ static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
>  	limit += (u64)buf->base;
>  	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
>  	write_sysreg_s(base, SYS_PMBPTR_EL1);
> +	isb();

I know that you and Alexandru have discussed whether the isb() should
be placed here or after the out_write_limit label. I should have engaged
in the discussion earlier. Sorry for raising the question now.

My understanding is that isb() is not only for synchronizing the write
to PMBPTR_EL1. It also serves as a context synchronization event
between any other SPE register writes and the write to
SYS_PMBLIMITR_EL1.

Let me give an example (perhaps a rare one): if we use perf snapshot
mode or the AUX pause/resume mode, it's possible that the flow does
not trigger an interrupt via overflow. Instead, the sequence might
look like this:

  arm_spe_pmu_stop()
    `> arm_spe_pmu_start()
         `> arm_spe_perf_aux_output_begin()

In this case, to ensure that all SPE system registers are properly
written to the hardware, the safest approach is to always execute isb()
just before writing to SYS_PMBLIMITR_EL1. (In other words, after the
label out_write_limit).

Thanks,
Leo

>  out_write_limit:
>  	write_sysreg_s(limit, SYS_PMBLIMITR_EL1);
> 
> -- 
> 2.34.1
> 
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-01 15:31 ` [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1 James Clark
@ 2025-07-04 15:50   ` Leo Yan
  2025-07-07 11:39     ` James Clark
  2025-07-09 10:09   ` Alexandru Elisei
  1 sibling, 1 reply; 21+ messages in thread
From: Leo Yan @ 2025-07-04 15:50 UTC (permalink / raw)
  To: James Clark
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

On Tue, Jul 01, 2025 at 04:31:58PM +0100, James Clark wrote:

[...]

> @@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>  	 */
>  	irq_work_run();
>  
> +	/*
> +	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
> +	 * means that StatisticalProfilingEnabled() == false. So now we can
> +	 * safely disable the buffer.
> +	 */
> +	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
> +	isb();
> +
> +	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
> +	write_sysreg_s(0, SYS_PMBSR_EL1);
> +

An important thing is about sequence:
As described in arm_spe_pmu_disable_and_drain_local(), should we always
clear ELs bits in PMSCR_EL1 before clear PMBLIMITR_EL1.E bit? As a
reference, we could see TRBE always clear ELx bits before disable trace
buffer.

And a trivial flaw:

If the TRUNCATED flag has been set, the irq_work_run() above runs the
IRQ work to invoke the arm_spe_pmu_stop() to disable trace buffer, which
clear SYS_PMBLIMITR_EL1.E bit. This is why the current code does not
explictly clear SYS_PMBLIMITR_EL1.E bit.

With this patch, the interrupt handler will clear SYS_PMBLIMITR_EL1.E
bit twice for a trunacated case.


>  	switch (act) {
>  	case SPE_PMU_BUF_FAULT_ACT_FATAL:
>  		/*
> -		 * If a fatal exception occurred then leaving the profiling
> -		 * buffer enabled is a recipe waiting to happen. Since
> -		 * fatal faults don't always imply truncation, make sure
> -		 * that the profiling buffer is disabled explicitly before
> -		 * clearing the syndrome register.
> +		 * To complete the full disable sequence, also disable profiling
> +		 * at EL0 and EL1, we don't want to continue at all anymore.
>  		 */
> -		arm_spe_pmu_disable_and_drain_local();
> +		write_sysreg_s(0, SYS_PMSCR_EL1);
>  		break;
>  	case SPE_PMU_BUF_FAULT_ACT_OK:
>  		/*
> @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>  		 * PMBPTR might be misaligned, but we'll burn that bridge
>  		 * when we get to it.
>  		 */
> -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
> +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
>  			arm_spe_perf_aux_output_begin(handle, event);
> -			isb();

I am a bit suspecious we can remove this isb().

As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
for this?

Thanks,
Leo

> -		}
>  		break;
>  	case SPE_PMU_BUF_FAULT_ACT_SPURIOUS:
>  		/* We've seen you before, but GCC has the memory of a sieve. */
>  		break;
>  	}
>  
> -	/* The buffer pointers are now sane, so resume profiling. */
> -	write_sysreg_s(0, SYS_PMBSR_EL1);
>  	return IRQ_HANDLED;
>  }
>  
> 
> -- 
> 2.34.1
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer
  2025-07-04 14:04   ` Leo Yan
@ 2025-07-07 11:22     ` James Clark
  0 siblings, 0 replies; 21+ messages in thread
From: James Clark @ 2025-07-07 11:22 UTC (permalink / raw)
  To: Leo Yan
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel



On 04/07/2025 3:04 pm, Leo Yan wrote:
> On Tue, Jul 01, 2025 at 04:31:57PM +0100, James Clark wrote:
>> DEN0154 states that PMBPTR_EL1 must not be modified while the profiling
>> buffer is enabled. Ensure that enabling the buffer comes after setting
>> PMBPTR_EL1 by inserting an isb().
>>
>> This only applies to guests for now, but in future versions of the
>> architecture the PE will be allowed to behave in the same way.
>>
>> Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
>> Signed-off-by: James Clark <james.clark@linaro.org>
>> ---
>>   drivers/perf/arm_spe_pmu.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
>> index 3efed8839a4e..6235ca7ecd48 100644
>> --- a/drivers/perf/arm_spe_pmu.c
>> +++ b/drivers/perf/arm_spe_pmu.c
>> @@ -537,6 +537,7 @@ static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
>>   	limit += (u64)buf->base;
>>   	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
>>   	write_sysreg_s(base, SYS_PMBPTR_EL1);
>> +	isb();
> 
> I know that you and Alexandru have discussed whether the isb() should
> be placed here or after the out_write_limit label. I should have engaged
> in the discussion earlier. Sorry for raising the question now.
> 
> My understanding is that isb() is not only for synchronizing the write
> to PMBPTR_EL1. It also serves as a context synchronization event
> between any other SPE register writes and the write to
> SYS_PMBLIMITR_EL1.
> 
> Let me give an example (perhaps a rare one): if we use perf snapshot
> mode or the AUX pause/resume mode, it's possible that the flow does
> not trigger an interrupt via overflow. Instead, the sequence might
> look like this:
> 
>    arm_spe_pmu_stop()
>      `> arm_spe_pmu_start()
>           `> arm_spe_perf_aux_output_begin()
> 
> In this case, to ensure that all SPE system registers are properly
> written to the hardware, the safest approach is to always execute isb()
> just before writing to SYS_PMBLIMITR_EL1. (In other words, after the
> label out_write_limit).
> 
> Thanks,
> Leo
> 

I think the point is that any write that enables the buffer must come 
last, but not necessarily all writes. And not all paths in 
arm_spe_perf_aux_output_begin() enable it so the isb() was only added on 
the path that does.

I couldn't see an issue with your example, are you saying 
arm_spe_pmu_stop() could call arm_spe_pmu_start()? It doesn't call it 
directly. Or do you mean the aux pause/resume thing can cause a 
arm_spe_pmu_start() from any point in time? If that was true then it 
doesn't matter where the isb() is because you can never be sure it will 
be before the write.

James

>>   out_write_limit:
>>   	write_sysreg_s(limit, SYS_PMBLIMITR_EL1);
>>
>> -- 
>> 2.34.1
>>
>>
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-04 15:50   ` Leo Yan
@ 2025-07-07 11:39     ` James Clark
  2025-07-07 15:37       ` Leo Yan
  0 siblings, 1 reply; 21+ messages in thread
From: James Clark @ 2025-07-07 11:39 UTC (permalink / raw)
  To: Leo Yan
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel



On 04/07/2025 4:50 pm, Leo Yan wrote:
> On Tue, Jul 01, 2025 at 04:31:58PM +0100, James Clark wrote:
> 
> [...]
> 
>> @@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>>   	 */
>>   	irq_work_run();
>>   
>> +	/*
>> +	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
>> +	 * means that StatisticalProfilingEnabled() == false. So now we can
>> +	 * safely disable the buffer.
>> +	 */
>> +	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
>> +	isb();
>> +
>> +	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
>> +	write_sysreg_s(0, SYS_PMBSR_EL1);
>> +
> 
> An important thing is about sequence:
> As described in arm_spe_pmu_disable_and_drain_local(), should we always
> clear ELs bits in PMSCR_EL1 before clear PMBLIMITR_EL1.E bit? As a
> reference, we could see TRBE always clear ELx bits before disable trace
> buffer.
> 
> And a trivial flaw:
> 
> If the TRUNCATED flag has been set, the irq_work_run() above runs the
> IRQ work to invoke the arm_spe_pmu_stop() to disable trace buffer, which
> clear SYS_PMBLIMITR_EL1.E bit. This is why the current code does not
> explictly clear SYS_PMBLIMITR_EL1.E bit.
> 
> With this patch, the interrupt handler will clear SYS_PMBLIMITR_EL1.E
> bit twice for a trunacated case.
> 
> 

I suppose that's a rarer case that we don't necessarily have to optimize 
for. I don't think it will do any harm, but is it even possible to avoid?

There are already some other duplications in the driver, for example in 
arm_spe_pmu_stop() we call arm_spe_pmu_disable_and_drain_local() which 
drains, and then arm_spe_pmu_buf_get_fault_act() which also drains again.

>>   	switch (act) {
>>   	case SPE_PMU_BUF_FAULT_ACT_FATAL:
>>   		/*
>> -		 * If a fatal exception occurred then leaving the profiling
>> -		 * buffer enabled is a recipe waiting to happen. Since
>> -		 * fatal faults don't always imply truncation, make sure
>> -		 * that the profiling buffer is disabled explicitly before
>> -		 * clearing the syndrome register.
>> +		 * To complete the full disable sequence, also disable profiling
>> +		 * at EL0 and EL1, we don't want to continue at all anymore.
>>   		 */
>> -		arm_spe_pmu_disable_and_drain_local();
>> +		write_sysreg_s(0, SYS_PMSCR_EL1);
>>   		break;
>>   	case SPE_PMU_BUF_FAULT_ACT_OK:
>>   		/*
>> @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>>   		 * PMBPTR might be misaligned, but we'll burn that bridge
>>   		 * when we get to it.
>>   		 */
>> -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
>> +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
>>   			arm_spe_perf_aux_output_begin(handle, event);
>> -			isb();
> 
> I am a bit suspecious we can remove this isb().
> 
> As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
> after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
> for this?
> 
> Thanks,
> Leo
> 

Wasn't this isb() to separate the programming of the registers with the 
status register clear at the end of this function to enable profiling?

But now we enable profiling with the write to PMBLIMITR_EL1 in 
arm_spe_perf_aux_output_begin() and the last thing here is the ERET. 
That's specifically mentioned as enough synchronization in PKLXF:

   In the common case, this is an ERET instruction that returns to a
   different Exception level where tracing is allowed.

>> -		}
>>   		break;
>>   	case SPE_PMU_BUF_FAULT_ACT_SPURIOUS:
>>   		/* We've seen you before, but GCC has the memory of a sieve. */
>>   		break;
>>   	}
>>   
>> -	/* The buffer pointers are now sane, so resume profiling. */
>> -	write_sysreg_s(0, SYS_PMBSR_EL1);
>>   	return IRQ_HANDLED;
>>   }
>>   
>>
>> -- 
>> 2.34.1
>>
>>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-07 11:39     ` James Clark
@ 2025-07-07 15:37       ` Leo Yan
  2025-07-08 14:45         ` Alexandru Elisei
  2025-07-09 10:08         ` Alexandru Elisei
  0 siblings, 2 replies; 21+ messages in thread
From: Leo Yan @ 2025-07-07 15:37 UTC (permalink / raw)
  To: James Clark
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Alexandru Elisei,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

On Mon, Jul 07, 2025 at 12:39:57PM +0100, James Clark wrote:

[...]

> > > @@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > >   	 */
> > >   	irq_work_run();
> > > +	/*
> > > +	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
> > > +	 * means that StatisticalProfilingEnabled() == false. So now we can
> > > +	 * safely disable the buffer.
> > > +	 */
> > > +	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
> > > +	isb();
> > > +
> > > +	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
> > > +	write_sysreg_s(0, SYS_PMBSR_EL1);
> > > +
> > 
> > An important thing is about sequence:
> > As described in arm_spe_pmu_disable_and_drain_local(), should we always
> > clear ELs bits in PMSCR_EL1 before clear PMBLIMITR_EL1.E bit? As a
> > reference, we could see TRBE always clear ELx bits before disable trace
> > buffer.
> > 
> > And a trivial flaw:
> > 
> > If the TRUNCATED flag has been set, the irq_work_run() above runs the
> > IRQ work to invoke the arm_spe_pmu_stop() to disable trace buffer, which
> > clear SYS_PMBLIMITR_EL1.E bit. This is why the current code does not
> > explictly clear SYS_PMBLIMITR_EL1.E bit.
> > 
> > With this patch, the interrupt handler will clear SYS_PMBLIMITR_EL1.E
> > bit twice for a trunacated case.
> 
> I suppose that's a rarer case that we don't necessarily have to optimize
> for. I don't think it will do any harm, but is it even possible to avoid?
> 
> There are already some other duplications in the driver, for example in
> arm_spe_pmu_stop() we call arm_spe_pmu_disable_and_drain_local() which
> drains, and then arm_spe_pmu_buf_get_fault_act() which also drains again.

If we don't need to worry about duplicated operations in the truncated
case, then for easier maintenance and better readability, I'm wondering
if we could simplify the interrupt handler as follows:

    arm_spe_pmu_irq_handler()
    {
        ...

        act = arm_spe_pmu_buf_get_fault_act(handle);
        if (act == SPE_PMU_BUF_FAULT_ACT_SPURIOUS)
                return IRQ_NONE;

        arm_spe_pmu_disable_and_drain_local();

        /* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
        write_sysreg_s(0, SYS_PMBSR_EL1);
        isb();

        switch (act) {
            ...
        }
    }

This approach complies with DEN0154 - we must clear PMBLIMITR_EL1.E
before writing to other SPE system registers (e.g., PMBSR).

The reason for using arm_spe_pmu_disable_and_drain_local() is that we
first need to disable profiling instructions by clearing PMSCR_EL1/EL2,
and then is it safe to disable the profiling buffer.

[...]

> > >   	case SPE_PMU_BUF_FAULT_ACT_OK:
> > >   		/*
> > > @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > >   		 * PMBPTR might be misaligned, but we'll burn that bridge
> > >   		 * when we get to it.
> > >   		 */
> > > -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
> > > +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
> > >   			arm_spe_perf_aux_output_begin(handle, event);
> > > -			isb();
> > 
> > I am a bit suspecious we can remove this isb().
> > 
> > As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
> > after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
> > for this?
>
> Wasn't this isb() to separate the programming of the registers with the
> status register clear at the end of this function to enable profiling?

Enabling profiling buffer followed an isb() is not only for separating
other register programming.

As described in section D17.9, Synchronization and Statistical Profiling
in Arm ARM:

  "A Context Synchronization event guarantees that a direct write to a
  System register made by the PE in program order before the Context
  synchronization event are observable by indirect reads and indirect
  writes of the same System register made by a profiling operation
  relating to a sampled operation in program order after the Context
  synchronization event."

My understanding is: after the ARM SPE profiling is enabled, the
followed ISB is a Synchronization to make sure the system register
values are observed by SPE. And we cannot rely on ERET, especially if
we are tracing the kernel mode.

Thanks,
Leo

> But now we enable profiling with the write to PMBLIMITR_EL1 in
> arm_spe_perf_aux_output_begin() and the last thing here is the ERET. That's
> specifically mentioned as enough synchronization in PKLXF:
> 
>   In the common case, this is an ERET instruction that returns to a
>   different Exception level where tracing is allowed.
> 
> > > -		}
> > >   		break;
> > >   	case SPE_PMU_BUF_FAULT_ACT_SPURIOUS:
> > >   		/* We've seen you before, but GCC has the memory of a sieve. */
> > >   		break;
> > >   	}
> > > -	/* The buffer pointers are now sane, so resume profiling. */
> > > -	write_sysreg_s(0, SYS_PMBSR_EL1);
> > >   	return IRQ_HANDLED;
> > >   }
> > > 
> > > -- 
> > > 2.34.1
> > > 
> > > 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer
  2025-07-01 15:31 ` [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer James Clark
  2025-07-04 14:04   ` Leo Yan
@ 2025-07-08 14:40   ` Alexandru Elisei
  1 sibling, 0 replies; 21+ messages in thread
From: Alexandru Elisei @ 2025-07-08 14:40 UTC (permalink / raw)
  To: James Clark
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Anshuman Khandual,
	Rob Herring, Suzuki Poulose, Robin Murphy, linux-arm-kernel,
	linux-perf-users, linux-kernel

Hi James,

On Tue, Jul 01, 2025 at 04:31:57PM +0100, James Clark wrote:
> DEN0154 states that PMBPTR_EL1 must not be modified while the profiling
> buffer is enabled. Ensure that enabling the buffer comes after setting
> PMBPTR_EL1 by inserting an isb().
> 
> This only applies to guests for now, but in future versions of the
> architecture the PE will be allowed to behave in the same way.
> 
> Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")

A note on why I think this is a fix.

The write to PMBLIMITR_EL1 serves two purposes: to enable the buffer and to set
the limit for the new buffer. The statistical profiling unit (I'm calling it
SPU) performs indirect reads of the registers. Without the ISB between the
buffer pointer write and the write that enables + sets limit for the buffer, I
think it's possible for the SPU to observe the PMBLIMITR_EL1 write before the
PMBPTR_EL1 write. During this time, from the point of view of the SPU, the
buffer is incorrectly programmed, and potentially the old PMBPTR_EL1.PTR > new
PMBLIMITR_EL1.Limit.

arm_spe_perf_aux_output_begin() can be called after sampling has been enabled
(PMSCR_EL1.E1SPE = 1).

Putting it all together, this means that we have all the conditions to break the
restrictions on the current write pointer and the behaviour is UNPREDICTABLE, as
per section D17.7.1, ARM DDI 0487L. I think it's worth pointing out that the SPU
can do any of the folling in this situation: writing to any writable virtual
address in the current owning translation regime (!), generate a profiling
management event, discard all data or don't enable profiling.

Thanks,
Alex

> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
>  drivers/perf/arm_spe_pmu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
> index 3efed8839a4e..6235ca7ecd48 100644
> --- a/drivers/perf/arm_spe_pmu.c
> +++ b/drivers/perf/arm_spe_pmu.c
> @@ -537,6 +537,7 @@ static void arm_spe_perf_aux_output_begin(struct perf_output_handle *handle,
>  	limit += (u64)buf->base;
>  	base = (u64)buf->base + PERF_IDX2OFF(handle->head, buf);
>  	write_sysreg_s(base, SYS_PMBPTR_EL1);
> +	isb();
>  
>  out_write_limit:
>  	write_sysreg_s(limit, SYS_PMBLIMITR_EL1);
> 
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-07 15:37       ` Leo Yan
@ 2025-07-08 14:45         ` Alexandru Elisei
  2025-07-09 10:08         ` Alexandru Elisei
  1 sibling, 0 replies; 21+ messages in thread
From: Alexandru Elisei @ 2025-07-08 14:45 UTC (permalink / raw)
  To: Leo Yan
  Cc: James Clark, Will Deacon, Mark Rutland, Catalin Marinas,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

Hi Leo,

On Mon, Jul 07, 2025 at 04:37:10PM +0100, Leo Yan wrote:
> On Mon, Jul 07, 2025 at 12:39:57PM +0100, James Clark wrote:
> 
> [...]
> 
> > > > @@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > > >   	 */
> > > >   	irq_work_run();
> > > > +	/*
> > > > +	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
> > > > +	 * means that StatisticalProfilingEnabled() == false. So now we can
> > > > +	 * safely disable the buffer.
> > > > +	 */
> > > > +	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
> > > > +	isb();
> > > > +
> > > > +	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
> > > > +	write_sysreg_s(0, SYS_PMBSR_EL1);
> > > > +
> > > 
> > > An important thing is about sequence:
> > > As described in arm_spe_pmu_disable_and_drain_local(), should we always
> > > clear ELs bits in PMSCR_EL1 before clear PMBLIMITR_EL1.E bit? As a
> > > reference, we could see TRBE always clear ELx bits before disable trace
> > > buffer.
> > > 
> > > And a trivial flaw:
> > > 
> > > If the TRUNCATED flag has been set, the irq_work_run() above runs the
> > > IRQ work to invoke the arm_spe_pmu_stop() to disable trace buffer, which
> > > clear SYS_PMBLIMITR_EL1.E bit. This is why the current code does not
> > > explictly clear SYS_PMBLIMITR_EL1.E bit.
> > > 
> > > With this patch, the interrupt handler will clear SYS_PMBLIMITR_EL1.E
> > > bit twice for a trunacated case.
> > 
> > I suppose that's a rarer case that we don't necessarily have to optimize
> > for. I don't think it will do any harm, but is it even possible to avoid?
> > 
> > There are already some other duplications in the driver, for example in
> > arm_spe_pmu_stop() we call arm_spe_pmu_disable_and_drain_local() which
> > drains, and then arm_spe_pmu_buf_get_fault_act() which also drains again.
> 
> If we don't need to worry about duplicated operations in the truncated
> case, then for easier maintenance and better readability, I'm wondering
> if we could simplify the interrupt handler as follows:
> 
>     arm_spe_pmu_irq_handler()
>     {
>         ...
> 
>         act = arm_spe_pmu_buf_get_fault_act(handle);
>         if (act == SPE_PMU_BUF_FAULT_ACT_SPURIOUS)
>                 return IRQ_NONE;
> 
>         arm_spe_pmu_disable_and_drain_local();
> 
>         /* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
>         write_sysreg_s(0, SYS_PMBSR_EL1);
>         isb();
> 
>         switch (act) {
>             ...
>         }
>     }
> 
> This approach complies with DEN0154 - we must clear PMBLIMITR_EL1.E
> before writing to other SPE system registers (e.g., PMBSR).
> 
> The reason for using arm_spe_pmu_disable_and_drain_local() is that we
> first need to disable profiling instructions by clearing PMSCR_EL1/EL2,
> and then is it safe to disable the profiling buffer.

Sampling is disabled when ProfilingBufferEnabled() = false (look at the
pseudocode for SPEPreExecution() in the Arm ARM).  ProfilingBufferEnabled() =
false if PMBLIMITR_EL1.E = 0 *or* PMBSR_EL1.S = 1.

One way to think about it is: if PMBLIMITR_EL1.E = 0, then there's no point in
sampling instructions and creating records because the profiling unit cannot
write them to memory; and if PMBSR_EL1.S = 1, something went wrong with the
buffer, or the buffer is full, and similarly there's no point in sampling
instructions and creating records because, again, the profiling unit cannot
write them to memory.

> 
> [...]
> 
> > > >   	case SPE_PMU_BUF_FAULT_ACT_OK:
> > > >   		/*
> > > > @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > > >   		 * PMBPTR might be misaligned, but we'll burn that bridge
> > > >   		 * when we get to it.
> > > >   		 */
> > > > -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
> > > > +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
> > > >   			arm_spe_perf_aux_output_begin(handle, event);
> > > > -			isb();
> > > 
> > > I am a bit suspecious we can remove this isb().
> > > 
> > > As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
> > > after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
> > > for this?
> >
> > Wasn't this isb() to separate the programming of the registers with the
> > status register clear at the end of this function to enable profiling?
> 
> Enabling profiling buffer followed an isb() is not only for separating
> other register programming.
> 
> As described in section D17.9, Synchronization and Statistical Profiling
> in Arm ARM:
> 
>   "A Context Synchronization event guarantees that a direct write to a
>   System register made by the PE in program order before the Context
>   synchronization event are observable by indirect reads and indirect
>   writes of the same System register made by a profiling operation
>   relating to a sampled operation in program order after the Context
>   synchronization event."
> 
> My understanding is: after the ARM SPE profiling is enabled, the
> followed ISB is a Synchronization to make sure the system register
> values are observed by SPE. And we cannot rely on ERET, especially if
> we are tracing the kernel mode.

DDI 0487L.a, rule R_BWCFK: If the Effective value of SCTLR_ELx.EOS is 1, the
exception return is a Context synchronization event.

Linux always sets EOS.

Thanks,
Alex

> 
> Thanks,
> Leo
> 
> > But now we enable profiling with the write to PMBLIMITR_EL1 in
> > arm_spe_perf_aux_output_begin() and the last thing here is the ERET. That's
> > specifically mentioned as enough synchronization in PKLXF:
> > 
> >   In the common case, this is an ERET instruction that returns to a
> >   different Exception level where tracing is allowed.
> > 
> > > > -		}
> > > >   		break;
> > > >   	case SPE_PMU_BUF_FAULT_ACT_SPURIOUS:
> > > >   		/* We've seen you before, but GCC has the memory of a sieve. */
> > > >   		break;
> > > >   	}
> > > > -	/* The buffer pointers are now sane, so resume profiling. */
> > > > -	write_sysreg_s(0, SYS_PMBSR_EL1);
> > > >   	return IRQ_HANDLED;
> > > >   }
> > > > 
> > > > -- 
> > > > 2.34.1
> > > > 
> > > > 
> > 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-07 15:37       ` Leo Yan
  2025-07-08 14:45         ` Alexandru Elisei
@ 2025-07-09 10:08         ` Alexandru Elisei
  2025-07-14  8:58           ` Leo Yan
  1 sibling, 1 reply; 21+ messages in thread
From: Alexandru Elisei @ 2025-07-09 10:08 UTC (permalink / raw)
  To: Leo Yan
  Cc: James Clark, Will Deacon, Mark Rutland, Catalin Marinas,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

Hi Leo, James,

On Mon, Jul 07, 2025 at 04:37:10PM +0100, Leo Yan wrote:
> On Mon, Jul 07, 2025 at 12:39:57PM +0100, James Clark wrote:
> 
> [...]
> 
> > > > @@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > > >   	 */
> > > >   	irq_work_run();
> > > > +	/*
> > > > +	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
> > > > +	 * means that StatisticalProfilingEnabled() == false. So now we can
> > > > +	 * safely disable the buffer.
> > > > +	 */
> > > > +	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
> > > > +	isb();
> > > > +
> > > > +	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
> > > > +	write_sysreg_s(0, SYS_PMBSR_EL1);
> > > > +
> > > 
> > > An important thing is about sequence:
> > > As described in arm_spe_pmu_disable_and_drain_local(), should we always
> > > clear ELs bits in PMSCR_EL1 before clear PMBLIMITR_EL1.E bit? As a
> > > reference, we could see TRBE always clear ELx bits before disable trace
> > > buffer.
> > > 
> > > And a trivial flaw:
> > > 
> > > If the TRUNCATED flag has been set, the irq_work_run() above runs the
> > > IRQ work to invoke the arm_spe_pmu_stop() to disable trace buffer, which
> > > clear SYS_PMBLIMITR_EL1.E bit. This is why the current code does not
> > > explictly clear SYS_PMBLIMITR_EL1.E bit.
> > > 
> > > With this patch, the interrupt handler will clear SYS_PMBLIMITR_EL1.E
> > > bit twice for a trunacated case.
> > 
> > I suppose that's a rarer case that we don't necessarily have to optimize
> > for. I don't think it will do any harm, but is it even possible to avoid?
> > 
> > There are already some other duplications in the driver, for example in
> > arm_spe_pmu_stop() we call arm_spe_pmu_disable_and_drain_local() which
> > drains, and then arm_spe_pmu_buf_get_fault_act() which also drains again.
> 
> If we don't need to worry about duplicated operations in the truncated
> case, then for easier maintenance and better readability, I'm wondering
> if we could simplify the interrupt handler as follows:
> 
>     arm_spe_pmu_irq_handler()
>     {
>         ...
> 
>         act = arm_spe_pmu_buf_get_fault_act(handle);
>         if (act == SPE_PMU_BUF_FAULT_ACT_SPURIOUS)
>                 return IRQ_NONE;
> 
>         arm_spe_pmu_disable_and_drain_local();
> 
>         /* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
>         write_sysreg_s(0, SYS_PMBSR_EL1);
>         isb();
> 
>         switch (act) {
>             ...
>         }
>     }
> 
> This approach complies with DEN0154 - we must clear PMBLIMITR_EL1.E
> before writing to other SPE system registers (e.g., PMBSR).
> 
> The reason for using arm_spe_pmu_disable_and_drain_local() is that we
> first need to disable profiling instructions by clearing PMSCR_EL1/EL2,
> and then is it safe to disable the profiling buffer.
> 
> [...]
> 
> > > >   	case SPE_PMU_BUF_FAULT_ACT_OK:
> > > >   		/*
> > > > @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > > >   		 * PMBPTR might be misaligned, but we'll burn that bridge
> > > >   		 * when we get to it.
> > > >   		 */
> > > > -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
> > > > +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
> > > >   			arm_spe_perf_aux_output_begin(handle, event);
> > > > -			isb();
> > > 
> > > I am a bit suspecious we can remove this isb().
> > > 
> > > As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
> > > after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
> > > for this?
> >
> > Wasn't this isb() to separate the programming of the registers with the
> > status register clear at the end of this function to enable profiling?
> 
> Enabling profiling buffer followed an isb() is not only for separating
> other register programming.
> 
> As described in section D17.9, Synchronization and Statistical Profiling
> in Arm ARM:
> 
>   "A Context Synchronization event guarantees that a direct write to a
>   System register made by the PE in program order before the Context
>   synchronization event are observable by indirect reads and indirect
>   writes of the same System register made by a profiling operation
>   relating to a sampled operation in program order after the Context
>   synchronization event."
> 
> My understanding is: after the ARM SPE profiling is enabled, the
> followed ISB is a Synchronization to make sure the system register
> values are observed by SPE. And we cannot rely on ERET, especially if
> we are tracing the kernel mode.

Thought about this some more.

Before:

arm_spe_pmu_buf_get_fault_act:
  <drain buffer>
  ISB
arm_spe_perf_aux_output_begin:
  PMBLIMITR_EL1.E = 1
ISB
PMBSR_EL1.S = 0
ERET

Now:

PMBLIMITR_EL1 = 0
ISB

PMBSR_EL1.S = 0
arm_spe_perf_aux_output_begin:
  ISB
  PMBLIMITR_EL1.E = 1
ERET

I don't see much of a difference between the two sequences - the point after
which we can be sure that profiling is enabled remains the ERET from the
exception return.  The only difference is that, before this change, the ERET
synchronized clearing the PMBSR_EL1.S bit, now it synchronizes setting the
PMBLIMITR_EL1.E bit.

Thoughts?

Thanks,
Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-01 15:31 ` [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1 James Clark
  2025-07-04 15:50   ` Leo Yan
@ 2025-07-09 10:09   ` Alexandru Elisei
  1 sibling, 0 replies; 21+ messages in thread
From: Alexandru Elisei @ 2025-07-09 10:09 UTC (permalink / raw)
  To: James Clark
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Anshuman Khandual,
	Rob Herring, Suzuki Poulose, Robin Murphy, linux-arm-kernel,
	linux-perf-users, linux-kernel

Hi James,

On Tue, Jul 01, 2025 at 04:31:58PM +0100, James Clark wrote:
> DEN0154 states that writes to PMBPTR_EL1 or PMBSR_EL1 must be done while
> the buffer is disabled (PMBLIMITR_EL1.E == 0). Re-arrange the interrupt
> handler to always disable the buffer for non-spurious interrupts before
> doing either.
> 
> Most of arm_spe_pmu_disable_and_drain_local() is now always done, so for
> faults the only thing left to do is clear PMSCR_EL1.
> 
> Elaborate the comment in arm_spe_pmu_disable_and_drain_local() to
> explain the ramifications of not doing it in the right order.
> 
> Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
>  drivers/perf/arm_spe_pmu.c | 33 +++++++++++++++++++++------------
>  1 file changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
> index 6235ca7ecd48..5829947c8871 100644
> --- a/drivers/perf/arm_spe_pmu.c
> +++ b/drivers/perf/arm_spe_pmu.c
> @@ -559,7 +559,12 @@ static void arm_spe_perf_aux_output_end(struct perf_output_handle *handle)
>  
>  static void arm_spe_pmu_disable_and_drain_local(void)
>  {
> -	/* Disable profiling at EL0 and EL1 */
> +	/*
> +	 * To prevent the CONSTRAINED UNPREDICTABLE behavior of either writing
> +	 * to memory after the buffer is disabled, or SPE reporting an access
> +	 * not allowed event, we must disable sampling before draining the
> +	 * buffer.
> +	 */
>  	write_sysreg_s(0, SYS_PMSCR_EL1);
>  	isb();
>  
> @@ -661,16 +666,24 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>  	 */
>  	irq_work_run();
>  
> +	/*
> +	 * arm_spe_pmu_buf_get_fault_act() already drained, and PMBSR_EL1.S == 1
> +	 * means that StatisticalProfilingEnabled() == false. So now we can
> +	 * safely disable the buffer.
> +	 */
> +	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
> +	isb();
> +
> +	/* Status can be cleared now that PMBLIMITR_EL1.E == 0 */
> +	write_sysreg_s(0, SYS_PMBSR_EL1);

I've been trying to figure out if we need an ISB here to order clearing the
service bit before the PMBLIMITR_EL1 write in arm_spe_perf_aux_output_begin().

arm_spe_perf_aux_output_begin() is called only when the buffer is full, and this
rules out the event having the discard attribute (buffer management events are
not generated in discard mode).

If a new buffer cannot be allocated (perf_aux_output_begin() returns NULL), then
PMBLIMITR_EL1.E remains 0, so no need to order the two writes.

The only other path remaining in arm_spe_perf_aux_output_begin() is
reprogramming the buffer, in which case there's an ISB before the write to
PMBLIMITR_EL1.

In conclusion, I think it's correct not to have an ISB here.

> +
>  	switch (act) {
>  	case SPE_PMU_BUF_FAULT_ACT_FATAL:
>  		/*
> -		 * If a fatal exception occurred then leaving the profiling
> -		 * buffer enabled is a recipe waiting to happen. Since
> -		 * fatal faults don't always imply truncation, make sure
> -		 * that the profiling buffer is disabled explicitly before
> -		 * clearing the syndrome register.
> +		 * To complete the full disable sequence, also disable profiling
> +		 * at EL0 and EL1, we don't want to continue at all anymore.
>  		 */
> -		arm_spe_pmu_disable_and_drain_local();
> +		write_sysreg_s(0, SYS_PMSCR_EL1);

Before:

arm_spe_pmu_buf_get_fault_act:
  <drain buffer>
  ISB
arm_spe_pmu_disable_and_drain_local:
  PMSCR_EL1 = 0
  ISB		# disables profiling
  <drain buffer>
  PMBLIMITR_EL1=0
PMBSR_EL1=0
ERET		# synchronizes the two writes above

Now:

arm_spe_pmu_buf_get_fault_act:
  <drain buffer>
  ISB
PMBLIMITR_EL1=0
ISB 		# disables profiling
PMBSR_EL1=0
PMSCR_EL1=0
ERET		# synchronizes the two writes above

This looks correct to me.

Thanks,
Alex

>  		break;
>  	case SPE_PMU_BUF_FAULT_ACT_OK:
>  		/*
> @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>  		 * PMBPTR might be misaligned, but we'll burn that bridge
>  		 * when we get to it.
>  		 */
> -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
> +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
>  			arm_spe_perf_aux_output_begin(handle, event);
> -			isb();
> -		}
>  		break;
>  	case SPE_PMU_BUF_FAULT_ACT_SPURIOUS:
>  		/* We've seen you before, but GCC has the memory of a sieve. */
>  		break;
>  	}
>  
> -	/* The buffer pointers are now sane, so resume profiling. */
> -	write_sysreg_s(0, SYS_PMBSR_EL1);
>  	return IRQ_HANDLED;
>  }
>  
> 
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-09 10:08         ` Alexandru Elisei
@ 2025-07-14  8:58           ` Leo Yan
  2025-07-21 13:20             ` James Clark
  0 siblings, 1 reply; 21+ messages in thread
From: Leo Yan @ 2025-07-14  8:58 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: James Clark, Will Deacon, Mark Rutland, Catalin Marinas,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

Hi Alexandru,

On Wed, Jul 09, 2025 at 11:08:57AM +0100, Alexandru Elisei wrote:

[...]

> > > > >   	case SPE_PMU_BUF_FAULT_ACT_OK:
> > > > >   		/*
> > > > > @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
> > > > >   		 * PMBPTR might be misaligned, but we'll burn that bridge
> > > > >   		 * when we get to it.
> > > > >   		 */
> > > > > -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
> > > > > +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
> > > > >   			arm_spe_perf_aux_output_begin(handle, event);
> > > > > -			isb();
> > > > 
> > > > I am a bit suspecious we can remove this isb().
> > > > 
> > > > As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
> > > > after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
> > > > for this?
> > >
> > > Wasn't this isb() to separate the programming of the registers with the
> > > status register clear at the end of this function to enable profiling?
> > 
> > Enabling profiling buffer followed an isb() is not only for separating
> > other register programming.
> > 
> > As described in section D17.9, Synchronization and Statistical Profiling
> > in Arm ARM:
> > 
> >   "A Context Synchronization event guarantees that a direct write to a
> >   System register made by the PE in program order before the Context
> >   synchronization event are observable by indirect reads and indirect
> >   writes of the same System register made by a profiling operation
> >   relating to a sampled operation in program order after the Context
> >   synchronization event."
> > 
> > My understanding is: after the ARM SPE profiling is enabled, the
> > followed ISB is a Synchronization to make sure the system register
> > values are observed by SPE. And we cannot rely on ERET, especially if
> > we are tracing the kernel mode.
> 
> Thought about this some more.
> 
> Before:
> 
> arm_spe_pmu_buf_get_fault_act:
>   <drain buffer>
>   ISB
> arm_spe_perf_aux_output_begin:
>   PMBLIMITR_EL1.E = 1
> ISB
> PMBSR_EL1.S = 0
> ERET
> 
> Now:
> 
> PMBLIMITR_EL1 = 0
> ISB
> 
> PMBSR_EL1.S = 0
> arm_spe_perf_aux_output_begin:
>   ISB
>   PMBLIMITR_EL1.E = 1
> ERET
> 
> I don't see much of a difference between the two sequences - the point after
> which we can be sure that profiling is enabled remains the ERET from the
> exception return.  The only difference is that, before this change, the ERET
> synchronized clearing the PMBSR_EL1.S bit, now it synchronizes setting the
> PMBLIMITR_EL1.E bit.
> 
> Thoughts?

To make the discussion easier, I'll focus on the trace enabling flow
in this reply.

My understanding of a sane flow would be:

  arm_spe_pmu_irq_handler() {
    arm_spe_perf_aux_output_begin() {
          SYS_PMBPTR_EL1 = base;

          ISB // Synchronization between SPE register setting and
              // enabling profiling buffer.
          PMBLIMITR_EL1.E = 1;
    }
    ISB // Context synchronization event to ensure visibility to SPE
  }

  ... start trace ... (Bottom half, e.g., softirq, etc)

  ERET

In the current code, arm_spe_perf_aux_output_begin() is followed by an
ISB, which serves as a context synchronization event to ensure
visibility to the SPE. After that, it ensures that the trace unit will
function correctly.

I understand that the Software Usage PKLXF recommends using an ERET as
the synchronization point. However, between enabling the profiling
buffer and the ERET, the kernel might execute other operations (e.g.,
softirqs, tasklets, etc.).

Therefore, it seems to me that using ERET as the synchronization point
may be too late. This is why I think we should keep an ISB after
arm_spe_perf_aux_output_begin().

Thanks,
Leo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-14  8:58           ` Leo Yan
@ 2025-07-21 13:20             ` James Clark
  2025-07-21 15:21               ` Leo Yan
  0 siblings, 1 reply; 21+ messages in thread
From: James Clark @ 2025-07-21 13:20 UTC (permalink / raw)
  To: Leo Yan, Alexandru Elisei
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Anshuman Khandual,
	Rob Herring, Suzuki Poulose, Robin Murphy, linux-arm-kernel,
	linux-perf-users, linux-kernel



On 14/07/2025 9:58 am, Leo Yan wrote:
> Hi Alexandru,
> 
> On Wed, Jul 09, 2025 at 11:08:57AM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
>>>>>>    	case SPE_PMU_BUF_FAULT_ACT_OK:
>>>>>>    		/*
>>>>>> @@ -679,18 +692,14 @@ static irqreturn_t arm_spe_pmu_irq_handler(int irq, void *dev)
>>>>>>    		 * PMBPTR might be misaligned, but we'll burn that bridge
>>>>>>    		 * when we get to it.
>>>>>>    		 */
>>>>>> -		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)) {
>>>>>> +		if (!(handle->aux_flags & PERF_AUX_FLAG_TRUNCATED))
>>>>>>    			arm_spe_perf_aux_output_begin(handle, event);
>>>>>> -			isb();
>>>>>
>>>>> I am a bit suspecious we can remove this isb().
>>>>>
>>>>> As a reference to the software usage PKLXF in Arm ARM (DDI 0487 L.a),
>>>>> after enable TRBE trace unit, an ISB is mandatory. Maybe check a bit
>>>>> for this?
>>>>
>>>> Wasn't this isb() to separate the programming of the registers with the
>>>> status register clear at the end of this function to enable profiling?
>>>
>>> Enabling profiling buffer followed an isb() is not only for separating
>>> other register programming.
>>>
>>> As described in section D17.9, Synchronization and Statistical Profiling
>>> in Arm ARM:
>>>
>>>    "A Context Synchronization event guarantees that a direct write to a
>>>    System register made by the PE in program order before the Context
>>>    synchronization event are observable by indirect reads and indirect
>>>    writes of the same System register made by a profiling operation
>>>    relating to a sampled operation in program order after the Context
>>>    synchronization event."
>>>
>>> My understanding is: after the ARM SPE profiling is enabled, the
>>> followed ISB is a Synchronization to make sure the system register
>>> values are observed by SPE. And we cannot rely on ERET, especially if
>>> we are tracing the kernel mode.
>>
>> Thought about this some more.
>>
>> Before:
>>
>> arm_spe_pmu_buf_get_fault_act:
>>    <drain buffer>
>>    ISB
>> arm_spe_perf_aux_output_begin:
>>    PMBLIMITR_EL1.E = 1
>> ISB
>> PMBSR_EL1.S = 0
>> ERET
>>
>> Now:
>>
>> PMBLIMITR_EL1 = 0
>> ISB
>>
>> PMBSR_EL1.S = 0
>> arm_spe_perf_aux_output_begin:
>>    ISB
>>    PMBLIMITR_EL1.E = 1
>> ERET
>>
>> I don't see much of a difference between the two sequences - the point after
>> which we can be sure that profiling is enabled remains the ERET from the
>> exception return.  The only difference is that, before this change, the ERET
>> synchronized clearing the PMBSR_EL1.S bit, now it synchronizes setting the
>> PMBLIMITR_EL1.E bit.
>>
>> Thoughts?
> 
> To make the discussion easier, I'll focus on the trace enabling flow
> in this reply.
> 
> My understanding of a sane flow would be:
> 
>    arm_spe_pmu_irq_handler() {
>      arm_spe_perf_aux_output_begin() {
>            SYS_PMBPTR_EL1 = base;
> 
>            ISB // Synchronization between SPE register setting and
>                // enabling profiling buffer.
>            PMBLIMITR_EL1.E = 1;
>      }
>      ISB // Context synchronization event to ensure visibility to SPE
>    }
> 
>    ... start trace ... (Bottom half, e.g., softirq, etc)
> 
>    ERET
> 
> In the current code, arm_spe_perf_aux_output_begin() is followed by an
> ISB, which serves as a context synchronization event to ensure
> visibility to the SPE. After that, it ensures that the trace unit will
> function correctly.
> 

But I think Alex's point is that in the existing code the thing that 
finally enables trace (PMBSR_EL1.S = 0) isn't followed by an isb(), only 
an ERET. So the new flow isn't any different in that regard.

> I understand that the Software Usage PKLXF recommends using an ERET as
> the synchronization point. However, between enabling the profiling
> buffer and the ERET, the kernel might execute other operations (e.g.,
> softirqs, tasklets, etc.).

Isn't preemption disabled? Surely that's not possible. Even if something 
did run it wouldn't be anything that touches the SPE registers, and 
we're sure there's an isb() between setting up the buffer and the final 
PMBLIMITR_EL1.E = 1

> 
> Therefore, it seems to me that using ERET as the synchronization point
> may be too late. This is why I think we should keep an ISB after
> arm_spe_perf_aux_output_begin().
> 
> Thanks,
> Leo

Wouldn't that make the ERET too late even in the current code then? But 
I think we're agreeing there's no issue there?

James



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-21 13:20             ` James Clark
@ 2025-07-21 15:21               ` Leo Yan
  2025-07-22 14:46                 ` James Clark
  0 siblings, 1 reply; 21+ messages in thread
From: Leo Yan @ 2025-07-21 15:21 UTC (permalink / raw)
  To: James Clark
  Cc: Alexandru Elisei, Will Deacon, Mark Rutland, Catalin Marinas,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

On Mon, Jul 21, 2025 at 02:20:15PM +0100, James Clark wrote:

[...]

> > > Thought about this some more.
> > > 
> > > Before:
> > > 
> > > arm_spe_pmu_buf_get_fault_act:
> > >    <drain buffer>
> > >    ISB
> > > arm_spe_perf_aux_output_begin:
> > >    PMBLIMITR_EL1.E = 1
> > > ISB
> > > PMBSR_EL1.S = 0
> > > ERET
> > > 
> > > Now:
> > > 
> > > PMBLIMITR_EL1 = 0
> > > ISB
> > > 
> > > PMBSR_EL1.S = 0
> > > arm_spe_perf_aux_output_begin:
> > >    ISB
> > >    PMBLIMITR_EL1.E = 1
> > > ERET
> > > 
> > > I don't see much of a difference between the two sequences - the point after
> > > which we can be sure that profiling is enabled remains the ERET from the
> > > exception return.  The only difference is that, before this change, the ERET
> > > synchronized clearing the PMBSR_EL1.S bit, now it synchronizes setting the
> > > PMBLIMITR_EL1.E bit.
> > > 
> > > Thoughts?
> > 
> > To make the discussion easier, I'll focus on the trace enabling flow
> > in this reply.
> > 
> > My understanding of a sane flow would be:
> > 
> >    arm_spe_pmu_irq_handler() {
> >      arm_spe_perf_aux_output_begin() {
> >            SYS_PMBPTR_EL1 = base;
> > 
> >            ISB // Synchronization between SPE register setting and
> >                // enabling profiling buffer.
> >            PMBLIMITR_EL1.E = 1;
> >      }
> >      ISB // Context synchronization event to ensure visibility to SPE
> >    }
> > 
> >    ... start trace ... (Bottom half, e.g., softirq, etc)
> > 
> >    ERET
> > 
> > In the current code, arm_spe_perf_aux_output_begin() is followed by an
> > ISB, which serves as a context synchronization event to ensure
> > visibility to the SPE. After that, it ensures that the trace unit will
> > function correctly.
> > 
> 
> But I think Alex's point is that in the existing code the thing that finally
> enables trace (PMBSR_EL1.S = 0) isn't followed by an isb(), only an ERET. So
> the new flow isn't any different in that regard.

Thanks for explanation.

> > I understand that the Software Usage PKLXF recommends using an ERET as
> > the synchronization point. However, between enabling the profiling
> > buffer and the ERET, the kernel might execute other operations (e.g.,
> > softirqs, tasklets, etc.).
> 
> Isn't preemption disabled? Surely that's not possible. Even if something did
> run it wouldn't be anything that touches the SPE registers, and we're sure
> there's an isb() between setting up the buffer and the final PMBLIMITR_EL1.E
> = 1

Yes, bottom half runs in preemtion disabled state. See:

  el1_interrupt() -> __el1_irq() -> irq_exit_rcu() -> invoke_softirq()

In some cases, sotfirq and tasklet might even cause long latency (I
believe it can be milliseconds or even longer), this is why ftrace
"irqsoff" tracer is used to profile latency caused by irq off state.

Bottom half does not modify SPE registsers, but it can be a long road
between enabling SPE trace and ERET.

> > Therefore, it seems to me that using ERET as the synchronization point
> > may be too late. This is why I think we should keep an ISB after
> > arm_spe_perf_aux_output_begin().
> 
> Wouldn't that make the ERET too late even in the current code then? But I
> think we're agreeing there's no issue there?

I would say ERET is too late for current code as well.

Thanks,
Leo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-21 15:21               ` Leo Yan
@ 2025-07-22 14:46                 ` James Clark
  2025-07-30  9:50                   ` Alexandru Elisei
  0 siblings, 1 reply; 21+ messages in thread
From: James Clark @ 2025-07-22 14:46 UTC (permalink / raw)
  To: Leo Yan
  Cc: Alexandru Elisei, Will Deacon, Mark Rutland, Catalin Marinas,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel



On 21/07/2025 4:21 pm, Leo Yan wrote:
> On Mon, Jul 21, 2025 at 02:20:15PM +0100, James Clark wrote:
> 
> [...]
> 
>>>> Thought about this some more.
>>>>
>>>> Before:
>>>>
>>>> arm_spe_pmu_buf_get_fault_act:
>>>>     <drain buffer>
>>>>     ISB
>>>> arm_spe_perf_aux_output_begin:
>>>>     PMBLIMITR_EL1.E = 1
>>>> ISB
>>>> PMBSR_EL1.S = 0
>>>> ERET
>>>>
>>>> Now:
>>>>
>>>> PMBLIMITR_EL1 = 0
>>>> ISB
>>>>
>>>> PMBSR_EL1.S = 0
>>>> arm_spe_perf_aux_output_begin:
>>>>     ISB
>>>>     PMBLIMITR_EL1.E = 1
>>>> ERET
>>>>
>>>> I don't see much of a difference between the two sequences - the point after
>>>> which we can be sure that profiling is enabled remains the ERET from the
>>>> exception return.  The only difference is that, before this change, the ERET
>>>> synchronized clearing the PMBSR_EL1.S bit, now it synchronizes setting the
>>>> PMBLIMITR_EL1.E bit.
>>>>
>>>> Thoughts?
>>>
>>> To make the discussion easier, I'll focus on the trace enabling flow
>>> in this reply.
>>>
>>> My understanding of a sane flow would be:
>>>
>>>     arm_spe_pmu_irq_handler() {
>>>       arm_spe_perf_aux_output_begin() {
>>>             SYS_PMBPTR_EL1 = base;
>>>
>>>             ISB // Synchronization between SPE register setting and
>>>                 // enabling profiling buffer.
>>>             PMBLIMITR_EL1.E = 1;
>>>       }
>>>       ISB // Context synchronization event to ensure visibility to SPE
>>>     }
>>>
>>>     ... start trace ... (Bottom half, e.g., softirq, etc)
>>>
>>>     ERET
>>>
>>> In the current code, arm_spe_perf_aux_output_begin() is followed by an
>>> ISB, which serves as a context synchronization event to ensure
>>> visibility to the SPE. After that, it ensures that the trace unit will
>>> function correctly.
>>>
>>
>> But I think Alex's point is that in the existing code the thing that finally
>> enables trace (PMBSR_EL1.S = 0) isn't followed by an isb(), only an ERET. So
>> the new flow isn't any different in that regard.
> 
> Thanks for explanation.
> 
>>> I understand that the Software Usage PKLXF recommends using an ERET as
>>> the synchronization point. However, between enabling the profiling
>>> buffer and the ERET, the kernel might execute other operations (e.g.,
>>> softirqs, tasklets, etc.).
>>
>> Isn't preemption disabled? Surely that's not possible. Even if something did
>> run it wouldn't be anything that touches the SPE registers, and we're sure
>> there's an isb() between setting up the buffer and the final PMBLIMITR_EL1.E
>> = 1
> 
> Yes, bottom half runs in preemtion disabled state. See:
> 
>    el1_interrupt() -> __el1_irq() -> irq_exit_rcu() -> invoke_softirq()
> 
> In some cases, sotfirq and tasklet might even cause long latency (I
> believe it can be milliseconds or even longer), this is why ftrace
> "irqsoff" tracer is used to profile latency caused by irq off state.
> 
> Bottom half does not modify SPE registsers, but it can be a long road
> between enabling SPE trace and ERET.
> 
>>> Therefore, it seems to me that using ERET as the synchronization point
>>> may be too late. This is why I think we should keep an ISB after
>>> arm_spe_perf_aux_output_begin().
>>
>> Wouldn't that make the ERET too late even in the current code then? But I
>> think we're agreeing there's no issue there?
> 
> I would say ERET is too late for current code as well.
> 
> Thanks,
> Leo


Ok I get it now. The point is that there is stuff in between the return 
in the SPE IRQ handler and the actual ERET. If someone is interested in 
sampling the kernel then they might miss sampling a small amount of that.

It's not a correctness thing, just reducing potential gaps in the 
samples. I can add another commit to add it, but it doesn't look like it 
would be a fixes commit either, just an improvement. And the same issue 
applies to the existing code too.

James



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
  2025-07-22 14:46                 ` James Clark
@ 2025-07-30  9:50                   ` Alexandru Elisei
  0 siblings, 0 replies; 21+ messages in thread
From: Alexandru Elisei @ 2025-07-30  9:50 UTC (permalink / raw)
  To: James Clark
  Cc: Leo Yan, Will Deacon, Mark Rutland, Catalin Marinas,
	Anshuman Khandual, Rob Herring, Suzuki Poulose, Robin Murphy,
	linux-arm-kernel, linux-perf-users, linux-kernel

Hi Leo, James,

On Tue, Jul 22, 2025 at 03:46:11PM +0100, James Clark wrote:
> 
> 
> On 21/07/2025 4:21 pm, Leo Yan wrote:
> > On Mon, Jul 21, 2025 at 02:20:15PM +0100, James Clark wrote:
> > 
> > [...]
> > 
> > > > > Thought about this some more.
> > > > > 
> > > > > Before:
> > > > > 
> > > > > arm_spe_pmu_buf_get_fault_act:
> > > > >     <drain buffer>
> > > > >     ISB
> > > > > arm_spe_perf_aux_output_begin:
> > > > >     PMBLIMITR_EL1.E = 1
> > > > > ISB
> > > > > PMBSR_EL1.S = 0
> > > > > ERET
> > > > > 
> > > > > Now:
> > > > > 
> > > > > PMBLIMITR_EL1 = 0
> > > > > ISB
> > > > > 
> > > > > PMBSR_EL1.S = 0
> > > > > arm_spe_perf_aux_output_begin:
> > > > >     ISB
> > > > >     PMBLIMITR_EL1.E = 1
> > > > > ERET
> > > > > 
> > > > > I don't see much of a difference between the two sequences - the point after
> > > > > which we can be sure that profiling is enabled remains the ERET from the
> > > > > exception return.  The only difference is that, before this change, the ERET
> > > > > synchronized clearing the PMBSR_EL1.S bit, now it synchronizes setting the
> > > > > PMBLIMITR_EL1.E bit.
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > To make the discussion easier, I'll focus on the trace enabling flow
> > > > in this reply.
> > > > 
> > > > My understanding of a sane flow would be:
> > > > 
> > > >     arm_spe_pmu_irq_handler() {
> > > >       arm_spe_perf_aux_output_begin() {
> > > >             SYS_PMBPTR_EL1 = base;
> > > > 
> > > >             ISB // Synchronization between SPE register setting and
> > > >                 // enabling profiling buffer.
> > > >             PMBLIMITR_EL1.E = 1;
> > > >       }
> > > >       ISB // Context synchronization event to ensure visibility to SPE
> > > >     }
> > > > 
> > > >     ... start trace ... (Bottom half, e.g., softirq, etc)
> > > > 
> > > >     ERET
> > > > 
> > > > In the current code, arm_spe_perf_aux_output_begin() is followed by an
> > > > ISB, which serves as a context synchronization event to ensure
> > > > visibility to the SPE. After that, it ensures that the trace unit will
> > > > function correctly.
> > > > 
> > > 
> > > But I think Alex's point is that in the existing code the thing that finally
> > > enables trace (PMBSR_EL1.S = 0) isn't followed by an isb(), only an ERET. So
> > > the new flow isn't any different in that regard.
> > 
> > Thanks for explanation.
> > 
> > > > I understand that the Software Usage PKLXF recommends using an ERET as
> > > > the synchronization point. However, between enabling the profiling
> > > > buffer and the ERET, the kernel might execute other operations (e.g.,
> > > > softirqs, tasklets, etc.).
> > > 
> > > Isn't preemption disabled? Surely that's not possible. Even if something did
> > > run it wouldn't be anything that touches the SPE registers, and we're sure
> > > there's an isb() between setting up the buffer and the final PMBLIMITR_EL1.E
> > > = 1
> > 
> > Yes, bottom half runs in preemtion disabled state. See:
> > 
> >    el1_interrupt() -> __el1_irq() -> irq_exit_rcu() -> invoke_softirq()
> > 
> > In some cases, sotfirq and tasklet might even cause long latency (I
> > believe it can be milliseconds or even longer), this is why ftrace
> > "irqsoff" tracer is used to profile latency caused by irq off state.
> > 
> > Bottom half does not modify SPE registsers, but it can be a long road
> > between enabling SPE trace and ERET.
> > 
> > > > Therefore, it seems to me that using ERET as the synchronization point
> > > > may be too late. This is why I think we should keep an ISB after
> > > > arm_spe_perf_aux_output_begin().
> > > 
> > > Wouldn't that make the ERET too late even in the current code then? But I
> > > think we're agreeing there's no issue there?
> > 
> > I would say ERET is too late for current code as well.
> > 
> > Thanks,
> > Leo

Thanks for the explanation and the analysis. I think we were looking at the
patch from different point of views: I was interested in not changing the
current behaviour, you were saying that the existing behaviour can be improved
upon.

> Ok I get it now. The point is that there is stuff in between the return in
> the SPE IRQ handler and the actual ERET. If someone is interested in
> sampling the kernel then they might miss sampling a small amount of that.
> 
> It's not a correctness thing, just reducing potential gaps in the samples. I
> can add another commit to add it, but it doesn't look like it would be a
> fixes commit either, just an improvement. And the same issue applies to the
> existing code too.

I agree here, this looks on an improvement on existing behaviour. I would keep
it as a patch separate from this series, as it's not a fix, and it's not related
to to the rules from DEN0154.

Thanks,
Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface
  2025-07-01 15:31 [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface James Clark
                   ` (2 preceding siblings ...)
  2025-07-01 15:31 ` [PATCH 3/3] perf: arm_spe: Add support for SPE VM interface James Clark
@ 2025-08-01 13:28 ` Alexandru Elisei
  2025-08-04 16:00   ` James Clark
  3 siblings, 1 reply; 21+ messages in thread
From: Alexandru Elisei @ 2025-08-01 13:28 UTC (permalink / raw)
  To: James Clark
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Anshuman Khandual,
	Rob Herring, Suzuki Poulose, Robin Murphy, alexandru.elisei,
	linux-arm-kernel, linux-perf-users, linux-kernel

Hi,

On Tue, Jul 01, 2025 at 04:31:56PM +0100, James Clark wrote:
> SPE can be used from within a guest as long as the driver adheres to the
> new VM interface spec [1]. Because the driver should behave correctly
> whether it's running in a guest or not, the first patches are marked as
> a fix. Furthermore, in future versions of the architecture the PE will
> be allowed to behave in the same way.
> 
> The last patch adds new behavior to make it easier for guests to be
> able to reserve large buffers. It's not strictly necessary, so it's not
> marked as a fix.

I had a look at the patches, and they all look ok to me, so for the series:

Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>

I also tested the series by hacking SPE virtualization support in KVM:

- without these changes, the SPE driver gets into an infinite loop because it
  clears PMBSR_EL1.S before clearing PMBLIMITR_EL.E, and the hypervisor is
  allowed to ignore the write to PMBSR_EL1.

- with these changes, that doesn't happen.

- ran perf for about a day in a loop in a virtual machine and didn't notice
  anything out of the ordinary.

- ran perf for about a day in a loop on baremetal and similary everything looked
  alright.

- checked that the SPE driver correctly decodes the maximum buffer size for
  sizes 4M, 2M (2M is right at the boundary between the two encoding schemes)
  and 1M; that's also correctly reflected in
  /sys/devices/platform/<spe>/arm_spe_0/caps/max_buffer_size.

- checked that perf is not allowed to use a buffer larger than the maximum.

- checked that the SPE driver correctly detects a buffer size management event.

So:

Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>

While testing I noticed two things:

1. When perf tries to use a buffer larger than the maximum, the error is EINVAL
(22):

# cat /sys/devices/platform/spe/arm_spe_0/caps/max_buff_size
4194304
# perf record -ae arm_spe// -m,16M -- sleep 10
failed to mmap with 22 (Invalid argument)

(used 16M as the buffer size because what the driver ends up programming is half
that).

I would have expected to get back ENOMEM (12), that seems less ambiguous to me.
I had to hack the driver to print an error message to dmesg when the max buffer
size is exceed to make sure that's why I was seeing the error message in perf,
and it wasn't because of something else. I get that that's because .setup_aux()
can only return NULL on error, but feels like there's room for improvement here.

2. A hypervisor is allowed to inject a buffer size event even though the buffer
set by the guest is smaller than the maximum advertised. For example, this can
happen if there isn't enough memory to pin the buffer, or if the limit on pinned
memory is exceeded in the hypervisor (implementation specific behaviour, not
mandated in DEN0154, of course).

In this situation, when the SPE driver gets a buffer size management event
injected by the hypervisor, there is no way for the driver to communicate it to
the perf instance, and the profiled process continues executing even though
profiling has stopped.

That's not different from what happens today with buffer management events, but
unlike the other events, which aren't under the control of userspace, the buffer
size event is potentially recoverable if userspace restarts perf with a smaller
buffer.

Do you think there's something that can be done to improve this situation?

Thanks,
Alex

> 
> [1]: https://developer.arm.com/documentation/den0154/latest/
> 
> Signed-off-by: James Clark <james.clark@linaro.org>
> ---
> James Clark (3):
>       perf: arm_spe: Add barrier before enabling profiling buffer
>       perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1
>       perf: arm_spe: Add support for SPE VM interface
> 
>  arch/arm64/include/asm/sysreg.h |  1 +
>  arch/arm64/tools/sysreg         |  6 ++++-
>  drivers/perf/arm_spe_pmu.c      | 60 ++++++++++++++++++++++++++++++++---------
>  3 files changed, 54 insertions(+), 13 deletions(-)
> ---
> base-commit: d0b3b7b22dfa1f4b515fd3a295b3fd958f9e81af
> change-id: 20250609-james-spe-vm-interface-2bb41e238072
> 
> Best regards,
> -- 
> James Clark <james.clark@linaro.org>
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface
  2025-08-01 13:28 ` [PATCH 0/3] " Alexandru Elisei
@ 2025-08-04 16:00   ` James Clark
  2025-08-04 21:49     ` Suzuki K Poulose
  0 siblings, 1 reply; 21+ messages in thread
From: James Clark @ 2025-08-04 16:00 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Anshuman Khandual,
	Rob Herring, Suzuki Poulose, Robin Murphy, linux-arm-kernel,
	linux-perf-users, linux-kernel, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Namhyung Kim



On 01/08/2025 2:28 pm, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Jul 01, 2025 at 04:31:56PM +0100, James Clark wrote:
>> SPE can be used from within a guest as long as the driver adheres to the
>> new VM interface spec [1]. Because the driver should behave correctly
>> whether it's running in a guest or not, the first patches are marked as
>> a fix. Furthermore, in future versions of the architecture the PE will
>> be allowed to behave in the same way.
>>
>> The last patch adds new behavior to make it easier for guests to be
>> able to reserve large buffers. It's not strictly necessary, so it's not
>> marked as a fix.
> 
> I had a look at the patches, and they all look ok to me, so for the series:
> 
> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
> 
> I also tested the series by hacking SPE virtualization support in KVM:
> 
> - without these changes, the SPE driver gets into an infinite loop because it
>    clears PMBSR_EL1.S before clearing PMBLIMITR_EL.E, and the hypervisor is
>    allowed to ignore the write to PMBSR_EL1.
> 
> - with these changes, that doesn't happen.
> 
> - ran perf for about a day in a loop in a virtual machine and didn't notice
>    anything out of the ordinary.
> 
> - ran perf for about a day in a loop on baremetal and similary everything looked
>    alright.
> 
> - checked that the SPE driver correctly decodes the maximum buffer size for
>    sizes 4M, 2M (2M is right at the boundary between the two encoding schemes)
>    and 1M; that's also correctly reflected in
>    /sys/devices/platform/<spe>/arm_spe_0/caps/max_buffer_size.
> 
> - checked that perf is not allowed to use a buffer larger than the maximum.
> 
> - checked that the SPE driver correctly detects a buffer size management event.
> 
> So:
> 
> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
> 
> While testing I noticed two things:
> 
> 1. When perf tries to use a buffer larger than the maximum, the error is EINVAL
> (22):
> 
> # cat /sys/devices/platform/spe/arm_spe_0/caps/max_buff_size
> 4194304
> # perf record -ae arm_spe// -m,16M -- sleep 10
> failed to mmap with 22 (Invalid argument)
> 
> (used 16M as the buffer size because what the driver ends up programming is half
> that).
> 
> I would have expected to get back ENOMEM (12), that seems less ambiguous to me.
> I had to hack the driver to print an error message to dmesg when the max buffer
> size is exceed to make sure that's why I was seeing the error message in perf,
> and it wasn't because of something else. I get that that's because .setup_aux()
> can only return NULL on error, but feels like there's room for improvement here.
> 

We could add an error code, rb_alloc_aux() already returns one and that 
calls setup_aux(). But the scenarios would be either EINVAL or ENOMEM 
and wouldn't give the user the exact reason ("need > 2 pages", "need 
even number of pages", etc). So I'm not sure it would be enough of an 
improvement over returning NULL to be worth it.

However I will add a warning into Perf if the user asks for more than 
caps/max_buffer_size. That would be a useful message and Perf can do it 
itself so it doesn't need to be in the driver changes.

> 2. A hypervisor is allowed to inject a buffer size event even though the buffer
> set by the guest is smaller than the maximum advertised. For example, this can
> happen if there isn't enough memory to pin the buffer, or if the limit on pinned
> memory is exceeded in the hypervisor (implementation specific behaviour, not
> mandated in DEN0154, of course).
> 
> In this situation, when the SPE driver gets a buffer size management event
> injected by the hypervisor, there is no way for the driver to communicate it to
> the perf instance, and the profiled process continues executing even though
> profiling has stopped.
> 
> That's not different from what happens today with buffer management events, but
> unlike the other events, which aren't under the control of userspace, the buffer
> size event is potentially recoverable if userspace restarts perf with a smaller
> buffer.
> 
> Do you think there's something that can be done to improve this situation?
> 
> Thanks,
> Alex
> 

It doesn't look like there's currently anything that can stop an event 
or signal to Perf that the event has gone bad.

We could add something like "__u32 error" to struct 
perf_event_mmap_page. But I'm not sure what you'd do with it. If Perf is 
the parent of the process you wouldn't want to kill it in case anything 
bad happens. So you're left with leaving it running anyway. If it's just 
an error message that you want then there's already one in dmesg for 
buffer management errors, and that string is a lot better than a single 
code. Unless these new codes were detailed PMU specific ones? Actually 
it's a whole page so why not make it a string...

It's not a case of the samples ending randomly somewhere though, you'll 
either get all of them or none of them. So it will be quite obvious to 
the user that something has gone wrong. Secondly I think the scenario of 
not being able to pin memory when asking for less than the limit would 
be very rare. It's probably fine to leave it like this for now and we 
can always add something later, maybe if people start to run into it for 
real.

James




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface
  2025-08-04 16:00   ` James Clark
@ 2025-08-04 21:49     ` Suzuki K Poulose
  0 siblings, 0 replies; 21+ messages in thread
From: Suzuki K Poulose @ 2025-08-04 21:49 UTC (permalink / raw)
  To: James Clark, Alexandru Elisei
  Cc: Will Deacon, Mark Rutland, Catalin Marinas, Anshuman Khandual,
	Rob Herring, Robin Murphy, linux-arm-kernel, linux-perf-users,
	linux-kernel, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim

On 04/08/2025 17:00, James Clark wrote:
> 
> 
> On 01/08/2025 2:28 pm, Alexandru Elisei wrote:
>> Hi,
>>
>> On Tue, Jul 01, 2025 at 04:31:56PM +0100, James Clark wrote:
>>> SPE can be used from within a guest as long as the driver adheres to the
>>> new VM interface spec [1]. Because the driver should behave correctly
>>> whether it's running in a guest or not, the first patches are marked as
>>> a fix. Furthermore, in future versions of the architecture the PE will
>>> be allowed to behave in the same way.
>>>
>>> The last patch adds new behavior to make it easier for guests to be
>>> able to reserve large buffers. It's not strictly necessary, so it's not
>>> marked as a fix.
>>
>> I had a look at the patches, and they all look ok to me, so for the 
>> series:
>>
>> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
>>
>> I also tested the series by hacking SPE virtualization support in KVM:
>>
>> - without these changes, the SPE driver gets into an infinite loop 
>> because it
>>    clears PMBSR_EL1.S before clearing PMBLIMITR_EL.E, and the 
>> hypervisor is
>>    allowed to ignore the write to PMBSR_EL1.
>>
>> - with these changes, that doesn't happen.
>>
>> - ran perf for about a day in a loop in a virtual machine and didn't 
>> notice
>>    anything out of the ordinary.
>>
>> - ran perf for about a day in a loop on baremetal and similary 
>> everything looked
>>    alright.
>>
>> - checked that the SPE driver correctly decodes the maximum buffer 
>> size for
>>    sizes 4M, 2M (2M is right at the boundary between the two encoding 
>> schemes)
>>    and 1M; that's also correctly reflected in
>>    /sys/devices/platform/<spe>/arm_spe_0/caps/max_buffer_size.
>>
>> - checked that perf is not allowed to use a buffer larger than the 
>> maximum.
>>
>> - checked that the SPE driver correctly detects a buffer size 
>> management event.
>>
>> So:
>>
>> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
>>
>> While testing I noticed two things:
>>
>> 1. When perf tries to use a buffer larger than the maximum, the error 
>> is EINVAL
>> (22):
>>
>> # cat /sys/devices/platform/spe/arm_spe_0/caps/max_buff_size
>> 4194304
>> # perf record -ae arm_spe// -m,16M -- sleep 10
>> failed to mmap with 22 (Invalid argument)
>>
>> (used 16M as the buffer size because what the driver ends up 
>> programming is half
>> that).
>>
>> I would have expected to get back ENOMEM (12), that seems less 
>> ambiguous to me.
>> I had to hack the driver to print an error message to dmesg when the 
>> max buffer
>> size is exceed to make sure that's why I was seeing the error message 
>> in perf,
>> and it wasn't because of something else. I get that that's 
>> because .setup_aux()
>> can only return NULL on error, but feels like there's room for 
>> improvement here.
>>
> 
> We could add an error code, rb_alloc_aux() already returns one and that 
> calls setup_aux(). But the scenarios would be either EINVAL or ENOMEM 
> and wouldn't give the user the exact reason ("need > 2 pages", "need 
> even number of pages", etc). So I'm not sure it would be enough of an 
> improvement over returning NULL to be worth it.
> 
> However I will add a warning into Perf if the user asks for more than 
> caps/max_buffer_size. That would be a useful message and Perf can do it 
> itself so it doesn't need to be in the driver changes.
> 
>> 2. A hypervisor is allowed to inject a buffer size event even though 
>> the buffer
>> set by the guest is smaller than the maximum advertised. For example, 
>> this can
>> happen if there isn't enough memory to pin the buffer, or if the limit 
>> on pinned
>> memory is exceeded in the hypervisor (implementation specific 
>> behaviour, not
>> mandated in DEN0154, of course).
>>
>> In this situation, when the SPE driver gets a buffer size management 
>> event
>> injected by the hypervisor, there is no way for the driver to 
>> communicate it to
>> the perf instance, and the profiled process continues executing even 
>> though
>> profiling has stopped.
>>
>> That's not different from what happens today with buffer management 
>> events, but
>> unlike the other events, which aren't under the control of userspace, 
>> the buffer
>> size event is potentially recoverable if userspace restarts perf with 
>> a smaller
>> buffer.
>>
>> Do you think there's something that can be done to improve this 
>> situation?
>>
>> Thanks,
>> Alex
>>
> 
> It doesn't look like there's currently anything that can stop an event 
> or signal to Perf that the event has gone bad.
> 
> We could add something like "__u32 error" to struct 
> perf_event_mmap_page. But I'm not sure what you'd do with it. If Perf is 
> the parent of the process you wouldn't want to kill it in case anything 
> bad happens. So you're left with leaving it running anyway. If it's just 
> an error message that you want then there's already one in dmesg for 
> buffer management errors, and that string is a lot better than a single 
> code. Unless these new codes were detailed PMU specific ones? Actually 
> it's a whole page so why not make it a string...
> 
> It's not a case of the samples ending randomly somewhere though, you'll 
> either get all of them or none of them. So it will be quite obvious to 
> the user that something has gone wrong. Secondly I think the scenario of 
> not being able to pin memory when asking for less than the limit would 
> be very rare. It's probably fine to leave it like this for now and we 
> can always add something later, maybe if people start to run into it for 
> real.

Do we get EMPTY AUX records in this case where there was no profiling
data ? If so, why not convey the error via AUX record header flags ?

Suzuk

> 
> James
> 
> 



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-08-04 21:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-01 15:31 [PATCH 0/3] perf: arm_spe: Add support for SPE VM interface James Clark
2025-07-01 15:31 ` [PATCH 1/3] perf: arm_spe: Add barrier before enabling profiling buffer James Clark
2025-07-04 14:04   ` Leo Yan
2025-07-07 11:22     ` James Clark
2025-07-08 14:40   ` Alexandru Elisei
2025-07-01 15:31 ` [PATCH 2/3] perf: arm_spe: Disable buffer before writing to PMBPTR_EL1 or PMBSR_EL1 James Clark
2025-07-04 15:50   ` Leo Yan
2025-07-07 11:39     ` James Clark
2025-07-07 15:37       ` Leo Yan
2025-07-08 14:45         ` Alexandru Elisei
2025-07-09 10:08         ` Alexandru Elisei
2025-07-14  8:58           ` Leo Yan
2025-07-21 13:20             ` James Clark
2025-07-21 15:21               ` Leo Yan
2025-07-22 14:46                 ` James Clark
2025-07-30  9:50                   ` Alexandru Elisei
2025-07-09 10:09   ` Alexandru Elisei
2025-07-01 15:31 ` [PATCH 3/3] perf: arm_spe: Add support for SPE VM interface James Clark
2025-08-01 13:28 ` [PATCH 0/3] " Alexandru Elisei
2025-08-04 16:00   ` James Clark
2025-08-04 21:49     ` Suzuki K Poulose

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).