[PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer

* [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer
@ 2020-10-06 15:05 ` Alexandru Elisei
  0 siblings, 0 replies; 18+ messages in thread
From: Alexandru Elisei @ 2020-10-06 15:05 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, linux-kernel; +Cc: maz, catalin.marinas, will

From ARM DDI 0487F.b, page D9-2807:

"Although the Statistical Profiling Extension acts as another observer in
the system, for determining the Shareability domain of the DSB
instructions, the writes of sample records are treated as coming from the
PE that is being profiled."

Similarly, on page D9-2801:

"The memory type and attributes that are used for a write by the
Statistical Profiling Extension to the Profiling Buffer is taken from the
translation table entries for the virtual address being written to. That
is:
- The writes are treated as coming from an observer that is coherent with
  all observers in the Shareability domain that is defined by the
  translation tables."

All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
writes to the profiling buffer have completed.

Fixes: d5d9696b0380 ("drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension")
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
Found by code inspection.

All the places where the buffer was drained were found by using the command
"grep -r psb_csync".

 arch/arm64/kvm/hyp/nvhe/debug-sr.c | 2 +-
 drivers/perf/arm_spe_pmu.c         | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 91a711aa8382..e05a08c5ad1f 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -43,7 +43,7 @@ static void __debug_save_spe(u64 *pmscr_el1)
 
 	/* Now drain all buffered data to memory */
 	psb_csync();
-	dsb(nsh);
+	dsb(ish);
 }
 
 static void __debug_restore_spe(u64 pmscr_el1)
diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index cc00915ad6d1..402892caef34 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -525,7 +525,7 @@ static void arm_spe_pmu_disable_and_drain_local(void)
 
 	/* Drain any buffered data */
 	psb_csync();
-	dsb(nsh);
+	dsb(ish);
 
 	/* Disable the profiling buffer */
 	write_sysreg_s(0, SYS_PMBLIMITR_EL1);
@@ -545,7 +545,7 @@ arm_spe_pmu_buf_get_fault_act(struct perf_output_handle *handle)
 	 * aborts have been resolved.
 	 */
 	psb_csync();
-	dsb(nsh);
+	dsb(ish);
 
 	/* Ensure hardware updates to PMBPTR_EL1 are visible */
 	isb();
-- 
2.28.0
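
For readers unfamiliar with the helpers touched by the diff, the sequence
in arm_spe_pmu_buf_get_fault_act() can be sketched as below. This is a
minimal illustration, not the driver code: the macro bodies are written
from memory of the arm64 barrier header and should be treated as an
assumption, the non-AArch64 fallback is only there so the file builds
anywhere, and spe_drain_sketch() and enum drain_scope are hypothetical
names that do not exist in the kernel.

```c
#include <assert.h>

/*
 * Assumed expansions of the arm64 helpers. On other architectures they
 * degrade to a plain compiler barrier so the sketch still compiles.
 */
#if defined(__aarch64__)
#define psb_csync()	__asm__ __volatile__("hint #17" : : : "memory")
#define dsb(opt)	__asm__ __volatile__("dsb " #opt : : : "memory")
#define isb()		__asm__ __volatile__("isb" : : : "memory")
#else
#define psb_csync()	__asm__ __volatile__("" : : : "memory")
#define dsb(opt)	__asm__ __volatile__("" : : : "memory")
#define isb()		__asm__ __volatile__("" : : : "memory")
#endif

/* Hypothetical selector for the shareability domain of the DSB. */
enum drain_scope { DRAIN_LOCAL, DRAIN_INNER_SHAREABLE };

/* Issue the drain sequence and return how many barriers were executed. */
static int spe_drain_sketch(enum drain_scope scope)
{
	int barriers = 0;

	/* Ask the profiling unit to push buffered sample data out. */
	psb_csync();
	barriers++;

	/* Wait for those writes to complete in the chosen domain. */
	if (scope == DRAIN_INNER_SHAREABLE)
		dsb(ish);	/* behaviour after the patch */
	else
		dsb(nsh);	/* behaviour before the patch */
	barriers++;

	/* Make hardware updates to system registers visible. */
	isb();
	barriers++;

	return barriers;
}

/* Both variants issue the same number of barriers; only the scope differs. */
static void spe_drain_example(void)
{
	assert(spe_drain_sketch(DRAIN_LOCAL) == 3);
	assert(spe_drain_sketch(DRAIN_INNER_SHAREABLE) == 3);
}
```

The only difference between the two variants is the option passed to
DSB, which is exactly what the three hunks above change.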


* Re: [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer
  2020-10-06 15:05 ` Alexandru Elisei
  (?)
@ 2020-10-06 15:32   ` Marc Zyngier
  -1 siblings, 0 replies; 18+ messages in thread
From: Marc Zyngier @ 2020-10-06 15:32 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: catalin.marinas, linux-kernel, linux-arm-kernel, will, kvmarm

Hi Alex,

On Tue, 06 Oct 2020 16:05:20 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> From ARM DDI 0487F.b, page D9-2807:
> 
> "Although the Statistical Profiling Extension acts as another observer in
> the system, for determining the Shareability domain of the DSB
> instructions, the writes of sample records are treated as coming from the
> PE that is being profiled."
> 
> Similarly, on page D9-2801:
> 
> "The memory type and attributes that are used for a write by the
> Statistical Profiling Extension to the Profiling Buffer is taken from the
> translation table entries for the virtual address being written to. That
> is:
> - The writes are treated as coming from an observer that is coherent with
>   all observers in the Shareability domain that is defined by the
>   translation tables."
> 
> All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
> writes to the profiling buffer have completed.

I'm a bit sceptical of this change. The SPE writes are per-CPU, and
all we are trying to ensure is that the CPU we are running on has
drained its own queue of accesses.

The accesses being made within the IS domain doesn't invalidate the
fact that they are still per-CPU, because "the writes of sample
records are treated as coming from the PE that is being profiled.".

So why should we have an IS-wide synchronisation for accesses that are
purely local?

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer
  2020-10-06 15:32   ` Marc Zyngier
  (?)
@ 2020-10-06 16:13     ` Alexandru Elisei
  -1 siblings, 0 replies; 18+ messages in thread
From: Alexandru Elisei @ 2020-10-06 16:13 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: catalin.marinas, linux-kernel, linux-arm-kernel, will, kvmarm

Hi Marc,

Thank you for having a look at the patch!

On 10/6/20 4:32 PM, Marc Zyngier wrote:
> Hi Alex,
>
> On Tue, 06 Oct 2020 16:05:20 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> From ARM DDI 0487F.b, page D9-2807:
>>
>> "Although the Statistical Profiling Extension acts as another observer in
>> the system, for determining the Shareability domain of the DSB
>> instructions, the writes of sample records are treated as coming from the
>> PE that is being profiled."
>>
>> Similarly, on page D9-2801:
>>
>> "The memory type and attributes that are used for a write by the
>> Statistical Profiling Extension to the Profiling Buffer is taken from the
>> translation table entries for the virtual address being written to. That
>> is:
>> - The writes are treated as coming from an observer that is coherent with
>>   all observers in the Shareability domain that is defined by the
>>   translation tables."
>>
>> All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
>> writes to the profiling buffer have completed.
> I'm a bit sceptical of this change. The SPE writes are per-CPU, and
> all we are trying to ensure is that the CPU we are running on has
> drained its own queue of accesses.
>
> The accesses being made within the IS domain doesn't invalidate the
> fact that they are still per-CPU, because "the writes of sample
> records are treated as coming from the PE that is being profiled.".
>
> So why should we have an IS-wide synchronisation for accesses that are
> purely local?

I think I might have misunderstood how perf spe works. Below is my original train
of thought.

In the buffer management event interrupt we drain the buffer, and if the buffer is
full, we call arm_spe_perf_aux_output_end() -> perf_aux_output_end(). The comment
for perf_aux_output_end() says "Commit the data written by hardware into the ring
buffer by adjusting aux_head and posting a PERF_RECORD_AUX into the perf buffer.
It is the pmu driver's responsibility to observe ordering rules of the hardware,
so that all the data is externally visible before this is called." My conclusion
was that after we drain the buffer, the data must be visible to all CPUs.

From the definition of non-shareable memory (ARM DDI 0487F.b, page B2-155):

"For Normal memory locations, the Non-shareable attribute identifies Normal memory
that is likely to be accessed only by a single PE. A location in Normal memory
with the Non-shareable attribute does not require the hardware to make data
accesses by different observers coherent, unless the memory is Non-cacheable."

Linux configures all memory to be Inner Shareable (SH[1:0] = 0b11), *not*
Non-shareable (SH[1:0] = 0b00). I think that the DSB NSH doesn't really do
anything, because the PE will not do any accesses to Non-shareable memory, and we
end up breaking the assumption of perf_aux_output_end().

Did I make a mistake in my reasoning?

Thanks,
Alex
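
The SH[1:0] encodings mentioned above can be decoded with a few lines of
C. This is a minimal sketch, assuming the stage 1 block/page descriptor
layout in which SH[1:0] occupies bits [9:8]; SH_SHIFT, SH_MASK and
pte_shareability() are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed position of SH[1:0] in a stage 1 block/page descriptor. */
#define SH_SHIFT	8
#define SH_MASK		(UINT64_C(3) << SH_SHIFT)

/* Return a human-readable name for the shareability field of a descriptor. */
static const char *pte_shareability(uint64_t desc)
{
	switch ((desc & SH_MASK) >> SH_SHIFT) {
	case 0:
		return "Non-shareable";
	case 2:
		return "Outer Shareable";
	case 3:
		return "Inner Shareable";
	default:
		return "Reserved";	/* 0b01 is not a valid encoding */
	}
}

/* The two encodings quoted in the mail above. */
static void pte_shareability_example(void)
{
	assert(strcmp(pte_shareability(UINT64_C(3) << SH_SHIFT),
		      "Inner Shareable") == 0);
	assert(strcmp(pte_shareability(0), "Non-shareable") == 0);
}
```

A descriptor with SH[1:0] = 0b11 therefore reports "Inner Shareable",
which is the attribute Linux uses for Normal memory.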

* Re: [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer
  2020-10-06 16:13     ` Alexandru Elisei
  (?)
@ 2020-10-19 12:24       ` Mark Rutland
  -1 siblings, 0 replies; 18+ messages in thread
From: Mark Rutland @ 2020-10-19 12:24 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: Marc Zyngier, linux-kernel, linux-arm-kernel, catalin.marinas,
	will, kvmarm

On Tue, Oct 06, 2020 at 05:13:31PM +0100, Alexandru Elisei wrote:
> Hi Marc,
> 
> Thank you for having a look at the patch!
> 
> On 10/6/20 4:32 PM, Marc Zyngier wrote:
> > Hi Alex,
> >
> > On Tue, 06 Oct 2020 16:05:20 +0100,
> > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> >> From ARM DDI 0487F.b, page D9-2807:
> >>
> >> "Although the Statistical Profiling Extension acts as another observer in
> >> the system, for determining the Shareability domain of the DSB
> >> instructions, the writes of sample records are treated as coming from the
> >> PE that is being profiled."
> >>
> >> Similarly, on page D9-2801:
> >>
> >> "The memory type and attributes that are used for a write by the
> >> Statistical Profiling Extension to the Profiling Buffer is taken from the
> >> translation table entries for the virtual address being written to. That
> >> is:
> >> - The writes are treated as coming from an observer that is coherent with
> >>   all observers in the Shareability domain that is defined by the
> >>   translation tables."
> >>
> >> All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
> >> writes to the profiling buffer have completed.
> > I'm a bit sceptical of this change. The SPE writes are per-CPU, and
> > all we are trying to ensure is that the CPU we are running on has
> > drained its own queue of accesses.
> >
> > The accesses being made within the IS domain doesn't invalidate the
> > fact that they are still per-CPU, because "the writes of sample
> > records are treated as coming from the PE that is being profiled.".
> >
> > So why should we have an IS-wide synchronisation for accesses that are
> > purely local?
> 
> I think I might have misunderstood how perf spe works. Below is my original train
> of thought.
> 
> In the buffer management event interrupt we drain the buffer, and if the buffer is
> full, we call arm_spe_perf_aux_output_end() -> perf_aux_output_end(). The comment
> for perf_aux_output_end() says "Commit the data written by hardware into the ring
> buffer by adjusting aux_head and posting a PERF_RECORD_AUX into the perf buffer.
> It is the pmu driver's responsibility to observe ordering rules of the hardware,
> so that all the data is externally visible before this is called." My conclusion
> was that after we drain the buffer, the data must be visible to all CPUs.

FWIW, this reasoning sounds correct to me. The DSB NSH will be
sufficient to drain the buffer, but we need the DSB ISH to ensure that
it's visible to other CPUs at the instant we call perf_aux_output_end().

Otherwise, if CPU x is reading the ring-buffer written by CPU y, it
might see the aux buffer pointers updated before the samples are
visible, and hence read junk from the buffer.

We can add a comment to that effect (or rework perf_aux_output_end()
somehow to handle that ordering).

Thanks,
Mark.
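
The hazard described above (CPU x seeing the updated pointer before the
samples) is the usual "publish data, then publish the index" ordering
problem. The userspace analogy below uses C11 atomics to show the
required ordering. It is only a sketch under stated assumptions: in the
SPE case the samples are written by the profiling hardware rather than by
CPU stores, so the driver needs PSB CSYNC plus a DSB instead of a release
store, and struct aux_ring with its helpers is hypothetical, not the perf
ring-buffer API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define AUX_SIZE 64

/* Hypothetical stand-in for an aux buffer and its head pointer. */
struct aux_ring {
	unsigned char data[AUX_SIZE];
	atomic_size_t head;		/* number of bytes published */
};

static void aux_init(struct aux_ring *rb)
{
	memset(rb->data, 0, sizeof(rb->data));
	atomic_init(&rb->head, 0);
}

/* Writer (CPU y): store the record first, then publish the new head. */
static int aux_publish(struct aux_ring *rb, const void *rec, size_t len)
{
	size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);

	if (len > AUX_SIZE - head)
		return -1;		/* no wrap-around in this sketch */

	memcpy(&rb->data[head], rec, len);
	/* Release: the record is visible before the head update is. */
	atomic_store_explicit(&rb->head, head + len, memory_order_release);
	return 0;
}

/* Reader (CPU x): read the head first, then only the published bytes. */
static size_t aux_consume(struct aux_ring *rb, void *out, size_t cap)
{
	/* Acquire: pairs with the release store in aux_publish(). */
	size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
	size_t n = head < cap ? head : cap;

	memcpy(out, rb->data, n);
	return n;
}

/* Single-threaded walk through the publish/consume pairing. */
static void aux_ring_example(void)
{
	struct aux_ring rb;
	unsigned char out[8];

	aux_init(&rb);
	assert(aux_publish(&rb, "abcd", 4) == 0);
	assert(aux_consume(&rb, out, sizeof(out)) == 4);
	assert(memcmp(out, "abcd", 4) == 0);
}
```

If the head update could become visible before the record, the reader
would copy bytes that have not been written yet, which is the "junk"
Mark refers to.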

* Re: [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer
  2020-10-19 12:24       ` Mark Rutland
  (?)
@ 2020-10-19 12:55         ` Marc Zyngier
  -1 siblings, 0 replies; 18+ messages in thread
From: Marc Zyngier @ 2020-10-19 12:55 UTC (permalink / raw)
  To: Mark Rutland
  Cc: will, catalin.marinas, linux-kernel, linux-arm-kernel, kvmarm

On 2020-10-19 13:24, Mark Rutland wrote:
> On Tue, Oct 06, 2020 at 05:13:31PM +0100, Alexandru Elisei wrote:
>> Hi Marc,
>> 
>> Thank you for having a look at the patch!
>> 
>> On 10/6/20 4:32 PM, Marc Zyngier wrote:
>> > Hi Alex,
>> >
>> > On Tue, 06 Oct 2020 16:05:20 +0100,
>> > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
>> >> From ARM DDI 0487F.b, page D9-2807:
>> >>
>> >> "Although the Statistical Profiling Extension acts as another observer in
>> >> the system, for determining the Shareability domain of the DSB
>> >> instructions, the writes of sample records are treated as coming from the
>> >> PE that is being profiled."
>> >>
>> >> Similarly, on page D9-2801:
>> >>
>> >> "The memory type and attributes that are used for a write by the
>> >> Statistical Profiling Extension to the Profiling Buffer is taken from the
>> >> translation table entries for the virtual address being written to. That
>> >> is:
>> >> - The writes are treated as coming from an observer that is coherent with
>> >>   all observers in the Shareability domain that is defined by the
>> >>   translation tables."
>> >>
>> >> All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
>> >> writes to the profiling buffer have completed.
>> > I'm a bit sceptical of this change. The SPE writes are per-CPU, and
>> > all we are trying to ensure is that the CPU we are running on has
>> > drained its own queue of accesses.
>> >
>> > The accesses being made within the IS domain doesn't invalidate the
>> > fact that they are still per-CPU, because "the writes of sample
>> > records are treated as coming from the PE that is being profiled.".
>> >
>> > So why should we have an IS-wide synchronisation for accesses that are
>> > purely local?
>> 
>> I think I might have misunderstood how perf spe works. Below is my original train
>> of thought.
>> 
>> In the buffer management event interrupt we drain the buffer, and if the buffer is
>> full, we call arm_spe_perf_aux_output_end() -> perf_aux_output_end(). The comment
>> for perf_aux_output_end() says "Commit the data written by hardware into the ring
>> buffer by adjusting aux_head and posting a PERF_RECORD_AUX into the perf buffer.
>> It is the pmu driver's responsibility to observe ordering rules of the hardware,
>> so that all the data is externally visible before this is called." My conclusion
>> was that after we drain the buffer, the data must be visible to all CPUs.
> 
> FWIW, this reasoning sounds correct to me. The DSB NSH will be
> sufficient to drain the buffer, but we need the DSB ISH to ensure that
> it's visible to other CPUs at the instant we call perf_aux_output_end().

Right. I think I missed that last bit (and Alex's email at the same time).

> Otherwise, if CPU x is reading the ring-buffer written by CPU y, it
> might see the aux buffer pointers updated before the samples are
> visible, and hence read junk from the buffer.
> 
> We can add a comment to that effect (or rework perf_aux_output_end()
> somehow to handle that ordering).

I'd rather this is done in perf_aux_output_end(), as a full-blown DSB ISH
on guest entry is pretty harsh... It would also nicely split the
responsibilities:

- KVM stops SPE and makes sure the output is drained
- Perf makes the data visible to all CPUs

Thoughts?

         M.
-- 
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer
  2020-10-19 12:24       ` Mark Rutland
  (?)
@ 2020-10-19 13:01         ` Will Deacon
  -1 siblings, 0 replies; 18+ messages in thread
From: Will Deacon @ 2020-10-19 13:01 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Marc Zyngier, linux-kernel, catalin.marinas, kvmarm,
	linux-arm-kernel

On Mon, Oct 19, 2020 at 01:24:55PM +0100, Mark Rutland wrote:
> On Tue, Oct 06, 2020 at 05:13:31PM +0100, Alexandru Elisei wrote:
> > On 10/6/20 4:32 PM, Marc Zyngier wrote:
> > > On Tue, 06 Oct 2020 16:05:20 +0100,
> > > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > >> From ARM DDI 0487F.b, page D9-2807:
> > >>
> > >> "Although the Statistical Profiling Extension acts as another observer in
> > >> the system, for determining the Shareability domain of the DSB
> > >> instructions, the writes of sample records are treated as coming from the
> > >> PE that is being profiled."
> > >>
> > >> Similarly, on page D9-2801:
> > >>
> > >> "The memory type and attributes that are used for a write by the
> > >> Statistical Profiling Extension to the Profiling Buffer is taken from the
> > >> translation table entries for the virtual address being written to. That
> > >> is:
> > >> - The writes are treated as coming from an observer that is coherent with
> > >>   all observers in the Shareability domain that is defined by the
> > >>   translation tables."
> > >>
> > >> All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
> > >> writes to the profiling buffer have completed.
> > > I'm a bit sceptical of this change. The SPE writes are per-CPU, and
> > > all we are trying to ensure is that the CPU we are running on has
> > > drained its own queue of accesses.
> > >
> > > The accesses being made within the IS domain doesn't invalidate the
> > > fact that they are still per-CPU, because "the writes of sample
> > > records are treated as coming from the PE that is being profiled.".
> > >
> > > So why should we have an IS-wide synchronisation for accesses that are
> > > purely local?
> > 
> > I think I might have misunderstood how perf spe works. Below is my original train
> > of thought.
> > 
> > In the buffer management event interrupt we drain the buffer, and if the buffer is
> > full, we call arm_spe_perf_aux_output_end() -> perf_aux_output_end(). The comment
> > for perf_aux_output_end() says "Commit the data written by hardware into the ring
> > buffer by adjusting aux_head and posting a PERF_RECORD_AUX into the perf buffer.
> > It is the pmu driver's responsibility to observe ordering rules of the hardware,
> > so that all the data is externally visible before this is called." My conclusion
> > was that after we drain the buffer, the data must be visible to all CPUs.
> 
> FWIW, this reasoning sounds correct to me. The DSB NSH will be
> sufficient to drain the buffer, but we need the DSB ISH to ensure that
> it's visible to other CPUs at the instant we call perf_aux_output_end().
> 
> Otherwise, if CPU x is reading the ring-buffer written by CPU y, it
> might see the aux buffer pointers updated before the samples are
> visible, and hence read junk from the buffer.
> 
> We can add a comment to that effect (or rework perf_aux_output_end()
> somehow to handle that ordering).

Given that DSB is about completion rather than ordering, that completion
only matters for endpoints, and that the endpoint in this scenario is part
of the same observer, DSB NSH should be sufficient. Ordering of accesses as
observed by other CPUs should be handled with DMB or acquire/release.

So if the aux buffer code is missing barriers, we should add them there,
like you proposed before:

https://lore.kernel.org/lkml/20180510130632.34497-1-mark.rutland@arm.com/

What happened to that?

Will
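
As a rough illustration of the acquire/release ordering mentioned above, the
publish/consume pattern can be modelled in user-space C11. This is only a
sketch and not the perf or SPE driver code: struct aux_buf, publish(),
consume() and NSAMPLES are hypothetical names, plain stores stand in for the
sample records that SPE writes in hardware, and the local drain
(psb_csync + DSB NSH) has no user-space equivalent and is not modelled.

```c
/* User-space C11 model of the ordering discussed above; not kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSAMPLES 64	/* hypothetical buffer size, in records */

struct aux_buf {
	unsigned long data[NSAMPLES];	/* stands in for the sample records */
	atomic_size_t head;		/* stands in for aux_head */
};

/* Producer ("CPU y"): write n records, then publish the new head. */
static void publish(struct aux_buf *b, size_t n)
{
	assert(n <= NSAMPLES);

	for (size_t i = 0; i < n; i++)
		b->data[i] = i + 1;

	/* Release: the record stores above are ordered before this store. */
	atomic_store_explicit(&b->head, n, memory_order_release);
}

/* Consumer ("CPU x"): read the head, then sum the records below it. */
static unsigned long consume(struct aux_buf *b)
{
	/* Acquire: pairs with the release store in publish(). */
	size_t n = atomic_load_explicit(&b->head, memory_order_acquire);
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += b->data[i];

	return sum;
}
```

If publish() and consume() run on different threads, a consumer that
observes the updated head is also guaranteed to observe the records written
before it, without any completion barrier on the producer side.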

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-10-19 13:03 UTC | newest]

Thread overview: 18+ messages
2020-10-06 15:05 [PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer Alexandru Elisei
2020-10-06 15:05 ` Alexandru Elisei
2020-10-06 15:05 ` Alexandru Elisei
2020-10-06 15:32 ` Marc Zyngier
2020-10-06 15:32   ` Marc Zyngier
2020-10-06 15:32   ` Marc Zyngier
2020-10-06 16:13   ` Alexandru Elisei
2020-10-06 16:13     ` Alexandru Elisei
2020-10-06 16:13     ` Alexandru Elisei
2020-10-19 12:24     ` Mark Rutland
2020-10-19 12:24       ` Mark Rutland
2020-10-19 12:24       ` Mark Rutland
2020-10-19 12:55       ` Marc Zyngier
2020-10-19 12:55         ` Marc Zyngier
2020-10-19 12:55         ` Marc Zyngier
2020-10-19 13:01       ` Will Deacon
2020-10-19 13:01         ` Will Deacon
2020-10-19 13:01         ` Will Deacon
