[PATCH 0/4] riscv: Allow userspace to directly access perf counters

linux-perf-users.vger.kernel.org archive mirror

* [PATCH 0/4] riscv: Allow userspace to directly access perf counters
@ 2023-04-13 16:17 Alexandre Ghiti
  2023-04-13 16:17 ` [PATCH 1/4] perf: Fix wrong comment about default event_idx Alexandre Ghiti
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-13 16:17 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: Alexandre Ghiti

riscv used to allow direct access to the cycle/time/instret counters,
bypassing the perf framework. This patchset intends to allow the user to
mmap any counter when it is accessed through perf. But we can't break the
existing behaviour, so we introduce a sysctl perf_user_access like arm64
does, which defaults to the legacy mode described above.
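
As a rough illustration of the three perf_user_access modes, the sketch
below models the SCOUNTEREN value chosen for each one. The mode values and
the 0x7/0x2 masks are taken from patch 4; grant_counter() and
revoke_counter() are hypothetical helper names that only mirror the
per-event bit set/clear done in the SBI driver's start/stop paths. This is
a userspace model, not kernel code.

```c
#include <assert.h>

/* Mode values as defined in patch 4 (drivers/perf/riscv_pmu_sbi.c). */
enum perf_user_access_mode {
	SYSCTL_NO_USER_ACCESS = 0,
	SYSCTL_USER_ACCESS = 1,
	SYSCTL_LEGACY = 2,
};

/*
 * Baseline SCOUNTEREN value per mode.
 * Bit 0 = cycle, bit 1 = time, bit 2 = instret.
 * Legacy mode exposes all three; the other modes keep only time.
 */
static unsigned long baseline_scounteren(int mode)
{
	assert(mode >= SYSCTL_NO_USER_ACCESS && mode <= SYSCTL_LEGACY);
	return mode == SYSCTL_LEGACY ? 0x7UL : 0x2UL;
}

/* Hypothetical helper: expose one counter, given its offset from CSR_CYCLE. */
static unsigned long grant_counter(unsigned long scounteren,
				   unsigned int csr_index)
{
	assert(csr_index < 32);
	return scounteren | (1UL << csr_index);
}

/* Hypothetical helper: hide that counter again when the event stops. */
static unsigned long revoke_counter(unsigned long scounteren,
				    unsigned int csr_index)
{
	assert(csr_index < 32);
	return scounteren & ~(1UL << csr_index);
}
```

In mode 1, a started event that was granted user access would add its own
bit on top of the 0x2 baseline and drop it again when stopped.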

The core of this patchset lies in patch 4; the first 3 patches are
simple fixes.

base-commit-tag: v6.3-rc1

Alexandre Ghiti (4):
  perf: Fix wrong comment about default event_idx
  include: riscv: Fix wrong include guard in riscv_pmu.h
  riscv: Make legacy counter enum match the HW numbering
  riscv: Enable perf counters user access only through perf

 Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
 arch/riscv/include/asm/perf_event.h         |   3 +
 arch/riscv/kernel/Makefile                  |   2 +-
 arch/riscv/kernel/perf_event.c              |  65 +++++++++++
 drivers/perf/riscv_pmu.c                    |  42 ++++++++
 drivers/perf/riscv_pmu_legacy.c             |  24 ++++-
 drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
 include/linux/perf/riscv_pmu.h              |   9 +-
 include/linux/perf_event.h                  |   3 +-
 tools/lib/perf/mmap.c                       |  65 +++++++++++
 10 files changed, 332 insertions(+), 17 deletions(-)
 create mode 100644 arch/riscv/kernel/perf_event.c

-- 
2.37.2



* [PATCH 1/4] perf: Fix wrong comment about default event_idx
  2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
@ 2023-04-13 16:17 ` Alexandre Ghiti
  2023-04-13 16:17 ` [PATCH 2/4] include: riscv: Fix wrong include guard in riscv_pmu.h Alexandre Ghiti
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-13 16:17 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: Alexandre Ghiti

The default event_idx implementation returns 0, not idx + 1.
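
For context, a reader of perf_event_mmap_page::index treats 0 as "no
directly readable counter" and any other value as the counter number plus
one. A minimal sketch of that convention, where user_counter_from_index()
is a hypothetical helper name and not an existing API:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decode the index field published in the mmap'ed page.
 * Returns false when index is 0 (the default), meaning the event cannot
 * be read directly; otherwise stores index - 1 as the counter to read.
 */
static bool user_counter_from_index(uint32_t index, uint32_t *counter)
{
	if (index == 0)
		return false;

	*counter = index - 1;
	return true;
}
```

This is why the default has to be 0: a default of idx + 1 would advertise
a readable counter for every PMU that does not implement event_idx.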

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 include/linux/perf_event.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d5628a7b5eaa..56fe43b20966 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -442,7 +442,8 @@ struct pmu {
 
 	/*
 	 * Will return the value for perf_event_mmap_page::index for this event,
-	 * if no implementation is provided it will default to: event->hw.idx + 1.
+	 * if no implementation is provided it will default to 0 (see
+	 * perf_event_idx_default).
 	 */
 	int (*event_idx)		(struct perf_event *event); /*optional */
 
-- 
2.37.2



* [PATCH 2/4] include: riscv: Fix wrong include guard in riscv_pmu.h
  2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
  2023-04-13 16:17 ` [PATCH 1/4] perf: Fix wrong comment about default event_idx Alexandre Ghiti
@ 2023-04-13 16:17 ` Alexandre Ghiti
  2023-04-18 18:26   ` Conor Dooley
  2023-04-13 16:17 ` [PATCH 3/4] riscv: Make legacy counter enum match the HW numbering Alexandre Ghiti
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-13 16:17 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: Alexandre Ghiti

The current include guard prevents the inclusion of asm/perf_event.h
which uses the same include guard: fix the one in riscv_pmu.h so that it
matches the file name.
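
The effect of a shared guard can be reproduced in a single file. The sketch
below simulates the headers inline; all DEMO_* names are made up for the
illustration and do not exist in the kernel.

```c
/* First "header": stands in for riscv_pmu.h carrying the wrong guard. */
#ifndef DEMO_ASM_PERF_EVENT_H
#define DEMO_ASM_PERF_EVENT_H
#define DEMO_PMU_BODY 1
#endif

/*
 * Second "header": stands in for asm/perf_event.h. It uses the same guard
 * name, which is already defined, so its body is silently skipped.
 */
#ifndef DEMO_ASM_PERF_EVENT_H
#define DEMO_ASM_PERF_EVENT_H
#define DEMO_ARCH_BODY 1
#endif

/* Third "header": a guard of its own, so its body is processed. */
#ifndef DEMO_RISCV_PMU_H
#define DEMO_RISCV_PMU_H
#define DEMO_UNIQUE_BODY 1
#endif

/* Returns 1 if the body of the colliding header was seen, else 0. */
static int arch_body_seen(void)
{
#ifdef DEMO_ARCH_BODY
	return 1;
#else
	return 0;
#endif
}

/* Returns 1 if the body of the uniquely guarded header was seen, else 0. */
static int unique_body_seen(void)
{
#ifdef DEMO_UNIQUE_BODY
	return 1;
#else
	return 0;
#endif
}
```

With the colliding guard, anything the second header defines is missing
from the translation unit, with no diagnostic at the include site.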

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 include/linux/perf/riscv_pmu.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 43fc892aa7d9..9f70d94942e0 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -6,8 +6,8 @@
  *
  */
 
-#ifndef _ASM_RISCV_PERF_EVENT_H
-#define _ASM_RISCV_PERF_EVENT_H
+#ifndef _RISCV_PMU_H
+#define _RISCV_PMU_H
 
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
@@ -81,4 +81,4 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr);
 
 #endif /* CONFIG_RISCV_PMU */
 
-#endif /* _ASM_RISCV_PERF_EVENT_H */
+#endif /* _RISCV_PMU_H */
-- 
2.37.2



* Re: [PATCH 2/4] include: riscv: Fix wrong include guard in riscv_pmu.h
  2023-04-13 16:17 ` [PATCH 2/4] include: riscv: Fix wrong include guard in riscv_pmu.h Alexandre Ghiti
@ 2023-04-18 18:26   ` Conor Dooley
  0 siblings, 0 replies; 26+ messages in thread
From: Conor Dooley @ 2023-04-18 18:26 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel


On Thu, Apr 13, 2023 at 06:17:23PM +0200, Alexandre Ghiti wrote:
> The current include guard prevents the inclusion of asm/perf_event.h
> which uses the same include guard: fix the one in riscv_pmu.h so that it
> matches the file name.
> 
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>

lol, good one.
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>

Thanks,
Conor.



* [PATCH 3/4] riscv: Make legacy counter enum match the HW numbering
  2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
  2023-04-13 16:17 ` [PATCH 1/4] perf: Fix wrong comment about default event_idx Alexandre Ghiti
  2023-04-13 16:17 ` [PATCH 2/4] include: riscv: Fix wrong include guard in riscv_pmu.h Alexandre Ghiti
@ 2023-04-13 16:17 ` Alexandre Ghiti
  2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-13 16:17 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: Alexandre Ghiti

RISCV_PMU_LEGACY_INSTRET used to be set to 1 whereas the offset of this
hardware counter from CSR_CYCLE is actually 2: make this offset match the
real hw offset so that we can directly expose those values to userspace.
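
To make the numbering concrete: cycle, time and instret sit at CSR numbers
0xc00, 0xc01 and 0xc02, so an enum that includes TIME yields indices equal
to the offset from CSR_CYCLE. A small sketch, where the DEMO_ prefix and
legacy_idx_to_csr() are illustrative names only:

```c
#include <assert.h>

/* User-level counter CSR numbers from the RISC-V ISA. */
#define DEMO_CSR_CYCLE		0xc00
#define DEMO_CSR_TIME		0xc01
#define DEMO_CSR_INSTRET	0xc02

/* Same ordering as the enum introduced by this patch. */
enum {
	DEMO_LEGACY_CYCLE,	/* 0 */
	DEMO_LEGACY_TIME,	/* 1 */
	DEMO_LEGACY_INSTRET	/* 2 */
};

/* With matching numbering, the CSR is just base plus index. */
static unsigned int legacy_idx_to_csr(unsigned int idx)
{
	assert(idx <= DEMO_LEGACY_INSTRET);
	return DEMO_CSR_CYCLE + idx;
}
```

With the old value of 1 for INSTRET, the same addition would have pointed
userspace at the time CSR instead.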

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 drivers/perf/riscv_pmu_legacy.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index ca9e20bfc7ac..0d8c9d8849ee 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -12,8 +12,11 @@
 #include <linux/perf/riscv_pmu.h>
 #include <linux/platform_device.h>
 
-#define RISCV_PMU_LEGACY_CYCLE		0
-#define RISCV_PMU_LEGACY_INSTRET	1
+enum {
+	RISCV_PMU_LEGACY_CYCLE,
+	RISCV_PMU_LEGACY_TIME,
+	RISCV_PMU_LEGACY_INSTRET
+};
 
 static bool pmu_init_done;
 
-- 
2.37.2



* [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
                   ` (2 preceding siblings ...)
  2023-04-13 16:17 ` [PATCH 3/4] riscv: Make legacy counter enum match the HW numbering Alexandre Ghiti
@ 2023-04-13 16:17 ` Alexandre Ghiti
  2023-04-13 21:20   ` kernel test robot
                     ` (3 more replies)
  2023-04-13 16:36 ` [PATCH 0/4] riscv: Allow userspace to directly access perf counters Ian Rogers
  2023-04-13 19:17 ` Atish Patra
  5 siblings, 4 replies; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-13 16:17 UTC (permalink / raw)
  To: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: Alexandre Ghiti

We used to unconditionally expose the cycle and instret CSRs to
userspace, which gives rise to security concerns.

So only allow access to hw counters from userspace through the perf
framework, which will handle context switches, per-task events, etc. But
as we cannot break userspace, we give the user the choice to go back to
the previous behaviour by setting the sysctl perf_user_access.

We also introduce a means to directly map the hardware counters to
userspace, thus avoiding the need for syscalls whenever an application
wants to access counter values.
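
The syscall-free read follows the usual perf_event_mmap_page protocol:
sample the seqlock, read index and offset, read the hardware counter,
sign-extend it to pmc_width bits, and retry if the lock changed. The sketch
below models this against a mock page so that it can run anywhere; struct
mock_userpage, mock_counter() and read_event_count() are assumptions made
for illustration, and real code would use struct perf_event_mmap_page and
a CSR read instead. It relies on GCC/Clang fence builtins and on arithmetic
right shift of signed values.

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the few perf_event_mmap_page fields the read loop needs. */
struct mock_userpage {
	volatile uint32_t lock;		/* seqlock bumped by the kernel */
	uint32_t index;			/* 0: not readable, else counter + 1 */
	int64_t offset;			/* value to add to the raw counter */
	uint16_t pmc_width;		/* implemented counter bits, 1..64 */
	unsigned int cap_user_rdpmc : 1;
};

typedef uint64_t (*counter_read_fn)(uint32_t counter);

/* Stand-in for a CSR read: returns a fixed, recognisable value. */
static uint64_t mock_counter(uint32_t counter)
{
	return 0x1000u + counter;
}

static uint64_t read_event_count(const struct mock_userpage *pc,
				 counter_read_fn read_counter)
{
	uint32_t seq, idx;
	uint64_t count;

	do {
		seq = pc->lock;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		idx = pc->index;
		count = (uint64_t)pc->offset;
		if (pc->cap_user_rdpmc && idx) {
			unsigned int shift;
			int64_t pmc;

			assert(pc->pmc_width >= 1 && pc->pmc_width <= 64);
			shift = 64 - pc->pmc_width;
			/* Sign-extend the counter from pmc_width to 64 bits. */
			pmc = (int64_t)(read_counter(idx - 1) << shift);
			pmc >>= shift;
			count += (uint64_t)pmc;
		}

		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while (pc->lock != seq);

	return count;
}
```

When index is 0 or cap_user_rdpmc is clear, this sketch returns only the
offset; a real caller would fall back to read(2) on the event fd in that
case.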

Note that arch_perf_update_userpage is a copy of arm64 code.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
 arch/riscv/include/asm/perf_event.h         |   3 +
 arch/riscv/kernel/Makefile                  |   2 +-
 arch/riscv/kernel/perf_event.c              |  65 +++++++++++
 drivers/perf/riscv_pmu.c                    |  42 ++++++++
 drivers/perf/riscv_pmu_legacy.c             |  17 +++
 drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
 include/linux/perf/riscv_pmu.h              |   3 +
 tools/lib/perf/mmap.c                       |  65 +++++++++++
 9 files changed, 322 insertions(+), 11 deletions(-)
 create mode 100644 arch/riscv/kernel/perf_event.c

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 4b7bfea28cd7..02b2a40a3647 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
 The default value is 8.
 
 
-perf_user_access (arm64 only)
-=================================
+perf_user_access (arm64 and riscv only)
+=======================================
+
+Controls user space access for reading perf event counters.
 
-Controls user space access for reading perf event counters. When set to 1,
-user space can read performance monitor counter registers directly.
+arm64
+=====
 
 The default value is 0 (access disabled).
+When set to 1, user space can read performance monitor counter registers
+directly.
 
 See Documentation/arm64/perf.rst for more information.
 
+riscv
+=====
+
+When set to 0, user access is disabled.
+
+When set to 1, user space can read performance monitor counter registers
+directly only through perf; any direct access without perf intervention will
+trigger an illegal instruction.
+
+The default value is 2, which enables legacy mode: user space has
+direct access to the cycle, time and instret CSRs only.
 
 pid_max
 =======
diff --git a/arch/riscv/include/asm/perf_event.h b/arch/riscv/include/asm/perf_event.h
index d42c901f9a97..9fdfdd9dc92d 100644
--- a/arch/riscv/include/asm/perf_event.h
+++ b/arch/riscv/include/asm/perf_event.h
@@ -9,5 +9,8 @@
 #define _ASM_RISCV_PERF_EVENT_H
 
 #include <linux/perf_event.h>
+
+#define PERF_EVENT_FLAG_LEGACY	1
+
 #define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs
 #endif /* _ASM_RISCV_PERF_EVENT_H */
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index aa22f87faeae..9ae951b07847 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
 
 obj-$(CONFIG_TRACE_IRQFLAGS)	+= trace_irq.o
 
-obj-$(CONFIG_PERF_EVENTS)	+= perf_callchain.o
+obj-$(CONFIG_PERF_EVENTS)	+= perf_callchain.o perf_event.o
 obj-$(CONFIG_HAVE_PERF_REGS)	+= perf_regs.o
 obj-$(CONFIG_RISCV_SBI)		+= sbi.o
 ifeq ($(CONFIG_RISCV_SBI), y)
diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
new file mode 100644
index 000000000000..4a75ab628bfb
--- /dev/null
+++ b/arch/riscv/kernel/perf_event.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/perf/riscv_pmu.h>
+#include <linux/sched_clock.h>
+
+void arch_perf_update_userpage(struct perf_event *event,
+			       struct perf_event_mmap_page *userpg, u64 now)
+{
+	struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+	struct clock_read_data *rd;
+	unsigned int seq;
+	u64 ns;
+
+	userpg->cap_user_time = 0;
+	userpg->cap_user_time_zero = 0;
+	userpg->cap_user_time_short = 0;
+	userpg->cap_user_rdpmc =
+		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
+
+	/*
+	 * The counters are 64-bit but the priv spec doesn't mandate all the
+	 * bits to be implemented: that's why, counter width can vary based on
+	 * the cpu vendor.
+	 */
+	userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
+
+	do {
+		rd = sched_clock_read_begin(&seq);
+
+		userpg->time_mult = rd->mult;
+		userpg->time_shift = rd->shift;
+		userpg->time_zero = rd->epoch_ns;
+		userpg->time_cycles = rd->epoch_cyc;
+		userpg->time_mask = rd->sched_clock_mask;
+
+		/*
+		 * Subtract the cycle base, such that software that
+		 * doesn't know about cap_user_time_short still 'works'
+		 * assuming no wraps.
+		 */
+		ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
+		userpg->time_zero -= ns;
+
+	} while (sched_clock_read_retry(seq));
+
+	userpg->time_offset = userpg->time_zero - now;
+
+	/*
+	 * time_shift is not expected to be greater than 31 due to
+	 * the original published conversion algorithm shifting a
+	 * 32-bit value (now specifies a 64-bit value) - refer
+	 * perf_event_mmap_page documentation in perf_event.h.
+	 */
+	if (userpg->time_shift == 32) {
+		userpg->time_shift = 31;
+		userpg->time_mult >>= 1;
+	}
+
+	/*
+	 * Internal timekeeping for enabled/running/stopped times
+	 * is always computed with the sched_clock.
+	 */
+	userpg->cap_user_time = 1;
+	userpg->cap_user_time_zero = 1;
+	userpg->cap_user_time_short = 1;
+}
diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
index ebca5eab9c9b..12675ee1123c 100644
--- a/drivers/perf/riscv_pmu.c
+++ b/drivers/perf/riscv_pmu.c
@@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
 
 	local64_set(&hwc->prev_count, (u64)-left);
 
+	perf_event_update_userpage(event);
+
 	return overflow;
 }
 
@@ -283,6 +285,43 @@ static int riscv_pmu_event_init(struct perf_event *event)
 	return 0;
 }
 
+static int riscv_pmu_event_idx(struct perf_event *event)
+{
+	struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+	if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
+		return 0;
+
+	/*
+	 * cycle and instret can either be retrieved from their fixed counters
+	 * or from programmable counters, the latter being the preferred way
+	 * since cycle and instret counters do not support sampling.
+	 */
+
+	return rvpmu->csr_index(event) + 1;
+}
+
+static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
+{
+	/*
+	 * The user mmapped the event to directly access it: this is where
+	 * we determine based on sysctl_perf_user_access if we grant userspace
+	 * the direct access to this event. That means that within the same
+	 * task, some events may be directly accessible and some other may not,
+	 * if the user changes the value of sysctl_perf_user_access in the
+	 * meantime.
+	 */
+	struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
+
+	event->hw.flags |= rvpmu->event_flags(event);
+	perf_event_update_userpage(event);
+}
+
+static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
+{
+	event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
 struct riscv_pmu *riscv_pmu_alloc(void)
 {
 	struct riscv_pmu *pmu;
@@ -307,6 +346,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
 	}
 	pmu->pmu = (struct pmu) {
 		.event_init	= riscv_pmu_event_init,
+		.event_mapped	= riscv_pmu_event_mapped,
+		.event_unmapped	= riscv_pmu_event_unmapped,
+		.event_idx	= riscv_pmu_event_idx,
 		.add		= riscv_pmu_add,
 		.del		= riscv_pmu_del,
 		.start		= riscv_pmu_start,
diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
index 0d8c9d8849ee..35c4c9097a0f 100644
--- a/drivers/perf/riscv_pmu_legacy.c
+++ b/drivers/perf/riscv_pmu_legacy.c
@@ -74,6 +74,21 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
 	local64_set(&hwc->prev_count, initial_val);
 }
 
+static uint8_t pmu_legacy_csr_index(struct perf_event *event)
+{
+	return event->hw.idx;
+}
+
+static int pmu_legacy_event_flags(struct perf_event *event)
+{
+	/* In legacy mode, the first 3 CSRs are available. */
+	if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
+	    event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
+		return 0;
+
+	return PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
 /*
  * This is just a simple implementation to allow legacy implementations
  * compatible with new RISC-V PMU driver framework.
@@ -94,6 +109,8 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
 	pmu->ctr_get_width = NULL;
 	pmu->ctr_clear_idx = NULL;
 	pmu->ctr_read = pmu_legacy_read_ctr;
+	pmu->event_flags = pmu_legacy_event_flags;
+	pmu->csr_index = pmu_legacy_csr_index;
 
 	perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
 }
diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
index 70cb50fd41c2..af7f3128b6b8 100644
--- a/drivers/perf/riscv_pmu_sbi.c
+++ b/drivers/perf/riscv_pmu_sbi.c
@@ -24,6 +24,10 @@
 #include <asm/sbi.h>
 #include <asm/hwcap.h>
 
+#define SYSCTL_NO_USER_ACCESS	0
+#define SYSCTL_USER_ACCESS	1
+#define SYSCTL_LEGACY		2
+
 PMU_FORMAT_ATTR(event, "config:0-47");
 PMU_FORMAT_ATTR(firmware, "config:63");
 
@@ -43,6 +47,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
 	NULL,
 };
 
+/* Allow legacy access by default */
+static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
+
 /*
  * RISC-V doesn't have heterogeneous harts yet. This need to be part of
  * per_cpu in case of harts with different pmu counters
@@ -301,6 +308,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
 }
 EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
 
+static uint8_t pmu_sbi_csr_index(struct perf_event *event)
+{
+	return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
+}
+
 static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
 {
 	unsigned long cflags = 0;
@@ -329,18 +341,30 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
 	struct sbiret ret;
 	int idx;
-	uint64_t cbase = 0;
+	uint64_t cbase = 0, cmask = rvpmu->cmask;
 	unsigned long cflags = 0;
 
 	cflags = pmu_sbi_get_filter_flags(event);
+
+	/* In legacy mode, we have to force the fixed counters for those events */
+	if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
+		if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
+			cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
+			cmask = 1;
+		} else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
+			cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
+			cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
+		}
+	}
+
 	/* retrieve the available counter index */
 #if defined(CONFIG_32BIT)
 	ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
-			rvpmu->cmask, cflags, hwc->event_base, hwc->config,
+			cmask, cflags, hwc->event_base, hwc->config,
 			hwc->config >> 32);
 #else
 	ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
-			rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
+			cmask, cflags, hwc->event_base, hwc->config, 0);
 #endif
 	if (ret.error) {
 		pr_debug("Not able to find a counter for event %lx config %llx\n",
@@ -490,6 +514,11 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
 	if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
 		pr_err("Starting counter idx %d failed with error %d\n",
 			hwc->idx, sbi_err_map_linux_errno(ret.error));
+
+	if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
+	    event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
+		csr_write(CSR_SCOUNTEREN,
+			  csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
 }
 
 static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
@@ -497,6 +526,11 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
 	struct sbiret ret;
 	struct hw_perf_event *hwc = &event->hw;
 
+	if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
+	    event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
+		csr_write(CSR_SCOUNTEREN,
+			  csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
+
 	ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
 	if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
 		flag != SBI_PMU_STOP_FLAG_RESET)
@@ -704,10 +738,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
 	struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
 
 	/*
-	 * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
-	 * as is necessary to maintain uABI compatibility.
+	 * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
+	 * legacy option but that will be removed in the future.
 	 */
-	csr_write(CSR_SCOUNTEREN, 0x7);
+	if (sysctl_perf_user_access == SYSCTL_LEGACY)
+		csr_write(CSR_SCOUNTEREN, 0x7);
+	else
+		csr_write(CSR_SCOUNTEREN, 0x2);
 
 	/* Stop all the counters so that they can be enabled from perf */
 	pmu_sbi_stop_all(pmu);
@@ -851,6 +888,66 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
 	cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
 }
 
+static int pmu_sbi_event_flags(struct perf_event *event)
+{
+	if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
+		return 0;
+
+	/* In legacy mode, the first 3 CSRs are available. */
+	if (sysctl_perf_user_access == SYSCTL_LEGACY) {
+		int flags = PERF_EVENT_FLAG_LEGACY;
+
+		if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES ||
+		    event->attr.config == PERF_COUNT_HW_INSTRUCTIONS)
+			flags |= PERF_EVENT_FLAG_USER_READ_CNT;
+
+		return flags;
+	}
+
+	return PERF_EVENT_FLAG_USER_READ_CNT;
+}
+
+static void riscv_pmu_update_counter_access(void *info)
+{
+	if (sysctl_perf_user_access == SYSCTL_LEGACY)
+		csr_write(CSR_SCOUNTEREN, 0x7);
+	else
+		csr_write(CSR_SCOUNTEREN, 0x2);
+}
+
+static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
+					      int write, void *buffer,
+					      size_t *lenp, loff_t *ppos)
+{
+	int prev = sysctl_perf_user_access;
+	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	/*
+	 * Test against the previous value since we clear SCOUNTEREN when
+	 * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
+	 * not do that if that was already the case.
+	 */
+	if (ret || !write || prev == sysctl_perf_user_access)
+		return ret;
+
+	on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
+
+	return 0;
+}
+
+static struct ctl_table sbi_pmu_sysctl_table[] = {
+	{
+		.procname       = "perf_user_access",
+		.data		= &sysctl_perf_user_access,
+		.maxlen		= sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler	= riscv_pmu_proc_user_access_handler,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_TWO,
+	},
+	{ }
+};
+
 static int pmu_sbi_device_probe(struct platform_device *pdev)
 {
 	struct riscv_pmu *pmu = NULL;
@@ -888,6 +985,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
 	pmu->ctr_get_width = pmu_sbi_ctr_get_width;
 	pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
 	pmu->ctr_read = pmu_sbi_ctr_read;
+	pmu->event_flags = pmu_sbi_event_flags;
+	pmu->csr_index = pmu_sbi_csr_index;
 
 	ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
 	if (ret)
@@ -901,6 +1000,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
 	if (ret)
 		goto out_unregister;
 
+	register_sysctl("kernel", sbi_pmu_sysctl_table);
+
 	return 0;
 
 out_unregister:
diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
index 9f70d94942e0..ba19634d815c 100644
--- a/include/linux/perf/riscv_pmu.h
+++ b/include/linux/perf/riscv_pmu.h
@@ -12,6 +12,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/interrupt.h>
+#include <asm/perf_event.h>
 
 #ifdef CONFIG_RISCV_PMU
 
@@ -55,6 +56,8 @@ struct riscv_pmu {
 	void		(*ctr_start)(struct perf_event *event, u64 init_val);
 	void		(*ctr_stop)(struct perf_event *event, unsigned long flag);
 	int		(*event_map)(struct perf_event *event, u64 *config);
+	int		(*event_flags)(struct perf_event *event);
+	uint8_t		(*csr_index)(struct perf_event *event);
 
 	struct cpu_hw_events	__percpu *hw_events;
 	struct hlist_node	node;
diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
index 0d1634cedf44..18f2abb1584a 100644
--- a/tools/lib/perf/mmap.c
+++ b/tools/lib/perf/mmap.c
@@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
 
 static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
 
+#elif defined(__riscv) && __riscv_xlen == 64
+
+#define CSR_CYCLE	0xc00
+#define CSR_TIME	0xc01
+#define CSR_CYCLEH	0xc80
+
+#define csr_read(csr)						\
+({								\
+	register unsigned long __v;				\
+		__asm__ __volatile__ ("csrr %0, " #csr		\
+		 : "=r" (__v) :					\
+		 : "memory");					\
+		 __v;						\
+})
+
+static unsigned long csr_read_num(int csr_num)
+{
+#define switchcase_csr_read(__csr_num, __val)           {\
+	case __csr_num:                                 \
+		__val = csr_read(__csr_num);            \
+		break; }
+#define switchcase_csr_read_2(__csr_num, __val)         {\
+	switchcase_csr_read(__csr_num + 0, __val)        \
+	switchcase_csr_read(__csr_num + 1, __val)}
+#define switchcase_csr_read_4(__csr_num, __val)         {\
+	switchcase_csr_read_2(__csr_num + 0, __val)      \
+	switchcase_csr_read_2(__csr_num + 2, __val)}
+#define switchcase_csr_read_8(__csr_num, __val)         {\
+	switchcase_csr_read_4(__csr_num + 0, __val)      \
+	switchcase_csr_read_4(__csr_num + 4, __val)}
+#define switchcase_csr_read_16(__csr_num, __val)        {\
+	switchcase_csr_read_8(__csr_num + 0, __val)      \
+	switchcase_csr_read_8(__csr_num + 8, __val)}
+#define switchcase_csr_read_32(__csr_num, __val)        {\
+	switchcase_csr_read_16(__csr_num + 0, __val)     \
+	switchcase_csr_read_16(__csr_num + 16, __val)}
+
+	unsigned long ret = 0;
+
+	switch (csr_num) {
+	switchcase_csr_read_32(CSR_CYCLE, ret)
+	switchcase_csr_read_32(CSR_CYCLEH, ret)
+	default :
+		break;
+	}
+
+	return ret;
+#undef switchcase_csr_read_32
+#undef switchcase_csr_read_16
+#undef switchcase_csr_read_8
+#undef switchcase_csr_read_4
+#undef switchcase_csr_read_2
+#undef switchcase_csr_read
+}
+
+static u64 read_perf_counter(unsigned int counter)
+{
+	return csr_read_num(CSR_CYCLE + counter);
+}
+
+static u64 read_timestamp(void)
+{
+	return csr_read_num(CSR_TIME);
+}
+
 #else
 static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
 static u64 read_timestamp(void) { return 0; }
-- 
2.37.2



* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
@ 2023-04-13 21:20   ` kernel test robot
  2023-04-14  2:09   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2023-04-13 21:20 UTC (permalink / raw)
  To: Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: oe-kbuild-all, Alexandre Ghiti

Hi Alexandre,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on acme/perf/core tip/master tip/auto-latest linus/master v6.3-rc6]
[cannot apply to next-20230413]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexandre-Ghiti/perf-Fix-wrong-comment-about-default-event_idx/20230414-002232
patch link:    https://lore.kernel.org/r/20230413161725.195417-5-alexghiti%40rivosinc.com
patch subject: [PATCH 4/4] riscv: Enable perf counters user access only through perf
config: riscv-randconfig-r021-20230412 (https://download.01.org/0day-ci/archive/20230414/202304140522.RGhxahvD-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/8ca9b21cbf2c0b91ee35356c01aef9da7d874e55
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Alexandre-Ghiti/perf-Fix-wrong-comment-about-default-event_idx/20230414-002232
        git checkout 8ca9b21cbf2c0b91ee35356c01aef9da7d874e55
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash arch/riscv/kernel/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304140522.RGhxahvD-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   arch/riscv/kernel/perf_event.c: In function 'arch_perf_update_userpage':
>> arch/riscv/kernel/perf_event.c:8:35: error: implicit declaration of function 'to_riscv_pmu' [-Werror=implicit-function-declaration]
       8 |         struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
         |                                   ^~~~~~~~~~~~
>> arch/riscv/kernel/perf_event.c:8:35: warning: initialization of 'struct riscv_pmu *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
>> arch/riscv/kernel/perf_event.c:24:34: error: invalid use of undefined type 'struct riscv_pmu'
      24 |         userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
         |                                  ^~
   cc1: some warnings being treated as errors


vim +/to_riscv_pmu +8 arch/riscv/kernel/perf_event.c

     4	
     5	void arch_perf_update_userpage(struct perf_event *event,
     6				       struct perf_event_mmap_page *userpg, u64 now)
     7	{
   > 8		struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
     9		struct clock_read_data *rd;
    10		unsigned int seq;
    11		u64 ns;
    12	
    13		userpg->cap_user_time = 0;
    14		userpg->cap_user_time_zero = 0;
    15		userpg->cap_user_time_short = 0;
    16		userpg->cap_user_rdpmc =
    17			!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
    18	
    19		/*
    20		 * The counters are 64-bit but the priv spec doesn't mandate all the
    21		 * bits to be implemented: that's why, counter width can vary based on
    22		 * the cpu vendor.
    23		 */
  > 24		userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
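
A possible explanation, not verified against this randconfig: the patch builds
perf_event.o under CONFIG_PERF_EVENTS, while include/linux/perf/riscv_pmu.h
only provides struct riscv_pmu and to_riscv_pmu() inside
"#ifdef CONFIG_RISCV_PMU". With PERF_EVENTS=y and RISCV_PMU=n, both would be
missing, which matches the three diagnostics above. As far as I recall,
to_riscv_pmu() is a container_of() wrapper; the userspace sketch below uses
stand-in struct layouts and a hypothetical lookup_riscv_pmu() helper purely to
show what the missing declaration does.

```c
#include <stddef.h>

/* Stand-in types: the real ones live in the kernel headers. */
struct pmu {
	int type;
};

struct riscv_pmu {
	const char *name;
	struct pmu pmu;		/* embedded generic PMU */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define to_riscv_pmu(p) container_of(p, struct riscv_pmu, pmu)

/* Hypothetical helper: maps a generic pmu pointer back to its riscv_pmu. */
struct riscv_pmu *lookup_riscv_pmu(struct pmu *p)
{
	return to_riscv_pmu(p);
}
```

Without a visible definition of the macro, the compiler treats to_riscv_pmu()
as an undeclared function returning int, hence the int-to-pointer complaint.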

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
  2023-04-13 21:20   ` kernel test robot
@ 2023-04-14  2:09   ` kernel test robot
  2023-04-26 12:57   ` Andrew Jones
  2023-05-01  2:09   ` Bagas Sanjaya
  3 siblings, 0 replies; 26+ messages in thread
From: kernel test robot @ 2023-04-14  2:09 UTC (permalink / raw)
  To: Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel
  Cc: llvm, oe-kbuild-all, Alexandre Ghiti

Hi Alexandre,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on acme/perf/core tip/master tip/auto-latest linus/master v6.3-rc6]
[cannot apply to next-20230413]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Alexandre-Ghiti/perf-Fix-wrong-comment-about-default-event_idx/20230414-002232
patch link:    https://lore.kernel.org/r/20230413161725.195417-5-alexghiti%40rivosinc.com
patch subject: [PATCH 4/4] riscv: Enable perf counters user access only through perf
config: riscv-randconfig-r036-20230412 (https://download.01.org/0day-ci/archive/20230414/202304140904.9oAVhFHu-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project 9638da200e00bd069e6dd63604e14cbafede9324)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/8ca9b21cbf2c0b91ee35356c01aef9da7d874e55
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Alexandre-Ghiti/perf-Fix-wrong-comment-about-default-event_idx/20230414-002232
        git checkout 8ca9b21cbf2c0b91ee35356c01aef9da7d874e55
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=riscv SHELL=/bin/bash arch/riscv/kernel/

If you fix the issue, kindly add the following tags where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304140904.9oAVhFHu-lkp@intel.com/

All errors (new ones prefixed by >>):

>> arch/riscv/kernel/perf_event.c:8:28: error: call to undeclared function 'to_riscv_pmu'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
                                     ^
>> arch/riscv/kernel/perf_event.c:8:20: error: incompatible integer to pointer conversion initializing 'struct riscv_pmu *' with an expression of type 'int' [-Wint-conversion]
           struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
                             ^       ~~~~~~~~~~~~~~~~~~~~~~~~
>> arch/riscv/kernel/perf_event.c:24:27: error: incomplete definition of type 'struct riscv_pmu'
           userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
                               ~~~~~^
   arch/riscv/kernel/perf_event.c:8:9: note: forward declaration of 'struct riscv_pmu'
           struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
                  ^
   3 errors generated.


vim +/to_riscv_pmu +8 arch/riscv/kernel/perf_event.c

     4	
     5	void arch_perf_update_userpage(struct perf_event *event,
     6				       struct perf_event_mmap_page *userpg, u64 now)
     7	{
   > 8		struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
     9		struct clock_read_data *rd;
    10		unsigned int seq;
    11		u64 ns;
    12	
    13		userpg->cap_user_time = 0;
    14		userpg->cap_user_time_zero = 0;
    15		userpg->cap_user_time_short = 0;
    16		userpg->cap_user_rdpmc =
    17			!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
    18	
    19		/*
    20		 * The counters are 64-bit but the priv spec doesn't mandate all the
    21		 * bits to be implemented: that's why, counter width can vary based on
    22		 * the cpu vendor.
    23		 */
  > 24		userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
  2023-04-13 21:20   ` kernel test robot
  2023-04-14  2:09   ` kernel test robot
@ 2023-04-26 12:57   ` Andrew Jones
  2023-04-26 13:17     ` Alexandre Ghiti
  2023-05-01  2:09   ` Bagas Sanjaya
  3 siblings, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2023-04-26 12:57 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel

On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
> We used to unconditionally expose the cycle and instret csrs to
> userspace, which gives rise to security concerns.
> 
> So only allow access to hw counters from userspace through the perf
> framework which will handle context switches, per-task events, etc. But
> as we cannot break userspace, we give the user the choice to go back to
> the previous behaviour by setting the sysctl perf_user_access.
> 
> We also introduce a means to directly map the hardware counters to
> userspace, thus avoiding the need for syscalls whenever an application
> wants to access counters values.
> 
> Note that arch_perf_update_userpage is a copy of arm64 code.
> 
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
>  arch/riscv/include/asm/perf_event.h         |   3 +
>  arch/riscv/kernel/Makefile                  |   2 +-
>  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
>  drivers/perf/riscv_pmu.c                    |  42 ++++++++
>  drivers/perf/riscv_pmu_legacy.c             |  17 +++
>  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
>  include/linux/perf/riscv_pmu.h              |   3 +
>  tools/lib/perf/mmap.c                       |  65 +++++++++++
>  9 files changed, 322 insertions(+), 11 deletions(-)
>  create mode 100644 arch/riscv/kernel/perf_event.c
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 4b7bfea28cd7..02b2a40a3647 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
>  The default value is 8.
>  
>  
> -perf_user_access (arm64 only)
> -=================================
> +perf_user_access (arm64 and riscv only)
> +=======================================
> +
> +Controls user space access for reading perf event counters.
>  
> -Controls user space access for reading perf event counters. When set to 1,
> -user space can read performance monitor counter registers directly.
> +arm64
> +=====
>  
>  The default value is 0 (access disabled).
> +When set to 1, user space can read performance monitor counter registers
> +directly.
>  
>  See Documentation/arm64/perf.rst for more information.
>  
> +riscv
> +=====
> +
> +When set to 0, user access is disabled.
> +
> +When set to 1, user space can read performance monitor counter registers
> +directly only through perf, any direct access without perf intervention will
> +trigger an illegal instruction.
> +
> +The default value is 2, it enables the legacy mode, that is user space has
> +direct access to cycle, time and insret CSRs only.

I think this default value should be a Kconfig symbol, allowing kernels to
be built with a secure default.
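
A rough, untested sketch of what that could look like, written as plain C so
it can be compiled outside the kernel. CONFIG_RISCV_PMU_USER_ACCESS_DEFAULT
is a hypothetical symbol name, and the two helper functions are illustrative
only; the mode values and the SCOUNTEREN masks are the ones used in this patch.

```c
/* Mode values as defined in the patch. */
#define SYSCTL_NO_USER_ACCESS	0
#define SYSCTL_USER_ACCESS	1
#define SYSCTL_LEGACY		2

/*
 * Hypothetical Kconfig "int" option. When the build does not provide it,
 * fall back to the default the patch uses today (legacy mode).
 */
#ifndef CONFIG_RISCV_PMU_USER_ACCESS_DEFAULT
#define CONFIG_RISCV_PMU_USER_ACCESS_DEFAULT	SYSCTL_LEGACY
#endif

static int sysctl_perf_user_access = CONFIG_RISCV_PMU_USER_ACCESS_DEFAULT;

/*
 * SCOUNTEREN value the patch writes for a given mode:
 * bit 0 = CYCLE, bit 1 = TIME, bit 2 = INSTRET.
 */
unsigned long scounteren_for_mode(int mode)
{
	return mode == SYSCTL_LEGACY ? 0x7 : 0x2;
}

/* Mask that would be written at CPU bring-up with the built-in default. */
unsigned long scounteren_default(void)
{
	return scounteren_for_mode(sysctl_perf_user_access);
}
```

A kernel built with the symbol set to 1 would then start with only TIME
exposed, and the sysctl could still be used to switch back to legacy mode.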

>  
>  pid_max
>  =======
> diff --git a/arch/riscv/include/asm/perf_event.h b/arch/riscv/include/asm/perf_event.h
> index d42c901f9a97..9fdfdd9dc92d 100644
> --- a/arch/riscv/include/asm/perf_event.h
> +++ b/arch/riscv/include/asm/perf_event.h
> @@ -9,5 +9,8 @@
>  #define _ASM_RISCV_PERF_EVENT_H
>  
>  #include <linux/perf_event.h>
> +
> +#define PERF_EVENT_FLAG_LEGACY	1
> +
>  #define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs
>  #endif /* _ASM_RISCV_PERF_EVENT_H */
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index aa22f87faeae..9ae951b07847 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
>  
>  obj-$(CONFIG_TRACE_IRQFLAGS)	+= trace_irq.o
>  
> -obj-$(CONFIG_PERF_EVENTS)	+= perf_callchain.o
> +obj-$(CONFIG_PERF_EVENTS)	+= perf_callchain.o perf_event.o
>  obj-$(CONFIG_HAVE_PERF_REGS)	+= perf_regs.o
>  obj-$(CONFIG_RISCV_SBI)		+= sbi.o
>  ifeq ($(CONFIG_RISCV_SBI), y)
> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> new file mode 100644
> index 000000000000..4a75ab628bfb
> --- /dev/null
> +++ b/arch/riscv/kernel/perf_event.c
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <linux/perf/riscv_pmu.h>
> +#include <linux/sched_clock.h>
> +
> +void arch_perf_update_userpage(struct perf_event *event,
> +			       struct perf_event_mmap_page *userpg, u64 now)
> +{
> +	struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +	struct clock_read_data *rd;
> +	unsigned int seq;
> +	u64 ns;
> +
> +	userpg->cap_user_time = 0;
> +	userpg->cap_user_time_zero = 0;
> +	userpg->cap_user_time_short = 0;
> +	userpg->cap_user_rdpmc =
> +		!!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> +
> +	/*
> +	 * The counters are 64-bit but the priv spec doesn't mandate all the
> +	 * bits to be implemented: that's why, counter width can vary based on
> +	 * the cpu vendor.
> +	 */
> +	userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
> +
> +	do {
> +		rd = sched_clock_read_begin(&seq);
> +
> +		userpg->time_mult = rd->mult;
> +		userpg->time_shift = rd->shift;
> +		userpg->time_zero = rd->epoch_ns;
> +		userpg->time_cycles = rd->epoch_cyc;
> +		userpg->time_mask = rd->sched_clock_mask;
> +
> +		/*
> +		 * Subtract the cycle base, such that software that
> +		 * doesn't know about cap_user_time_short still 'works'
> +		 * assuming no wraps.
> +		 */
> +		ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> +		userpg->time_zero -= ns;
> +
> +	} while (sched_clock_read_retry(seq));
> +
> +	userpg->time_offset = userpg->time_zero - now;
> +
> +	/*
> +	 * time_shift is not expected to be greater than 31 due to
> +	 * the original published conversion algorithm shifting a
> +	 * 32-bit value (now specifies a 64-bit value) - refer
> +	 * perf_event_mmap_page documentation in perf_event.h.
> +	 */
> +	if (userpg->time_shift == 32) {
> +		userpg->time_shift = 31;
> +		userpg->time_mult >>= 1;
> +	}
> +
> +	/*
> +	 * Internal timekeeping for enabled/running/stopped times
> +	 * is always computed with the sched_clock.
> +	 */
> +	userpg->cap_user_time = 1;
> +	userpg->cap_user_time_zero = 1;
> +	userpg->cap_user_time_short = 1;
> +}
> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> index ebca5eab9c9b..12675ee1123c 100644
> --- a/drivers/perf/riscv_pmu.c
> +++ b/drivers/perf/riscv_pmu.c
> @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
>  
>  	local64_set(&hwc->prev_count, (u64)-left);
>  
> +	perf_event_update_userpage(event);
> +
>  	return overflow;
>  }
>  
> @@ -283,6 +285,43 @@ static int riscv_pmu_event_init(struct perf_event *event)
>  	return 0;
>  }
>  
> +static int riscv_pmu_event_idx(struct perf_event *event)
> +{
> +	struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> +	if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> +		return 0;
> +
> +	/*
> +	 * cycle and instret can either be retrieved from their fixed counters
> +	 * or from programmable counters, the latter being the preferred way
> +	 * since cycle and instret counters do not support sampling.
> +	 */
> +
> +	return rvpmu->csr_index(event) + 1;
> +}
> +
> +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> +{
> +	/*
> +	 * The user mmapped the event to directly access it: this is where
> +	 * we determine based on sysctl_perf_user_access if we grant userspace
> +	 * the direct access to this event. That means that within the same
> +	 * task, some events may be directly accessible and some other may not,
> +	 * if the user changes the value of sysctl_perf_user_accesss in the
> +	 * meantime.
> +	 */
> +	struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> +
> +	event->hw.flags |= rvpmu->event_flags(event);
> +	perf_event_update_userpage(event);
> +}
> +
> +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> +{
> +	event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
>  struct riscv_pmu *riscv_pmu_alloc(void)
>  {
>  	struct riscv_pmu *pmu;
> @@ -307,6 +346,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
>  	}
>  	pmu->pmu = (struct pmu) {
>  		.event_init	= riscv_pmu_event_init,
> +		.event_mapped	= riscv_pmu_event_mapped,
> +		.event_unmapped	= riscv_pmu_event_unmapped,
> +		.event_idx	= riscv_pmu_event_idx,
>  		.add		= riscv_pmu_add,
>  		.del		= riscv_pmu_del,
>  		.start		= riscv_pmu_start,
> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> index 0d8c9d8849ee..35c4c9097a0f 100644
> --- a/drivers/perf/riscv_pmu_legacy.c
> +++ b/drivers/perf/riscv_pmu_legacy.c
> @@ -74,6 +74,21 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
>  	local64_set(&hwc->prev_count, initial_val);
>  }
>  
> +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
> +{
> +	return event->hw.idx;
> +}
> +
> +static int pmu_legacy_event_flags(struct perf_event *event)
> +{
> +	/* In legacy mode, the first 3 CSRs are available. */
> +	if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> +	    event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> +		return 0;
> +
> +	return PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
>  /*
>   * This is just a simple implementation to allow legacy implementations
>   * compatible with new RISC-V PMU driver framework.
> @@ -94,6 +109,8 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
>  	pmu->ctr_get_width = NULL;
>  	pmu->ctr_clear_idx = NULL;
>  	pmu->ctr_read = pmu_legacy_read_ctr;
> +	pmu->event_flags = pmu_legacy_event_flags;
> +	pmu->csr_index = pmu_legacy_csr_index;
>  
>  	perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
>  }
> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> index 70cb50fd41c2..af7f3128b6b8 100644
> --- a/drivers/perf/riscv_pmu_sbi.c
> +++ b/drivers/perf/riscv_pmu_sbi.c
> @@ -24,6 +24,10 @@
>  #include <asm/sbi.h>
>  #include <asm/hwcap.h>
>  
> +#define SYSCTL_NO_USER_ACCESS	0
> +#define SYSCTL_USER_ACCESS	1
> +#define SYSCTL_LEGACY		2
> +
>  PMU_FORMAT_ATTR(event, "config:0-47");
>  PMU_FORMAT_ATTR(firmware, "config:63");
>  
> @@ -43,6 +47,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
>  	NULL,
>  };
>  
> +/* Allow legacy access by default */
> +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
> +
>  /*
>   * RISC-V doesn't have heterogeneous harts yet. This need to be part of
>   * per_cpu in case of harts with different pmu counters
> @@ -301,6 +308,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
>  }
>  EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
>  
> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> +{
> +	return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> +}
> +
>  static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
>  {
>  	unsigned long cflags = 0;
> @@ -329,18 +341,30 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
>  	struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
>  	struct sbiret ret;
>  	int idx;
> -	uint64_t cbase = 0;
> +	uint64_t cbase = 0, cmask = rvpmu->cmask;
>  	unsigned long cflags = 0;
>  
>  	cflags = pmu_sbi_get_filter_flags(event);
> +
> +	/* In legacy mode, we have to force the fixed counters for those events */
> +	if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> +		if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> +			cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> +			cmask = 1;
> +		} else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> +			cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> +			cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> +		}
> +	}
> +
>  	/* retrieve the available counter index */
>  #if defined(CONFIG_32BIT)
>  	ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> -			rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> +			cmask, cflags, hwc->event_base, hwc->config,
>  			hwc->config >> 32);
>  #else
>  	ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> -			rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> +			cmask, cflags, hwc->event_base, hwc->config, 0);
>  #endif
>  	if (ret.error) {
>  		pr_debug("Not able to find a counter for event %lx config %llx\n",
> @@ -490,6 +514,11 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
>  	if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
>  		pr_err("Starting counter idx %d failed with error %d\n",
>  			hwc->idx, sbi_err_map_linux_errno(ret.error));
> +
> +	if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
> +	    event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
> +		csr_write(CSR_SCOUNTEREN,
> +			  csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
>  }
>  
>  static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> @@ -497,6 +526,11 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>  	struct sbiret ret;
>  	struct hw_perf_event *hwc = &event->hw;
>  
> +	if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
> +	    event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
> +		csr_write(CSR_SCOUNTEREN,
> +			  csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> +
>  	ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
>  	if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
>  		flag != SBI_PMU_STOP_FLAG_RESET)
> @@ -704,10 +738,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
>  	struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
>  
>  	/*
> -	 * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> -	 * as is necessary to maintain uABI compatibility.
> +	 * We keep enabling userspace access to CYCLE, TIME and INSRET via the
> +	 * legacy option but that will be removed in the future.

Will it? The documentation hunk didn't mention that value 2 was deprecated.

>  	 */
> -	csr_write(CSR_SCOUNTEREN, 0x7);
> +	if (sysctl_perf_user_access == SYSCTL_LEGACY)
> +		csr_write(CSR_SCOUNTEREN, 0x7);
> +	else
> +		csr_write(CSR_SCOUNTEREN, 0x2);
>  
>  	/* Stop all the counters so that they can be enabled from perf */
>  	pmu_sbi_stop_all(pmu);
> @@ -851,6 +888,66 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
>  	cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>  }
>  
> +static int pmu_sbi_event_flags(struct perf_event *event)
> +{
> +	if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> +		return 0;
> +
> +	/* In legacy mode, the first 3 CSRs are available. */
> +	if (sysctl_perf_user_access == SYSCTL_LEGACY) {
> +		int flags = PERF_EVENT_FLAG_LEGACY;
> +
> +		if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES ||
> +		    event->attr.config == PERF_COUNT_HW_INSTRUCTIONS)
> +			flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> +
> +		return flags;
> +	}
> +
> +	return PERF_EVENT_FLAG_USER_READ_CNT;
> +}
> +
> +static void riscv_pmu_update_counter_access(void *info)
> +{
> +	if (sysctl_perf_user_access == SYSCTL_LEGACY)
> +		csr_write(CSR_SCOUNTEREN, 0x7);
> +	else
> +		csr_write(CSR_SCOUNTEREN, 0x2);
> +}
> +
> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> +					      int write, void *buffer,
> +					      size_t *lenp, loff_t *ppos)
> +{
> +	int prev = sysctl_perf_user_access;
> +	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +
> +	/*
> +	 * Test against the previous value since we clear SCOUNTEREN when
> +	 * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> +	 * not do that if that was already the case.
> +	 */
> +	if (ret || !write || prev == sysctl_perf_user_access)
> +		return ret;
> +
> +	on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
> +
> +	return 0;
> +}
> +
> +static struct ctl_table sbi_pmu_sysctl_table[] = {
> +	{
> +		.procname       = "perf_user_access",
> +		.data		= &sysctl_perf_user_access,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler	= riscv_pmu_proc_user_access_handler,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_TWO,
> +	},
> +	{ }
> +};
> +
>  static int pmu_sbi_device_probe(struct platform_device *pdev)
>  {
>  	struct riscv_pmu *pmu = NULL;
> @@ -888,6 +985,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>  	pmu->ctr_get_width = pmu_sbi_ctr_get_width;
>  	pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
>  	pmu->ctr_read = pmu_sbi_ctr_read;
> +	pmu->event_flags = pmu_sbi_event_flags;
> +	pmu->csr_index = pmu_sbi_csr_index;
>  
>  	ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>  	if (ret)
> @@ -901,6 +1000,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>  	if (ret)
>  		goto out_unregister;
>  
> +	register_sysctl("kernel", sbi_pmu_sysctl_table);
> +
>  	return 0;
>  
>  out_unregister:
> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> index 9f70d94942e0..ba19634d815c 100644
> --- a/include/linux/perf/riscv_pmu.h
> +++ b/include/linux/perf/riscv_pmu.h
> @@ -12,6 +12,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/ptrace.h>
>  #include <linux/interrupt.h>
> +#include <asm/perf_event.h>
>  
>  #ifdef CONFIG_RISCV_PMU
>  
> @@ -55,6 +56,8 @@ struct riscv_pmu {
>  	void		(*ctr_start)(struct perf_event *event, u64 init_val);
>  	void		(*ctr_stop)(struct perf_event *event, unsigned long flag);
>  	int		(*event_map)(struct perf_event *event, u64 *config);
> +	int		(*event_flags)(struct perf_event *event);
> +	uint8_t		(*csr_index)(struct perf_event *event);
>  
>  	struct cpu_hw_events	__percpu *hw_events;
>  	struct hlist_node	node;
> diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> index 0d1634cedf44..18f2abb1584a 100644
> --- a/tools/lib/perf/mmap.c
> +++ b/tools/lib/perf/mmap.c
> @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
>  
>  static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
>  
> +#elif defined(__riscv) && __riscv_xlen == 64

It's enough to just check __riscv_xlen.
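
The reason, for anyone following along: inside #if/#elif, an identifier that
is not a defined macro evaluates to 0, so the comparison is simply false on
non-RISC-V targets. The small sketch below shows that the short and long forms
always agree; SHORT_CHECK, LONG_CHECK and xlen_checks_agree() are made-up names
for illustration. Note that a build using -Wundef would warn on the short form.

```c
/* Short form: relies on undefined identifiers evaluating to 0 in #if. */
#if __riscv_xlen == 64
#define SHORT_CHECK	1
#else
#define SHORT_CHECK	0
#endif

/* Long form, as written in the patch. */
#if defined(__riscv) && __riscv_xlen == 64
#define LONG_CHECK	1
#else
#define LONG_CHECK	0
#endif

/* Returns 1 when both preprocessor checks selected the same branch. */
int xlen_checks_agree(void)
{
	return SHORT_CHECK == LONG_CHECK;
}
```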

> +
> +#define CSR_CYCLE	0xc00
> +#define CSR_TIME	0xc01
> +#define CSR_CYCLEH	0xc80
> +
> +#define csr_read(csr)						\
> +({								\
> +	register unsigned long __v;				\
> +		__asm__ __volatile__ ("csrr %0, " #csr		\
> +		 : "=r" (__v) :					\
> +		 : "memory");					\
> +		 __v;						\
> +})
> +
> +static unsigned long csr_read_num(int csr_num)
> +{
> +#define switchcase_csr_read(__csr_num, __val)           {\
> +	case __csr_num:                                 \
> +		__val = csr_read(__csr_num);            \
> +		break; }
> +#define switchcase_csr_read_2(__csr_num, __val)         {\
> +	switchcase_csr_read(__csr_num + 0, __val)        \
> +	switchcase_csr_read(__csr_num + 1, __val)}
> +#define switchcase_csr_read_4(__csr_num, __val)         {\
> +	switchcase_csr_read_2(__csr_num + 0, __val)      \
> +	switchcase_csr_read_2(__csr_num + 2, __val)}
> +#define switchcase_csr_read_8(__csr_num, __val)         {\
> +	switchcase_csr_read_4(__csr_num + 0, __val)      \
> +	switchcase_csr_read_4(__csr_num + 4, __val)}
> +#define switchcase_csr_read_16(__csr_num, __val)        {\
> +	switchcase_csr_read_8(__csr_num + 0, __val)      \
> +	switchcase_csr_read_8(__csr_num + 8, __val)}
> +#define switchcase_csr_read_32(__csr_num, __val)        {\
> +	switchcase_csr_read_16(__csr_num + 0, __val)     \
> +	switchcase_csr_read_16(__csr_num + 16, __val)}
> +
> +	unsigned long ret = 0;
> +
> +	switch (csr_num) {
> +	switchcase_csr_read_32(CSR_CYCLE, ret)
> +	switchcase_csr_read_32(CSR_CYCLEH, ret)
> +	default :
               ^ extra space

> +		break;
> +	}
> +
> +	return ret;
> +#undef switchcase_csr_read_32
> +#undef switchcase_csr_read_16
> +#undef switchcase_csr_read_8
> +#undef switchcase_csr_read_4
> +#undef switchcase_csr_read_2
> +#undef switchcase_csr_read
> +}
> +
> +static u64 read_perf_counter(unsigned int counter)
> +{
> +	return csr_read_num(CSR_CYCLE + counter);
> +}
> +
> +static u64 read_timestamp(void)
> +{
> +	return csr_read_num(CSR_TIME);
> +}
> +
>  #else
>  static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
>  static u64 read_timestamp(void) { return 0; }
> -- 
> 2.37.2
>

There is a lot going on in this patch. It'd be easier to review if it were
broken up a bit, e.g. the import of the arm64 code, the tools/lib/perf/mmap.c
hunk, and whatever else makes sense.
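
For anyone reviewing the tools/lib/perf/mmap.c hunk on its own, here is a
minimal sketch of how the raw CSR value is meant to be combined with the
mmap page fields, as I understand the read sequence documented in
perf_event.h: the raw value is sign-extended from pmc_width bits and added to
the published offset. perf_mmap_count() is a hypothetical helper name, and the
seqlock retry loop around the read is left out.

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value of `width` bits (1..64) and add it to the
 * offset published in the mmap page. Only unsigned arithmetic is used, so
 * the result is well defined for every input.
 */
uint64_t perf_mmap_count(uint64_t offset, uint64_t raw, unsigned int width)
{
	uint64_t sign = 1ULL << (width - 1);
	uint64_t mask = (width == 64) ? ~0ULL : (1ULL << width) - 1;

	raw &= mask;				/* keep the implemented bits */
	return offset + ((raw ^ sign) - sign);	/* sign-extend, then add */
}
```

With a 48-bit counter, a raw value with all 48 bits set is read as -1, so an
offset of 10 yields a count of 9.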

Thanks,
drew


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-26 12:57   ` Andrew Jones
@ 2023-04-26 13:17     ` Alexandre Ghiti
  2023-04-26 13:25       ` Andrew Jones
  2023-05-09 12:24       ` Emil Renner Berthing
  0 siblings, 2 replies; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-26 13:17 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel

On Wed, Apr 26, 2023 at 2:57 PM Andrew Jones <ajones@ventanamicro.com> wrote:
>
> On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
> > We used to unconditionally expose the cycle and instret csrs to
> > userspace, which gives rise to security concerns.
> >
> > So only allow access to hw counters from userspace through the perf
> > framework which will handle context switches, per-task events, etc. But
> > as we cannot break userspace, we give the user the choice to go back to
> > the previous behaviour by setting the sysctl perf_user_access.
> >
> > We also introduce a means to directly map the hardware counters to
> > userspace, thus avoiding the need for syscalls whenever an application
> > wants to access counters values.
> >
> > Note that arch_perf_update_userpage is a copy of arm64 code.
> >
> > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > ---
> >  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
> >  arch/riscv/include/asm/perf_event.h         |   3 +
> >  arch/riscv/kernel/Makefile                  |   2 +-
> >  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
> >  drivers/perf/riscv_pmu.c                    |  42 ++++++++
> >  drivers/perf/riscv_pmu_legacy.c             |  17 +++
> >  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
> >  include/linux/perf/riscv_pmu.h              |   3 +
> >  tools/lib/perf/mmap.c                       |  65 +++++++++++
> >  9 files changed, 322 insertions(+), 11 deletions(-)
> >  create mode 100644 arch/riscv/kernel/perf_event.c
> >
> > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > index 4b7bfea28cd7..02b2a40a3647 100644
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> >  The default value is 8.
> >
> >
> > -perf_user_access (arm64 only)
> > -=================================
> > +perf_user_access (arm64 and riscv only)
> > +=======================================
> > +
> > +Controls user space access for reading perf event counters.
> >
> > -Controls user space access for reading perf event counters. When set to 1,
> > -user space can read performance monitor counter registers directly.
> > +arm64
> > +=====
> >
> >  The default value is 0 (access disabled).
> > +When set to 1, user space can read performance monitor counter registers
> > +directly.
> >
> >  See Documentation/arm64/perf.rst for more information.
> >
> > +riscv
> > +=====
> > +
> > +When set to 0, user access is disabled.
> > +
> > +When set to 1, user space can read performance monitor counter registers
> > +directly only through perf, any direct access without perf intervention will
> > +trigger an illegal instruction.
> > +
> > +The default value is 2, it enables the legacy mode, that is user space has
> > +direct access to cycle, time and insret CSRs only.
>
> I think this default value should be a Kconfig symbol, allowing kernels to
> be built with a secure default.

Actually I was more in favor of making the default 1 (i.e. the secure
option) and letting the distros deal with the legacy mode (via a sysctl
parameter on the command line) until user space has been fixed: does
that make sense?

>
> >
> >  pid_max
> >  =======
> > diff --git a/arch/riscv/include/asm/perf_event.h b/arch/riscv/include/asm/perf_event.h
> > index d42c901f9a97..9fdfdd9dc92d 100644
> > --- a/arch/riscv/include/asm/perf_event.h
> > +++ b/arch/riscv/include/asm/perf_event.h
> > @@ -9,5 +9,8 @@
> >  #define _ASM_RISCV_PERF_EVENT_H
> >
> >  #include <linux/perf_event.h>
> > +
> > +#define PERF_EVENT_FLAG_LEGACY       1
> > +
> >  #define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs
> >  #endif /* _ASM_RISCV_PERF_EVENT_H */
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index aa22f87faeae..9ae951b07847 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE)        += mcount-dyn.o
> >
> >  obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o
> >
> > -obj-$(CONFIG_PERF_EVENTS)    += perf_callchain.o
> > +obj-$(CONFIG_PERF_EVENTS)    += perf_callchain.o perf_event.o
> >  obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
> >  obj-$(CONFIG_RISCV_SBI)              += sbi.o
> >  ifeq ($(CONFIG_RISCV_SBI), y)
> > diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> > new file mode 100644
> > index 000000000000..4a75ab628bfb
> > --- /dev/null
> > +++ b/arch/riscv/kernel/perf_event.c
> > @@ -0,0 +1,65 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +#include <linux/perf/riscv_pmu.h>
> > +#include <linux/sched_clock.h>
> > +
> > +void arch_perf_update_userpage(struct perf_event *event,
> > +                            struct perf_event_mmap_page *userpg, u64 now)
> > +{
> > +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +     struct clock_read_data *rd;
> > +     unsigned int seq;
> > +     u64 ns;
> > +
> > +     userpg->cap_user_time = 0;
> > +     userpg->cap_user_time_zero = 0;
> > +     userpg->cap_user_time_short = 0;
> > +     userpg->cap_user_rdpmc =
> > +             !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> > +
> > +     /*
> > +      * The counters are 64-bit but the priv spec doesn't mandate all the
> > +      * bits to be implemented, which is why the counter width can vary
> > +      * based on the CPU vendor.
> > +      */
> > +     userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
> > +
> > +     do {
> > +             rd = sched_clock_read_begin(&seq);
> > +
> > +             userpg->time_mult = rd->mult;
> > +             userpg->time_shift = rd->shift;
> > +             userpg->time_zero = rd->epoch_ns;
> > +             userpg->time_cycles = rd->epoch_cyc;
> > +             userpg->time_mask = rd->sched_clock_mask;
> > +
> > +             /*
> > +              * Subtract the cycle base, such that software that
> > +              * doesn't know about cap_user_time_short still 'works'
> > +              * assuming no wraps.
> > +              */
> > +             ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> > +             userpg->time_zero -= ns;
> > +
> > +     } while (sched_clock_read_retry(seq));
> > +
> > +     userpg->time_offset = userpg->time_zero - now;
> > +
> > +     /*
> > +      * time_shift is not expected to be greater than 31 due to
> > +      * the original published conversion algorithm shifting a
> > +      * 32-bit value (now specifies a 64-bit value) - refer
> > +      * perf_event_mmap_page documentation in perf_event.h.
> > +      */
> > +     if (userpg->time_shift == 32) {
> > +             userpg->time_shift = 31;
> > +             userpg->time_mult >>= 1;
> > +     }
> > +
> > +     /*
> > +      * Internal timekeeping for enabled/running/stopped times
> > +      * is always computed with the sched_clock.
> > +      */
> > +     userpg->cap_user_time = 1;
> > +     userpg->cap_user_time_zero = 1;
> > +     userpg->cap_user_time_short = 1;
> > +}
> > diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> > index ebca5eab9c9b..12675ee1123c 100644
> > --- a/drivers/perf/riscv_pmu.c
> > +++ b/drivers/perf/riscv_pmu.c
> > @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
> >
> >       local64_set(&hwc->prev_count, (u64)-left);
> >
> > +     perf_event_update_userpage(event);
> > +
> >       return overflow;
> >  }
> >
> > @@ -283,6 +285,43 @@ static int riscv_pmu_event_init(struct perf_event *event)
> >       return 0;
> >  }
> >
> > +static int riscv_pmu_event_idx(struct perf_event *event)
> > +{
> > +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +
> > +     if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > +             return 0;
> > +
> > +     /*
> > +      * cycle and instret can either be retrieved from their fixed counters
> > +      * or from programmable counters, the latter being the preferred way
> > +      * since cycle and instret counters do not support sampling.
> > +      */
> > +
> > +     return rvpmu->csr_index(event) + 1;
> > +}
> > +
> > +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > +     /*
> > +      * The user mmapped the event to directly access it: this is where
> > +      * we determine based on sysctl_perf_user_access if we grant userspace
> > +      * the direct access to this event. That means that within the same
> > +      * task, some events may be directly accessible and others may not,
> > +      * if the user changes the value of sysctl_perf_user_access in the
> > +      * meantime.
> > +      */
> > +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > +
> > +     event->hw.flags |= rvpmu->event_flags(event);
> > +     perf_event_update_userpage(event);
> > +}
> > +
> > +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > +{
> > +     event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> > +}
> > +
> >  struct riscv_pmu *riscv_pmu_alloc(void)
> >  {
> >       struct riscv_pmu *pmu;
> > @@ -307,6 +346,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> >       }
> >       pmu->pmu = (struct pmu) {
> >               .event_init     = riscv_pmu_event_init,
> > +             .event_mapped   = riscv_pmu_event_mapped,
> > +             .event_unmapped = riscv_pmu_event_unmapped,
> > +             .event_idx      = riscv_pmu_event_idx,
> >               .add            = riscv_pmu_add,
> >               .del            = riscv_pmu_del,
> >               .start          = riscv_pmu_start,
> > diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> > index 0d8c9d8849ee..35c4c9097a0f 100644
> > --- a/drivers/perf/riscv_pmu_legacy.c
> > +++ b/drivers/perf/riscv_pmu_legacy.c
> > @@ -74,6 +74,21 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
> >       local64_set(&hwc->prev_count, initial_val);
> >  }
> >
> > +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
> > +{
> > +     return event->hw.idx;
> > +}
> > +
> > +static int pmu_legacy_event_flags(struct perf_event *event)
> > +{
> > +     /* In legacy mode, the first 3 CSRs are available. */
> > +     if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> > +         event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> > +             return 0;
> > +
> > +     return PERF_EVENT_FLAG_USER_READ_CNT;
> > +}
> > +
> >  /*
> >   * This is just a simple implementation to allow legacy implementations
> >   * compatible with new RISC-V PMU driver framework.
> > @@ -94,6 +109,8 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
> >       pmu->ctr_get_width = NULL;
> >       pmu->ctr_clear_idx = NULL;
> >       pmu->ctr_read = pmu_legacy_read_ctr;
> > +     pmu->event_flags = pmu_legacy_event_flags;
> > +     pmu->csr_index = pmu_legacy_csr_index;
> >
> >       perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> >  }
> > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > index 70cb50fd41c2..af7f3128b6b8 100644
> > --- a/drivers/perf/riscv_pmu_sbi.c
> > +++ b/drivers/perf/riscv_pmu_sbi.c
> > @@ -24,6 +24,10 @@
> >  #include <asm/sbi.h>
> >  #include <asm/hwcap.h>
> >
> > +#define SYSCTL_NO_USER_ACCESS        0
> > +#define SYSCTL_USER_ACCESS   1
> > +#define SYSCTL_LEGACY                2
> > +
> >  PMU_FORMAT_ATTR(event, "config:0-47");
> >  PMU_FORMAT_ATTR(firmware, "config:63");
> >
> > @@ -43,6 +47,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> >       NULL,
> >  };
> >
> > +/* Allow legacy access by default */
> > +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
> > +
> >  /*
> >   * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> >   * per_cpu in case of harts with different pmu counters
> > @@ -301,6 +308,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> >  }
> >  EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
> >
> > +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> > +{
> > +     return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> > +}
> > +
> >  static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> >  {
> >       unsigned long cflags = 0;
> > @@ -329,18 +341,30 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> >       struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> >       struct sbiret ret;
> >       int idx;
> > -     uint64_t cbase = 0;
> > +     uint64_t cbase = 0, cmask = rvpmu->cmask;
> >       unsigned long cflags = 0;
> >
> >       cflags = pmu_sbi_get_filter_flags(event);
> > +
> > +     /* In legacy mode, we have to force the fixed counters for those events */
> > +     if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> > +             if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> > +                     cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > +                     cmask = 1;
> > +             } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> > +                     cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > +                     cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> > +             }
> > +     }
> > +
> >       /* retrieve the available counter index */
> >  #if defined(CONFIG_32BIT)
> >       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > -                     rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> > +                     cmask, cflags, hwc->event_base, hwc->config,
> >                       hwc->config >> 32);
> >  #else
> >       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > -                     rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> > +                     cmask, cflags, hwc->event_base, hwc->config, 0);
> >  #endif
> >       if (ret.error) {
> >               pr_debug("Not able to find a counter for event %lx config %llx\n",
> > @@ -490,6 +514,11 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> >       if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> >               pr_err("Starting counter idx %d failed with error %d\n",
> >                       hwc->idx, sbi_err_map_linux_errno(ret.error));
> > +
> > +     if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
> > +         event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
> > +             csr_write(CSR_SCOUNTEREN,
> > +                       csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> >  }
> >
> >  static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > @@ -497,6 +526,11 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> >       struct sbiret ret;
> >       struct hw_perf_event *hwc = &event->hw;
> >
> > +     if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
> > +         event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
> > +             csr_write(CSR_SCOUNTEREN,
> > +                       csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> > +
> >       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> >       if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> >               flag != SBI_PMU_STOP_FLAG_RESET)
> > @@ -704,10 +738,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> >       struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> >
> >       /*
> > -      * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> > -      * as is necessary to maintain uABI compatibility.
> > +      * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> > +      * legacy option but that will be removed in the future.
>
> Will it? The documentation hunk didn't mention that value 2 was deprecated.

You're right, I'll add that to the documentation too, thanks.
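
For illustration, the per-CPU SCOUNTEREN value selected by this hunk can be modelled as a small standalone function. This is only a user-space sketch, not kernel code: the function name and the SCOUNTEREN_* macros are made up for the example, while the bit positions (CY = bit 0, TM = bit 1, IR = bit 2) follow the RISC-V privileged spec and the sysctl values are the ones defined in the patch.

```c
#include <stdint.h>

/* Values of the perf_user_access sysctl, as defined in the patch. */
enum {
	SYSCTL_NO_USER_ACCESS = 0,
	SYSCTL_USER_ACCESS    = 1,
	SYSCTL_LEGACY         = 2,
};

/* scounteren bits for the three fixed counters (names are illustrative). */
#define SCOUNTEREN_CY (1u << 0) /* cycle   */
#define SCOUNTEREN_TM (1u << 1) /* time    */
#define SCOUNTEREN_IR (1u << 2) /* instret */

/*
 * Hypothetical helper mirroring the quoted hunk: legacy mode exposes
 * cycle, time and instret (0x7); every other mode leaves only time
 * readable (0x2) until perf enables more bits for a mapped event.
 */
static uint32_t default_scounteren(int mode)
{
	if (mode == SYSCTL_LEGACY)
		return SCOUNTEREN_CY | SCOUNTEREN_TM | SCOUNTEREN_IR;

	return SCOUNTEREN_TM;
}
```

Note that in this model the time CSR stays readable in all three modes, which matches the 0x2 written in the non-legacy branch above.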

>
> >        */
> > -     csr_write(CSR_SCOUNTEREN, 0x7);
> > +     if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > +             csr_write(CSR_SCOUNTEREN, 0x7);
> > +     else
> > +             csr_write(CSR_SCOUNTEREN, 0x2);
> >
> >       /* Stop all the counters so that they can be enabled from perf */
> >       pmu_sbi_stop_all(pmu);
> > @@ -851,6 +888,66 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> >       cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> >  }
> >
> > +static int pmu_sbi_event_flags(struct perf_event *event)
> > +{
> > +     if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> > +             return 0;
> > +
> > +     /* In legacy mode, the first 3 CSRs are available. */
> > +     if (sysctl_perf_user_access == SYSCTL_LEGACY) {
> > +             int flags = PERF_EVENT_FLAG_LEGACY;
> > +
> > +             if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES ||
> > +                 event->attr.config == PERF_COUNT_HW_INSTRUCTIONS)
> > +                     flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> > +
> > +             return flags;
> > +     }
> > +
> > +     return PERF_EVENT_FLAG_USER_READ_CNT;
> > +}
> > +
> > +static void riscv_pmu_update_counter_access(void *info)
> > +{
> > +     if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > +             csr_write(CSR_SCOUNTEREN, 0x7);
> > +     else
> > +             csr_write(CSR_SCOUNTEREN, 0x2);
> > +}
> > +
> > +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> > +                                           int write, void *buffer,
> > +                                           size_t *lenp, loff_t *ppos)
> > +{
> > +     int prev = sysctl_perf_user_access;
> > +     int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > +
> > +     /*
> > +      * Test against the previous value since we clear SCOUNTEREN when
> > +      * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> > +      * not do that if that was already the case.
> > +      */
> > +     if (ret || !write || prev == sysctl_perf_user_access)
> > +             return ret;
> > +
> > +     on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
> > +
> > +     return 0;
> > +}
> > +
> > +static struct ctl_table sbi_pmu_sysctl_table[] = {
> > +     {
> > +             .procname       = "perf_user_access",
> > +             .data           = &sysctl_perf_user_access,
> > +             .maxlen         = sizeof(unsigned int),
> > +             .mode           = 0644,
> > +             .proc_handler   = riscv_pmu_proc_user_access_handler,
> > +             .extra1         = SYSCTL_ZERO,
> > +             .extra2         = SYSCTL_TWO,
> > +     },
> > +     { }
> > +};
> > +
> >  static int pmu_sbi_device_probe(struct platform_device *pdev)
> >  {
> >       struct riscv_pmu *pmu = NULL;
> > @@ -888,6 +985,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> >       pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> >       pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> >       pmu->ctr_read = pmu_sbi_ctr_read;
> > +     pmu->event_flags = pmu_sbi_event_flags;
> > +     pmu->csr_index = pmu_sbi_csr_index;
> >
> >       ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> >       if (ret)
> > @@ -901,6 +1000,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> >       if (ret)
> >               goto out_unregister;
> >
> > +     register_sysctl("kernel", sbi_pmu_sysctl_table);
> > +
> >       return 0;
> >
> >  out_unregister:
> > diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> > index 9f70d94942e0..ba19634d815c 100644
> > --- a/include/linux/perf/riscv_pmu.h
> > +++ b/include/linux/perf/riscv_pmu.h
> > @@ -12,6 +12,7 @@
> >  #include <linux/perf_event.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/interrupt.h>
> > +#include <asm/perf_event.h>
> >
> >  #ifdef CONFIG_RISCV_PMU
> >
> > @@ -55,6 +56,8 @@ struct riscv_pmu {
> >       void            (*ctr_start)(struct perf_event *event, u64 init_val);
> >       void            (*ctr_stop)(struct perf_event *event, unsigned long flag);
> >       int             (*event_map)(struct perf_event *event, u64 *config);
> > +     int             (*event_flags)(struct perf_event *event);
> > +     uint8_t         (*csr_index)(struct perf_event *event);
> >
> >       struct cpu_hw_events    __percpu *hw_events;
> >       struct hlist_node       node;
> > diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> > index 0d1634cedf44..18f2abb1584a 100644
> > --- a/tools/lib/perf/mmap.c
> > +++ b/tools/lib/perf/mmap.c
> > @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
> >
> >  static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
> >
> > +#elif defined(__riscv) && __riscv_xlen == 64
>
> It's enough to just check __riscv_xlen.

Right, thanks
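
As a quick sanity check of that suggestion: inside an #if expression the preprocessor replaces any identifier that is not a defined macro with 0, so the shorter guard is false on non-RISC-V targets without needing defined(__riscv). The sketch below compares the two forms; the RV64_*_GUARD macros and the helper functions are names invented for this example, and builds using -Wundef may warn on the short form.

```c
/* Short guard, as suggested in the review. */
#if __riscv_xlen == 64
#define RV64_SHORT_GUARD 1
#else
#define RV64_SHORT_GUARD 0
#endif

/* Long guard, as written in the patch. */
#if defined(__riscv) && __riscv_xlen == 64
#define RV64_LONG_GUARD 1
#else
#define RV64_LONG_GUARD 0
#endif

/* Expose the results so that both forms can be compared at run time. */
static int rv64_short_guard(void)
{
	return RV64_SHORT_GUARD;
}

static int rv64_long_guard(void)
{
	return RV64_LONG_GUARD;
}
```

Both helpers should return the same value on any target, which is the point of the simplification.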

>
> > +
> > +#define CSR_CYCLE    0xc00
> > +#define CSR_TIME     0xc01
> > +#define CSR_CYCLEH   0xc80
> > +
> > +#define csr_read(csr)                                                \
> > +({                                                           \
> > +     register unsigned long __v;                             \
> > +             __asm__ __volatile__ ("csrr %0, " #csr          \
> > +              : "=r" (__v) :                                 \
> > +              : "memory");                                   \
> > +              __v;                                           \
> > +})
> > +
> > +static unsigned long csr_read_num(int csr_num)
> > +{
> > +#define switchcase_csr_read(__csr_num, __val)           {\
> > +     case __csr_num:                                 \
> > +             __val = csr_read(__csr_num);            \
> > +             break; }
> > +#define switchcase_csr_read_2(__csr_num, __val)         {\
> > +     switchcase_csr_read(__csr_num + 0, __val)        \
> > +     switchcase_csr_read(__csr_num + 1, __val)}
> > +#define switchcase_csr_read_4(__csr_num, __val)         {\
> > +     switchcase_csr_read_2(__csr_num + 0, __val)      \
> > +     switchcase_csr_read_2(__csr_num + 2, __val)}
> > +#define switchcase_csr_read_8(__csr_num, __val)         {\
> > +     switchcase_csr_read_4(__csr_num + 0, __val)      \
> > +     switchcase_csr_read_4(__csr_num + 4, __val)}
> > +#define switchcase_csr_read_16(__csr_num, __val)        {\
> > +     switchcase_csr_read_8(__csr_num + 0, __val)      \
> > +     switchcase_csr_read_8(__csr_num + 8, __val)}
> > +#define switchcase_csr_read_32(__csr_num, __val)        {\
> > +     switchcase_csr_read_16(__csr_num + 0, __val)     \
> > +     switchcase_csr_read_16(__csr_num + 16, __val)}
> > +
> > +     unsigned long ret = 0;
> > +
> > +     switch (csr_num) {
> > +     switchcase_csr_read_32(CSR_CYCLE, ret)
> > +     switchcase_csr_read_32(CSR_CYCLEH, ret)
> > +     default :
>                ^ extra space
>

Thanks

> > +             break;
> > +     }
> > +
> > +     return ret;
> > +#undef switchcase_csr_read_32
> > +#undef switchcase_csr_read_16
> > +#undef switchcase_csr_read_8
> > +#undef switchcase_csr_read_4
> > +#undef switchcase_csr_read_2
> > +#undef switchcase_csr_read
> > +}
> > +
> > +static u64 read_perf_counter(unsigned int counter)
> > +{
> > +     return csr_read_num(CSR_CYCLE + counter);
> > +}
> > +
> > +static u64 read_timestamp(void)
> > +{
> > +     return csr_read_num(CSR_TIME);
> > +}
> > +
> >  #else
> >  static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
> >  static u64 read_timestamp(void) { return 0; }
> > --
> > 2.37.2
> >
>
> A lot is going on in this patch. It'd be easier to review if it was broken
> up a bit, e.g. the import of arm code, the tools/lib/perf/mmap.c hunk, and
> whatever else makes sense.

Ok, will do that in v2!
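
For readers of the tools/lib/perf/mmap.c hunk, here is a rough sketch of how user space is expected to consume the values published in the mmap page. It assumes the usual convention documented for perf_event_mmap_page: an index of 0 means "no direct access", otherwise the counter to read is index - 1, and the raw value is sign-extended from pmc_width bits before being added to the offset. The helper names are hypothetical, and the seqlock retry loop around the page reads is deliberately left out.

```c
#include <stdint.h>

#define CSR_CYCLE 0xc00 /* same value as in the quoted hunk */

/*
 * Hypothetical helper: translate the index from the mmap page into a CSR
 * number. riscv_pmu_event_idx() returns csr_index + 1, or 0 when direct
 * access is not granted; -1 here means "fall back to read()".
 */
static int csr_from_page_index(uint32_t index)
{
	return index ? (int)(CSR_CYCLE + index - 1) : -1;
}

/*
 * Hypothetical helper: sign-extend a raw counter of 'width' bits
 * (1 <= width <= 64) and add it to the offset from the mmap page.
 * This relies on arithmetic right shift of negative values, which
 * gcc and clang provide.
 */
static int64_t combine_count(int64_t offset, uint64_t raw, unsigned int width)
{
	int64_t pmc = (int64_t)(raw << (64 - width));

	pmc >>= 64 - width;
	return offset + pmc;
}
```

The sign extension matters because riscv_pmu_event_set_period() programs the counter with a negative start value (-left), so the raw CSR content only makes sense relative to the offset.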

>
> Thanks,
> drew

Thanks,

Alex


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-26 13:17     ` Alexandre Ghiti
@ 2023-04-26 13:25       ` Andrew Jones
  2023-04-29  6:19         ` Atish Patra
  2023-05-09 12:24       ` Emil Renner Berthing
  1 sibling, 1 reply; 26+ messages in thread
From: Andrew Jones @ 2023-04-26 13:25 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel

On Wed, Apr 26, 2023 at 03:17:01PM +0200, Alexandre Ghiti wrote:
> On Wed, Apr 26, 2023 at 2:57 PM Andrew Jones <ajones@ventanamicro.com> wrote:
> >
> > On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
> > > We used to unconditionally expose the cycle and instret CSRs to
> > > userspace, which gives rise to security concerns.
> > >
> > > So only allow access to hw counters from userspace through the perf
> > > framework, which will handle context switches, per-task events, etc. But
> > > as we cannot break userspace, we give the user the choice to go back to
> > > the previous behaviour by setting the sysctl perf_user_access.
> > >
> > > We also introduce a means to directly map the hardware counters to
> > > userspace, thus avoiding the need for syscalls whenever an application
> > > wants to access counter values.
> > >
> > > Note that arch_perf_update_userpage is a copy of arm64 code.
> > >
> > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
> > >  arch/riscv/include/asm/perf_event.h         |   3 +
> > >  arch/riscv/kernel/Makefile                  |   2 +-
> > >  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
> > >  drivers/perf/riscv_pmu.c                    |  42 ++++++++
> > >  drivers/perf/riscv_pmu_legacy.c             |  17 +++
> > >  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
> > >  include/linux/perf/riscv_pmu.h              |   3 +
> > >  tools/lib/perf/mmap.c                       |  65 +++++++++++
> > >  9 files changed, 322 insertions(+), 11 deletions(-)
> > >  create mode 100644 arch/riscv/kernel/perf_event.c
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 4b7bfea28cd7..02b2a40a3647 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> > >  The default value is 8.
> > >
> > >
> > > -perf_user_access (arm64 only)
> > > -=================================
> > > +perf_user_access (arm64 and riscv only)
> > > +=======================================
> > > +
> > > +Controls user space access for reading perf event counters.
> > >
> > > -Controls user space access for reading perf event counters. When set to 1,
> > > -user space can read performance monitor counter registers directly.
> > > +arm64
> > > +=====
> > >
> > >  The default value is 0 (access disabled).
> > > +When set to 1, user space can read performance monitor counter registers
> > > +directly.
> > >
> > >  See Documentation/arm64/perf.rst for more information.
> > >
> > > +riscv
> > > +=====
> > > +
> > > +When set to 0, user access is disabled.
> > > +
> > > +When set to 1, user space can read performance monitor counter registers
> > > +directly only through perf; any direct access without perf intervention will
> > > +trigger an illegal instruction exception.
> > > +
> > > +The default value is 2, which enables the legacy mode: user space has
> > > +direct access to the cycle, time and instret CSRs only.
> >
> > I think this default value should be a Kconfig symbol, allowing kernels to
> > be built with a secure default.
> 
> Actually I was more in favor of having the default to 1 (ie the secure
> option) and let the distros deal with the legacy mode (via a sysctl
> parameter on the command line) as long as user-space has not been
> fixed: does that make sense?

Yes, I'd prefer that too. I assumed the default was 2 in this patch
because we couldn't set it to 1 for some reason.
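
To make the difference between the three values concrete, the flag selection done by the quoted pmu_sbi_event_flags() can be modelled in user space as below. This is a sketch under assumptions: the function name and the boolean parameter are invented for the example, PERF_EVENT_FLAG_LEGACY is 1 as in the patch, and the value used for the user-read flag is a placeholder since its definition is not part of the quoted hunks.

```c
#include <stdbool.h>

/* Values of the perf_user_access sysctl, as defined in the patch. */
enum {
	SYSCTL_NO_USER_ACCESS = 0,
	SYSCTL_USER_ACCESS    = 1,
	SYSCTL_LEGACY         = 2,
};

#define FLAG_LEGACY        (1 << 0) /* PERF_EVENT_FLAG_LEGACY in the patch */
#define FLAG_USER_READ_CNT (1 << 1) /* placeholder value for this sketch */

/*
 * Model of the per-event decision taken when an event is mmapped:
 *  - mode 0: no event is readable from user space;
 *  - mode 1: any event mapped through perf is readable;
 *  - mode 2: only cycle and instret events are readable, and the event
 *            is marked legacy so the fixed counters get used.
 */
static int model_event_flags(int mode, bool is_cycles_or_instret)
{
	if (mode == SYSCTL_NO_USER_ACCESS)
		return 0;

	if (mode == SYSCTL_LEGACY) {
		int flags = FLAG_LEGACY;

		if (is_cycles_or_instret)
			flags |= FLAG_USER_READ_CNT;
		return flags;
	}

	return FLAG_USER_READ_CNT;
}
```

Seen this way, value 1 is the only mode where every mapped event is readable while unmapped counters stay inaccessible, which is why it is the "secure" candidate for the default.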

Thanks,
drew


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-26 13:25       ` Andrew Jones
@ 2023-04-29  6:19         ` Atish Patra
  2023-04-29  6:50           ` Atish Patra
  0 siblings, 1 reply; 26+ messages in thread
From: Atish Patra @ 2023-04-29  6:19 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc, linux-kernel, linux-perf-users, linux-riscv,
	linux-arm-kernel, David Abdurachmanov, Heinrich Schuchardt,
	Andreas Schwab, mafm, aurel32

On Wed, Apr 26, 2023 at 6:55 PM Andrew Jones <ajones@ventanamicro.com> wrote:
>
> On Wed, Apr 26, 2023 at 03:17:01PM +0200, Alexandre Ghiti wrote:
> > On Wed, Apr 26, 2023 at 2:57 PM Andrew Jones <ajones@ventanamicro.com> wrote:
> > >
> > > On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
> > > > We used to unconditionally expose the cycle and instret CSRs to
> > > > userspace, which gives rise to security concerns.
> > > >
> > > > So only allow access to hw counters from userspace through the perf
> > > > framework, which will handle context switches, per-task events, etc. But
> > > > as we cannot break userspace, we give the user the choice to go back to
> > > > the previous behaviour by setting the sysctl perf_user_access.
> > > >
> > > > We also introduce a means to directly map the hardware counters to
> > > > userspace, thus avoiding the need for syscalls whenever an application
> > > > wants to access counter values.
> > > >
> > > > Note that arch_perf_update_userpage is a copy of arm64 code.
> > > >
> > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > ---
> > > >  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
> > > >  arch/riscv/include/asm/perf_event.h         |   3 +
> > > >  arch/riscv/kernel/Makefile                  |   2 +-
> > > >  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
> > > >  drivers/perf/riscv_pmu.c                    |  42 ++++++++
> > > >  drivers/perf/riscv_pmu_legacy.c             |  17 +++
> > > >  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
> > > >  include/linux/perf/riscv_pmu.h              |   3 +
> > > >  tools/lib/perf/mmap.c                       |  65 +++++++++++
> > > >  9 files changed, 322 insertions(+), 11 deletions(-)
> > > >  create mode 100644 arch/riscv/kernel/perf_event.c
> > > >
> > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > > index 4b7bfea28cd7..02b2a40a3647 100644
> > > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > > @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> > > >  The default value is 8.
> > > >
> > > >
> > > > -perf_user_access (arm64 only)
> > > > -=================================
> > > > +perf_user_access (arm64 and riscv only)
> > > > +=======================================
> > > > +
> > > > +Controls user space access for reading perf event counters.
> > > >
> > > > -Controls user space access for reading perf event counters. When set to 1,
> > > > -user space can read performance monitor counter registers directly.
> > > > +arm64
> > > > +=====
> > > >
> > > >  The default value is 0 (access disabled).
> > > > +When set to 1, user space can read performance monitor counter registers
> > > > +directly.
> > > >
> > > >  See Documentation/arm64/perf.rst for more information.
> > > >
> > > > +riscv
> > > > +=====
> > > > +
> > > > +When set to 0, user access is disabled.
> > > > +
> > > > +When set to 1, user space can read performance monitor counter registers
> > > > +directly only through perf; any direct access without perf intervention will
> > > > +trigger an illegal instruction exception.
> > > > +
> > > > +The default value is 2, which enables the legacy mode: user space has
> > > > +direct access to the cycle, time and instret CSRs only.
> > >
> > > I think this default value should be a Kconfig symbol, allowing kernels to
> > > be built with a secure default.
> >
> > Actually I was more in favor of having the default to 1 (ie the secure
> > option) and let the distros deal with the legacy mode (via a sysctl
> > parameter on the command line) as long as user-space has not been
> > fixed: does that make sense?
>
> Yes, I'd prefer that too. I assumed the default was 2 in this patch
> because we couldn't set it to 1 for some reason.
>

I would prefer that too. However, it was set to 2 because a default of 1
would break user space applications that depend on the legacy behavior as
soon as the patches are upstream. That is why Palmer suggested keeping the
default value at 2.

+distro folks (cc'd)
If the distro maintainers can confirm that this would be a non-issue, I am
okay with setting the default to 1.


> Thanks,
> drew



-- 
Regards,
Atish


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-29  6:19         ` Atish Patra
@ 2023-04-29  6:50           ` Atish Patra
  0 siblings, 0 replies; 26+ messages in thread
From: Atish Patra @ 2023-04-29  6:50 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc, linux-kernel, linux-perf-users, linux-riscv,
	linux-arm-kernel, David Abdurachmanov, Heinrich Schuchardt,
	Andreas Schwab, mafm, aurel32

On Sat, Apr 29, 2023 at 11:49 AM Atish Patra <atishp@atishpatra.org> wrote:
>
> On Wed, Apr 26, 2023 at 6:55 PM Andrew Jones <ajones@ventanamicro.com> wrote:
> >
> > On Wed, Apr 26, 2023 at 03:17:01PM +0200, Alexandre Ghiti wrote:
> > > On Wed, Apr 26, 2023 at 2:57 PM Andrew Jones <ajones@ventanamicro.com> wrote:
> > > >
> > > > On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
> > > > > We used to unconditionally expose the cycle and instret CSRs to
> > > > > userspace, which gives rise to security concerns.
> > > > >
> > > > > So only allow access to hw counters from userspace through the perf
> > > > > framework, which will handle context switches, per-task events, etc. But
> > > > > as we cannot break userspace, we give the user the choice to go back to
> > > > > the previous behaviour by setting the sysctl perf_user_access.
> > > > >
> > > > > We also introduce a means to directly map the hardware counters to
> > > > > userspace, thus avoiding the need for syscalls whenever an application
> > > > > wants to access counter values.
> > > > >
> > > > > Note that arch_perf_update_userpage is a copy of arm64 code.
> > > > >
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > ---
> > > > >  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
> > > > >  arch/riscv/include/asm/perf_event.h         |   3 +
> > > > >  arch/riscv/kernel/Makefile                  |   2 +-
> > > > >  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
> > > > >  drivers/perf/riscv_pmu.c                    |  42 ++++++++
> > > > >  drivers/perf/riscv_pmu_legacy.c             |  17 +++
> > > > >  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
> > > > >  include/linux/perf/riscv_pmu.h              |   3 +
> > > > >  tools/lib/perf/mmap.c                       |  65 +++++++++++
> > > > >  9 files changed, 322 insertions(+), 11 deletions(-)
> > > > >  create mode 100644 arch/riscv/kernel/perf_event.c
> > > > >
> > > > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > > > index 4b7bfea28cd7..02b2a40a3647 100644
> > > > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > > > @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> > > > >  The default value is 8.
> > > > >
> > > > >
> > > > > -perf_user_access (arm64 only)
> > > > > -=================================
> > > > > +perf_user_access (arm64 and riscv only)
> > > > > +=======================================
> > > > > +
> > > > > +Controls user space access for reading perf event counters.
> > > > >
> > > > > -Controls user space access for reading perf event counters. When set to 1,
> > > > > -user space can read performance monitor counter registers directly.
> > > > > +arm64
> > > > > +=====
> > > > >
> > > > >  The default value is 0 (access disabled).
> > > > > +When set to 1, user space can read performance monitor counter registers
> > > > > +directly.
> > > > >
> > > > >  See Documentation/arm64/perf.rst for more information.
> > > > >
> > > > > +riscv
> > > > > +=====
> > > > > +
> > > > > +When set to 0, user access is disabled.
> > > > > +
> > > > > > +When set to 1, user space can read performance monitor counter registers
> > > > > > +directly only through perf; any direct access without perf intervention will
> > > > > > +trigger an illegal instruction exception.
> > > > > > +
> > > > > > +The default value is 2, which enables the legacy mode: user space has
> > > > > > +direct access to the cycle, time and instret CSRs only.
> > > >
> > > > I think this default value should be a Kconfig symbol, allowing kernels to
> > > > be built with a secure default.
> > >
> > > Actually I was more in favor of having the default to 1 (ie the secure
> > > option) and let the distros deal with the legacy mode (via a sysctl
> > > parameter on the command line) as long as user-space has not been
> > > fixed: does that make sense?
> >
> > Yes, I'd prefer that too. I assumed the default was 2 in this patch
> > because we couldn't set it to 1 for some reason.
> >
>
> I would prefer that too. However, it was set to 2 because a default of 1
> would break user space applications that depend on the legacy behavior as
> soon as the patches are upstream. That is why Palmer suggested keeping the
> default value at 2.
>
> +distro folks (cc'd)
> If the distro maintainer can confirm that this would be a non-issue, I am okay
> with setting the default to 1.
>

David Abdurachmanov reminded me that the ARM64 code sets the default to
zero: the upstream kernel doesn't even enable userspace access via perf by
default. The default on x86 is 1, though.
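
To make the defaults discussed above easier to check on a running system,
here is a minimal userspace sketch (not part of the patch). It assumes the
sysctl shows up as /proc/sys/kernel/perf_user_access, which follows from the
register_sysctl("kernel", ...) call in patch 4. The helper names
read_perf_user_access() and perf_user_access_name() are made up for
illustration, and the value descriptions follow the riscv wording in the
proposed documentation hunk.

```c
#include <stdio.h>

/* Values as described in the riscv section of the proposed documentation. */
enum {
	PUA_DISABLED  = 0,	/* no user access */
	PUA_PERF_ONLY = 1,	/* direct reads only for counters mapped via perf */
	PUA_LEGACY    = 2,	/* cycle, time and instret always readable */
};

/* Hypothetical helper: map a perf_user_access value to a short name. */
static const char *perf_user_access_name(int value)
{
	switch (value) {
	case PUA_DISABLED:
		return "disabled";
	case PUA_PERF_ONLY:
		return "perf-only";
	case PUA_LEGACY:
		return "legacy";
	default:
		return "unknown";
	}
}

/*
 * Hypothetical helper: read the sysctl through procfs.
 * Returns the value, or -1 if the file is missing or cannot be parsed
 * (for example on an architecture that does not provide this sysctl).
 */
static int read_perf_user_access(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Calling read_perf_user_access("/proc/sys/kernel/perf_user_access") and
passing the result to perf_user_access_name() would report which mode is
active; on arm64 only the values 0 and 1 are meaningful.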

+Rob Herring (who enabled the ARM64 support[1])
@Rob: If you can shed some light on the reasoning behind defaulting to
disabled, that would help us make a more informed decision.

[1] https://github.com/torvalds/linux/commit/e2012600810c9ded81f6f63a8d04781be3c300ad

>
> > Thanks,
> > drew
>
>
>
> --
> Regards,
> Atish



-- 
Regards,
Atish

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-26 13:17     ` Alexandre Ghiti
  2023-04-26 13:25       ` Andrew Jones
@ 2023-05-09 12:24       ` Emil Renner Berthing
  2023-05-09 13:40         ` Alexandre Ghiti
  1 sibling, 1 reply; 26+ messages in thread
From: Emil Renner Berthing @ 2023-05-09 12:24 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Andrew Jones, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel

On Wed, 26 Apr 2023 at 15:19, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> On Wed, Apr 26, 2023 at 2:57 PM Andrew Jones <ajones@ventanamicro.com> wrote:
> >
> > On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
> > > We used to unconditionally expose the cycle and instret csrs to
> > > userspace, which gives rise to security concerns.
> > >
> > > So only allow access to hw counters from userspace through the perf
> > > framework which will handle context switches, per-task events, etc. But
> > > as we cannot break userspace, we give the user the choice to go back to
> > > the previous behaviour by setting the sysctl perf_user_access.
> > >
> > > We also introduce a means to directly map the hardware counters to
> > > userspace, thus avoiding the need for syscalls whenever an application
> > > wants to access counters values.
> > >
> > > Note that arch_perf_update_userpage is a copy of arm64 code.
> > >
> > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > ---
> > >  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
> > >  arch/riscv/include/asm/perf_event.h         |   3 +
> > >  arch/riscv/kernel/Makefile                  |   2 +-
> > >  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
> > >  drivers/perf/riscv_pmu.c                    |  42 ++++++++
> > >  drivers/perf/riscv_pmu_legacy.c             |  17 +++
> > >  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
> > >  include/linux/perf/riscv_pmu.h              |   3 +
> > >  tools/lib/perf/mmap.c                       |  65 +++++++++++
> > >  9 files changed, 322 insertions(+), 11 deletions(-)
> > >  create mode 100644 arch/riscv/kernel/perf_event.c
> > >
> > > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> > > index 4b7bfea28cd7..02b2a40a3647 100644
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
> > >  The default value is 8.
> > >
> > >
> > > -perf_user_access (arm64 only)
> > > -=================================
> > > +perf_user_access (arm64 and riscv only)
> > > +=======================================
> > > +
> > > +Controls user space access for reading perf event counters.
> > >
> > > -Controls user space access for reading perf event counters. When set to 1,
> > > -user space can read performance monitor counter registers directly.
> > > +arm64
> > > +=====
> > >
> > >  The default value is 0 (access disabled).
> > > +When set to 1, user space can read performance monitor counter registers
> > > +directly.
> > >
> > >  See Documentation/arm64/perf.rst for more information.
> > >
> > > +riscv
> > > +=====
> > > +
> > > +When set to 0, user access is disabled.
> > > +
> > > > +When set to 1, user space can read performance monitor counter registers
> > > > +directly only through perf; any direct access without perf intervention will
> > > > +trigger an illegal instruction exception.
> > > > +
> > > > +The default value is 2, which enables the legacy mode: user space has
> > > > +direct access to the cycle, time and instret CSRs only.
> >
> > I think this default value should be a Kconfig symbol, allowing kernels to
> > be built with a secure default.
>
> Actually I was more in favor of having the default to 1 (ie the secure
> option) and let the distros deal with the legacy mode (via a sysctl
> parameter on the command line) as long as user-space has not been
> fixed: does that make sense?

With the Linux policy of not breaking userspace I wouldn't think
having anything but 2 as the default is ok. Is there a reason we can't
have a mode that allows both the legacy and perf interface?
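
For reference, the SCOUNTEREN handling in the quoted patch can be modelled
in plain C as below. This is only a sketch for discussion: the SYSCTL_*
values and the 0x7/0x2 masks are taken from the patch, while
scounteren_base(), scounteren_enable() and scounteren_disable() are
hypothetical helper names, and the real code does this with
csr_read()/csr_write() on each CPU.

```c
#include <stdint.h>

#define SYSCTL_NO_USER_ACCESS	0
#define SYSCTL_USER_ACCESS	1
#define SYSCTL_LEGACY		2

/*
 * Base value written at CPU bring-up and on a sysctl change:
 * CY|TM|IR (0x7) in legacy mode, TM only (0x2) otherwise.
 */
static uint32_t scounteren_base(int mode)
{
	return mode == SYSCTL_LEGACY ? 0x7 : 0x2;
}

/*
 * Bit set when a non-legacy, user-readable event is started.
 * csr_index is the counter's offset from CSR_CYCLE (0..31).
 */
static uint32_t scounteren_enable(uint32_t cur, unsigned int csr_index)
{
	return cur | (UINT32_C(1) << csr_index);
}

/* Bit cleared again when that event is stopped. */
static uint32_t scounteren_disable(uint32_t cur, unsigned int csr_index)
{
	return cur & ~(UINT32_C(1) << csr_index);
}
```

In this model a legacy-mode event never touches the register after the base
value is written, whereas in mode 1 a user-readable event flips only its own
bit at start and stop.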

> > >
> > >  pid_max
> > >  =======
> > > diff --git a/arch/riscv/include/asm/perf_event.h b/arch/riscv/include/asm/perf_event.h
> > > index d42c901f9a97..9fdfdd9dc92d 100644
> > > --- a/arch/riscv/include/asm/perf_event.h
> > > +++ b/arch/riscv/include/asm/perf_event.h
> > > @@ -9,5 +9,8 @@
> > >  #define _ASM_RISCV_PERF_EVENT_H
> > >
> > >  #include <linux/perf_event.h>
> > > +
> > > +#define PERF_EVENT_FLAG_LEGACY       1
> > > +
> > >  #define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs
> > >  #endif /* _ASM_RISCV_PERF_EVENT_H */
> > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > index aa22f87faeae..9ae951b07847 100644
> > > --- a/arch/riscv/kernel/Makefile
> > > +++ b/arch/riscv/kernel/Makefile
> > > @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE)        += mcount-dyn.o
> > >
> > >  obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o
> > >
> > > -obj-$(CONFIG_PERF_EVENTS)    += perf_callchain.o
> > > +obj-$(CONFIG_PERF_EVENTS)    += perf_callchain.o perf_event.o
> > >  obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
> > >  obj-$(CONFIG_RISCV_SBI)              += sbi.o
> > >  ifeq ($(CONFIG_RISCV_SBI), y)
> > > diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> > > new file mode 100644
> > > index 000000000000..4a75ab628bfb
> > > --- /dev/null
> > > +++ b/arch/riscv/kernel/perf_event.c
> > > @@ -0,0 +1,65 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +#include <linux/perf/riscv_pmu.h>
> > > +#include <linux/sched_clock.h>
> > > +
> > > +void arch_perf_update_userpage(struct perf_event *event,
> > > +                            struct perf_event_mmap_page *userpg, u64 now)
> > > +{
> > > +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > > +     struct clock_read_data *rd;
> > > +     unsigned int seq;
> > > +     u64 ns;
> > > +
> > > +     userpg->cap_user_time = 0;
> > > +     userpg->cap_user_time_zero = 0;
> > > +     userpg->cap_user_time_short = 0;
> > > +     userpg->cap_user_rdpmc =
> > > +             !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
> > > +
> > > +     /*
> > > +      * The counters are 64-bit but the priv spec doesn't mandate all the
> > > +      * bits to be implemented: that's why, counter width can vary based on
> > > +      * the cpu vendor.
> > > +      */
> > > +     userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
> > > +
> > > +     do {
> > > +             rd = sched_clock_read_begin(&seq);
> > > +
> > > +             userpg->time_mult = rd->mult;
> > > +             userpg->time_shift = rd->shift;
> > > +             userpg->time_zero = rd->epoch_ns;
> > > +             userpg->time_cycles = rd->epoch_cyc;
> > > +             userpg->time_mask = rd->sched_clock_mask;
> > > +
> > > +             /*
> > > +              * Subtract the cycle base, such that software that
> > > +              * doesn't know about cap_user_time_short still 'works'
> > > +              * assuming no wraps.
> > > +              */
> > > +             ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
> > > +             userpg->time_zero -= ns;
> > > +
> > > +     } while (sched_clock_read_retry(seq));
> > > +
> > > +     userpg->time_offset = userpg->time_zero - now;
> > > +
> > > +     /*
> > > +      * time_shift is not expected to be greater than 31 due to
> > > +      * the original published conversion algorithm shifting a
> > > +      * 32-bit value (now specifies a 64-bit value) - refer
> > > +      * perf_event_mmap_page documentation in perf_event.h.
> > > +      */
> > > +     if (userpg->time_shift == 32) {
> > > +             userpg->time_shift = 31;
> > > +             userpg->time_mult >>= 1;
> > > +     }
> > > +
> > > +     /*
> > > +      * Internal timekeeping for enabled/running/stopped times
> > > +      * is always computed with the sched_clock.
> > > +      */
> > > +     userpg->cap_user_time = 1;
> > > +     userpg->cap_user_time_zero = 1;
> > > +     userpg->cap_user_time_short = 1;
> > > +}
> > > diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
> > > index ebca5eab9c9b..12675ee1123c 100644
> > > --- a/drivers/perf/riscv_pmu.c
> > > +++ b/drivers/perf/riscv_pmu.c
> > > @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
> > >
> > >       local64_set(&hwc->prev_count, (u64)-left);
> > >
> > > +     perf_event_update_userpage(event);
> > > +
> > >       return overflow;
> > >  }
> > >
> > > @@ -283,6 +285,43 @@ static int riscv_pmu_event_init(struct perf_event *event)
> > >       return 0;
> > >  }
> > >
> > > +static int riscv_pmu_event_idx(struct perf_event *event)
> > > +{
> > > +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > > +
> > > +     if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
> > > +             return 0;
> > > +
> > > +     /*
> > > +      * cycle and instret can either be retrieved from their fixed counters
> > > +      * or from programmable counters, the latter being the preferred way
> > > +      * since cycle and instret counters do not support sampling.
> > > +      */
> > > +
> > > +     return rvpmu->csr_index(event) + 1;
> > > +}
> > > +
> > > +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > +     /*
> > > +      * The user mmapped the event to directly access it: this is where
> > > +      * we determine based on sysctl_perf_user_access if we grant userspace
> > > +      * the direct access to this event. That means that within the same
> > > +      * task, some events may be directly accessible and some other may not,
> > > > +      * if the user changes the value of sysctl_perf_user_access in the
> > > +      * meantime.
> > > +      */
> > > +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
> > > +
> > > +     event->hw.flags |= rvpmu->event_flags(event);
> > > +     perf_event_update_userpage(event);
> > > +}
> > > +
> > > +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
> > > +{
> > > +     event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
> > > +}
> > > +
> > >  struct riscv_pmu *riscv_pmu_alloc(void)
> > >  {
> > >       struct riscv_pmu *pmu;
> > > @@ -307,6 +346,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
> > >       }
> > >       pmu->pmu = (struct pmu) {
> > >               .event_init     = riscv_pmu_event_init,
> > > +             .event_mapped   = riscv_pmu_event_mapped,
> > > +             .event_unmapped = riscv_pmu_event_unmapped,
> > > +             .event_idx      = riscv_pmu_event_idx,
> > >               .add            = riscv_pmu_add,
> > >               .del            = riscv_pmu_del,
> > >               .start          = riscv_pmu_start,
> > > diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
> > > index 0d8c9d8849ee..35c4c9097a0f 100644
> > > --- a/drivers/perf/riscv_pmu_legacy.c
> > > +++ b/drivers/perf/riscv_pmu_legacy.c
> > > @@ -74,6 +74,21 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
> > >       local64_set(&hwc->prev_count, initial_val);
> > >  }
> > >
> > > +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
> > > +{
> > > +     return event->hw.idx;
> > > +}
> > > +
> > > +static int pmu_legacy_event_flags(struct perf_event *event)
> > > +{
> > > +     /* In legacy mode, the first 3 CSRs are available. */
> > > +     if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
> > > +         event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
> > > +             return 0;
> > > +
> > > +     return PERF_EVENT_FLAG_USER_READ_CNT;
> > > +}
> > > +
> > >  /*
> > >   * This is just a simple implementation to allow legacy implementations
> > >   * compatible with new RISC-V PMU driver framework.
> > > @@ -94,6 +109,8 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
> > >       pmu->ctr_get_width = NULL;
> > >       pmu->ctr_clear_idx = NULL;
> > >       pmu->ctr_read = pmu_legacy_read_ctr;
> > > +     pmu->event_flags = pmu_legacy_event_flags;
> > > +     pmu->csr_index = pmu_legacy_csr_index;
> > >
> > >       perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
> > >  }
> > > diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
> > > index 70cb50fd41c2..af7f3128b6b8 100644
> > > --- a/drivers/perf/riscv_pmu_sbi.c
> > > +++ b/drivers/perf/riscv_pmu_sbi.c
> > > @@ -24,6 +24,10 @@
> > >  #include <asm/sbi.h>
> > >  #include <asm/hwcap.h>
> > >
> > > +#define SYSCTL_NO_USER_ACCESS        0
> > > +#define SYSCTL_USER_ACCESS   1
> > > +#define SYSCTL_LEGACY                2
> > > +
> > >  PMU_FORMAT_ATTR(event, "config:0-47");
> > >  PMU_FORMAT_ATTR(firmware, "config:63");
> > >
> > > @@ -43,6 +47,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
> > >       NULL,
> > >  };
> > >
> > > +/* Allow legacy access by default */
> > > +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
> > > +
> > >  /*
> > >   * RISC-V doesn't have heterogeneous harts yet. This need to be part of
> > >   * per_cpu in case of harts with different pmu counters
> > > @@ -301,6 +308,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
> > >  }
> > >  EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
> > >
> > > +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
> > > +{
> > > +     return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
> > > +}
> > > +
> > >  static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
> > >  {
> > >       unsigned long cflags = 0;
> > > @@ -329,18 +341,30 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
> > >       struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
> > >       struct sbiret ret;
> > >       int idx;
> > > -     uint64_t cbase = 0;
> > > +     uint64_t cbase = 0, cmask = rvpmu->cmask;
> > >       unsigned long cflags = 0;
> > >
> > >       cflags = pmu_sbi_get_filter_flags(event);
> > > +
> > > +     /* In legacy mode, we have to force the fixed counters for those events */
> > > +     if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
> > > +             if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
> > > +                     cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > > +                     cmask = 1;
> > > +             } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
> > > +                     cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
> > > +                     cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
> > > +             }
> > > +     }
> > > +
> > >       /* retrieve the available counter index */
> > >  #if defined(CONFIG_32BIT)
> > >       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > > -                     rvpmu->cmask, cflags, hwc->event_base, hwc->config,
> > > +                     cmask, cflags, hwc->event_base, hwc->config,
> > >                       hwc->config >> 32);
> > >  #else
> > >       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
> > > -                     rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
> > > +                     cmask, cflags, hwc->event_base, hwc->config, 0);
> > >  #endif
> > >       if (ret.error) {
> > >               pr_debug("Not able to find a counter for event %lx config %llx\n",
> > > @@ -490,6 +514,11 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
> > >       if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
> > >               pr_err("Starting counter idx %d failed with error %d\n",
> > >                       hwc->idx, sbi_err_map_linux_errno(ret.error));
> > > +
> > > +     if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
> > > +         event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
> > > +             csr_write(CSR_SCOUNTEREN,
> > > +                       csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
> > >  }
> > >
> > >  static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > > @@ -497,6 +526,11 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
> > >       struct sbiret ret;
> > >       struct hw_perf_event *hwc = &event->hw;
> > >
> > > +     if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
> > > +         event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
> > > +             csr_write(CSR_SCOUNTEREN,
> > > +                       csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
> > > +
> > >       ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
> > >       if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
> > >               flag != SBI_PMU_STOP_FLAG_RESET)
> > > @@ -704,10 +738,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
> > >       struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
> > >
> > >       /*
> > > -      * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
> > > -      * as is necessary to maintain uABI compatibility.
> > > > +      * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
> > > +      * legacy option but that will be removed in the future.
> >
> > > Will it? The documentation hunk didn't mention that value 2 was deprecated.
>
> You're right, I'll add that to the documentation too, thanks.
>
> >
> > >        */
> > > -     csr_write(CSR_SCOUNTEREN, 0x7);
> > > +     if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > > +             csr_write(CSR_SCOUNTEREN, 0x7);
> > > +     else
> > > +             csr_write(CSR_SCOUNTEREN, 0x2);
> > >
> > >       /* Stop all the counters so that they can be enabled from perf */
> > >       pmu_sbi_stop_all(pmu);
> > > @@ -851,6 +888,66 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
> > >       cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > >  }
> > >
> > > +static int pmu_sbi_event_flags(struct perf_event *event)
> > > +{
> > > +     if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
> > > +             return 0;
> > > +
> > > +     /* In legacy mode, the first 3 CSRs are available. */
> > > +     if (sysctl_perf_user_access == SYSCTL_LEGACY) {
> > > +             int flags = PERF_EVENT_FLAG_LEGACY;
> > > +
> > > +             if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES ||
> > > +                 event->attr.config == PERF_COUNT_HW_INSTRUCTIONS)
> > > +                     flags |= PERF_EVENT_FLAG_USER_READ_CNT;
> > > +
> > > +             return flags;
> > > +     }
> > > +
> > > +     return PERF_EVENT_FLAG_USER_READ_CNT;
> > > +}
> > > +
> > > +static void riscv_pmu_update_counter_access(void *info)
> > > +{
> > > +     if (sysctl_perf_user_access == SYSCTL_LEGACY)
> > > +             csr_write(CSR_SCOUNTEREN, 0x7);
> > > +     else
> > > +             csr_write(CSR_SCOUNTEREN, 0x2);
> > > +}
> > > +
> > > +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
> > > +                                           int write, void *buffer,
> > > +                                           size_t *lenp, loff_t *ppos)
> > > +{
> > > +     int prev = sysctl_perf_user_access;
> > > +     int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > > +
> > > +     /*
> > > +      * Test against the previous value since we clear SCOUNTEREN when
> > > +      * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
> > > +      * not do that if that was already the case.
> > > +      */
> > > +     if (ret || !write || prev == sysctl_perf_user_access)
> > > +             return ret;
> > > +
> > > +     on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +static struct ctl_table sbi_pmu_sysctl_table[] = {
> > > +     {
> > > +             .procname       = "perf_user_access",
> > > +             .data           = &sysctl_perf_user_access,
> > > +             .maxlen         = sizeof(unsigned int),
> > > +             .mode           = 0644,
> > > +             .proc_handler   = riscv_pmu_proc_user_access_handler,
> > > +             .extra1         = SYSCTL_ZERO,
> > > +             .extra2         = SYSCTL_TWO,
> > > +     },
> > > +     { }
> > > +};
> > > +
> > >  static int pmu_sbi_device_probe(struct platform_device *pdev)
> > >  {
> > >       struct riscv_pmu *pmu = NULL;
> > > @@ -888,6 +985,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > >       pmu->ctr_get_width = pmu_sbi_ctr_get_width;
> > >       pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
> > >       pmu->ctr_read = pmu_sbi_ctr_read;
> > > +     pmu->event_flags = pmu_sbi_event_flags;
> > > +     pmu->csr_index = pmu_sbi_csr_index;
> > >
> > >       ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
> > >       if (ret)
> > > @@ -901,6 +1000,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
> > >       if (ret)
> > >               goto out_unregister;
> > >
> > > +     register_sysctl("kernel", sbi_pmu_sysctl_table);
> > > +
> > >       return 0;
> > >
> > >  out_unregister:
> > > diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
> > > index 9f70d94942e0..ba19634d815c 100644
> > > --- a/include/linux/perf/riscv_pmu.h
> > > +++ b/include/linux/perf/riscv_pmu.h
> > > @@ -12,6 +12,7 @@
> > >  #include <linux/perf_event.h>
> > >  #include <linux/ptrace.h>
> > >  #include <linux/interrupt.h>
> > > +#include <asm/perf_event.h>
> > >
> > >  #ifdef CONFIG_RISCV_PMU
> > >
> > > @@ -55,6 +56,8 @@ struct riscv_pmu {
> > >       void            (*ctr_start)(struct perf_event *event, u64 init_val);
> > >       void            (*ctr_stop)(struct perf_event *event, unsigned long flag);
> > >       int             (*event_map)(struct perf_event *event, u64 *config);
> > > +     int             (*event_flags)(struct perf_event *event);
> > > +     uint8_t         (*csr_index)(struct perf_event *event);
> > >
> > >       struct cpu_hw_events    __percpu *hw_events;
> > >       struct hlist_node       node;
> > > diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
> > > index 0d1634cedf44..18f2abb1584a 100644
> > > --- a/tools/lib/perf/mmap.c
> > > +++ b/tools/lib/perf/mmap.c
> > > @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
> > >
> > >  static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
> > >
> > > +#elif defined(__riscv) && __riscv_xlen == 64
> >
> > It's enough to just check __riscv_xlen.
>
> Right, thanks
>
> >
> > > +
> > > +#define CSR_CYCLE    0xc00
> > > +#define CSR_TIME     0xc01
> > > +#define CSR_CYCLEH   0xc80
> > > +
> > > +#define csr_read(csr)                                                \
> > > +({                                                           \
> > > +     register unsigned long __v;                             \
> > > +             __asm__ __volatile__ ("csrr %0, " #csr          \
> > > +              : "=r" (__v) :                                 \
> > > +              : "memory");                                   \
> > > +              __v;                                           \
> > > +})
> > > +
> > > +static unsigned long csr_read_num(int csr_num)
> > > +{
> > > +#define switchcase_csr_read(__csr_num, __val)           {\
> > > +     case __csr_num:                                 \
> > > +             __val = csr_read(__csr_num);            \
> > > +             break; }
> > > +#define switchcase_csr_read_2(__csr_num, __val)         {\
> > > +     switchcase_csr_read(__csr_num + 0, __val)        \
> > > +     switchcase_csr_read(__csr_num + 1, __val)}
> > > +#define switchcase_csr_read_4(__csr_num, __val)         {\
> > > +     switchcase_csr_read_2(__csr_num + 0, __val)      \
> > > +     switchcase_csr_read_2(__csr_num + 2, __val)}
> > > +#define switchcase_csr_read_8(__csr_num, __val)         {\
> > > +     switchcase_csr_read_4(__csr_num + 0, __val)      \
> > > +     switchcase_csr_read_4(__csr_num + 4, __val)}
> > > +#define switchcase_csr_read_16(__csr_num, __val)        {\
> > > +     switchcase_csr_read_8(__csr_num + 0, __val)      \
> > > +     switchcase_csr_read_8(__csr_num + 8, __val)}
> > > +#define switchcase_csr_read_32(__csr_num, __val)        {\
> > > +     switchcase_csr_read_16(__csr_num + 0, __val)     \
> > > +     switchcase_csr_read_16(__csr_num + 16, __val)}
> > > +
> > > +     unsigned long ret = 0;
> > > +
> > > +     switch (csr_num) {
> > > +     switchcase_csr_read_32(CSR_CYCLE, ret)
> > > +     switchcase_csr_read_32(CSR_CYCLEH, ret)
> > > +     default :
> >                ^ extra space
> >
>
> Thanks
>
> > > +             break;
> > > +     }
> > > +
> > > +     return ret;
> > > +#undef switchcase_csr_read_32
> > > +#undef switchcase_csr_read_16
> > > +#undef switchcase_csr_read_8
> > > +#undef switchcase_csr_read_4
> > > +#undef switchcase_csr_read_2
> > > +#undef switchcase_csr_read
> > > +}
> > > +
> > > +static u64 read_perf_counter(unsigned int counter)
> > > +{
> > > +     return csr_read_num(CSR_CYCLE + counter);
> > > +}
> > > +
> > > +static u64 read_timestamp(void)
> > > +{
> > > +     return csr_read_num(CSR_TIME);
> > > +}
> > > +
> > >  #else
> > >  static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
> > >  static u64 read_timestamp(void) { return 0; }
> > > --
> > > 2.37.2
> > >
> >
> > > There is a lot going on in this patch. It'd be easier to review if it was
> > > broken up a bit, e.g. the import of arm code, the tools/lib/perf/mmap.c
> > > hunk, and whatever else makes sense.
>
> Ok, will do that in v2!
>
> >
> > Thanks,
> > drew
>
> Thanks,
>
> Alex

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-05-09 12:24       ` Emil Renner Berthing
@ 2023-05-09 13:40         ` Alexandre Ghiti
  0 siblings, 0 replies; 26+ messages in thread
From: Alexandre Ghiti @ 2023-05-09 13:40 UTC (permalink / raw)
  To: Emil Renner Berthing, Alexandre Ghiti
  Cc: Andrew Jones, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel


On 5/9/23 14:24, Emil Renner Berthing wrote:
> On Wed, 26 Apr 2023 at 15:19, Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>> On Wed, Apr 26, 2023 at 2:57 PM Andrew Jones <ajones@ventanamicro.com> wrote:
>>> On Thu, Apr 13, 2023 at 06:17:25PM +0200, Alexandre Ghiti wrote:
>>>> We used to unconditionally expose the cycle and instret csrs to
>>>> userspace, which gives rise to security concerns.
>>>>
>>>> So only allow access to hw counters from userspace through the perf
>>>> framework which will handle context switches, per-task events, etc. But
>>>> as we cannot break userspace, we give the user the choice to go back to
>>>> the previous behaviour by setting the sysctl perf_user_access.
>>>>
>>>> We also introduce a means to directly map the hardware counters to
>>>> userspace, thus avoiding the need for syscalls whenever an application
>>>> wants to access counters values.
>>>>
>>>> Note that arch_perf_update_userpage is a copy of arm64 code.
>>>>
>>>> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
>>>> ---
>>>>   Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
>>>>   arch/riscv/include/asm/perf_event.h         |   3 +
>>>>   arch/riscv/kernel/Makefile                  |   2 +-
>>>>   arch/riscv/kernel/perf_event.c              |  65 +++++++++++
>>>>   drivers/perf/riscv_pmu.c                    |  42 ++++++++
>>>>   drivers/perf/riscv_pmu_legacy.c             |  17 +++
>>>>   drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
>>>>   include/linux/perf/riscv_pmu.h              |   3 +
>>>>   tools/lib/perf/mmap.c                       |  65 +++++++++++
>>>>   9 files changed, 322 insertions(+), 11 deletions(-)
>>>>   create mode 100644 arch/riscv/kernel/perf_event.c
>>>>
>>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>>> index 4b7bfea28cd7..02b2a40a3647 100644
>>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>>> @@ -941,16 +941,31 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
>>>>   The default value is 8.
>>>>
>>>>
>>>> -perf_user_access (arm64 only)
>>>> -=================================
>>>> +perf_user_access (arm64 and riscv only)
>>>> +=======================================
>>>> +
>>>> +Controls user space access for reading perf event counters.
>>>>
>>>> -Controls user space access for reading perf event counters. When set to 1,
>>>> -user space can read performance monitor counter registers directly.
>>>> +arm64
>>>> +=====
>>>>
>>>>   The default value is 0 (access disabled).
>>>> +When set to 1, user space can read performance monitor counter registers
>>>> +directly.
>>>>
>>>>   See Documentation/arm64/perf.rst for more information.
>>>>
>>>> +riscv
>>>> +=====
>>>> +
>>>> +When set to 0, user access is disabled.
>>>> +
>>>> +When set to 1, user space can read performance monitor counter registers
>>>> +directly only through perf, any direct access without perf intervention will
>>>> +trigger an illegal instruction.
>>>> +
>>>> +The default value is 2, it enables the legacy mode, that is user space has
>>>> +direct access to cycle, time and instret CSRs only.
>>> I think this default value should be a Kconfig symbol, allowing kernels to
>>> be built with a secure default.
>> Actually I was more in favor of having the default to 1 (ie the secure
>> option) and let the distros deal with the legacy mode (via a sysctl
>> parameter on the command line) as long as user-space has not been
>> fixed: does that make sense?
> With the Linux policy of not breaking userspace I wouldn't think
> having anything but 2 as the default is ok. Is there a reason we can't
> have a mode that allows both the legacy and perf interface?


No: perf enables and disables the counters at context switch, so legacy
applications that expect the CSRs to always be accessible would fail.
Besides, the goal of going through perf is to avoid leaking application
details.
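
To make the incompatibility concrete, here is a minimal user-space model of
how patch 4 programs SCOUNTEREN for each sysctl value. It is only a sketch:
the helper names scounteren_base(), scounteren_on_start() and
scounteren_on_stop() are hypothetical, and keying on the sysctl value is a
simplification, since the driver actually tests per-event flags that are
latched when the event is mmapped.

```c
#include <assert.h>
#include <stdint.h>

/* Values of the perf_user_access sysctl, named as in patch 4. */
enum {
	SYSCTL_NO_USER_ACCESS = 0,
	SYSCTL_USER_ACCESS    = 1,
	SYSCTL_LEGACY         = 2,
};

/*
 * Baseline SCOUNTEREN value written on each hart.
 * Bit 0 = cycle, bit 1 = time, bit 2 = instret.
 * Legacy mode exposes all three; the other modes expose only time.
 */
static uint32_t scounteren_base(int mode)
{
	return mode == SYSCTL_LEGACY ? 0x7 : 0x2;
}

/*
 * Hypothetical model of counter start: in perf-only mode the bit of the
 * counter backing the event is set so that user space can read it.
 */
static uint32_t scounteren_on_start(uint32_t cur, int mode,
				    unsigned int csr_index)
{
	assert(csr_index < 32);
	if (mode == SYSCTL_USER_ACCESS)
		cur |= UINT32_C(1) << csr_index;
	return cur;
}

/* Hypothetical model of counter stop: the bit is cleared again. */
static uint32_t scounteren_on_stop(uint32_t cur, int mode,
				   unsigned int csr_index)
{
	assert(csr_index < 32);
	if (mode == SYSCTL_USER_ACCESS)
		cur &= ~(UINT32_C(1) << csr_index);
	return cur;
}
```

In perf-only mode the cycle and instret bits are therefore clear whenever no
mmapped event is running on the hart, which is exactly the situation in which
a legacy application would take an illegal instruction trap.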


>
>>>>   pid_max
>>>>   =======
>>>> diff --git a/arch/riscv/include/asm/perf_event.h b/arch/riscv/include/asm/perf_event.h
>>>> index d42c901f9a97..9fdfdd9dc92d 100644
>>>> --- a/arch/riscv/include/asm/perf_event.h
>>>> +++ b/arch/riscv/include/asm/perf_event.h
>>>> @@ -9,5 +9,8 @@
>>>>   #define _ASM_RISCV_PERF_EVENT_H
>>>>
>>>>   #include <linux/perf_event.h>
>>>> +
>>>> +#define PERF_EVENT_FLAG_LEGACY       1
>>>> +
>>>>   #define perf_arch_bpf_user_pt_regs(regs) (struct user_regs_struct *)regs
>>>>   #endif /* _ASM_RISCV_PERF_EVENT_H */
>>>> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
>>>> index aa22f87faeae..9ae951b07847 100644
>>>> --- a/arch/riscv/kernel/Makefile
>>>> +++ b/arch/riscv/kernel/Makefile
>>>> @@ -70,7 +70,7 @@ obj-$(CONFIG_DYNAMIC_FTRACE)        += mcount-dyn.o
>>>>
>>>>   obj-$(CONFIG_TRACE_IRQFLAGS) += trace_irq.o
>>>>
>>>> -obj-$(CONFIG_PERF_EVENTS)    += perf_callchain.o
>>>> +obj-$(CONFIG_PERF_EVENTS)    += perf_callchain.o perf_event.o
>>>>   obj-$(CONFIG_HAVE_PERF_REGS) += perf_regs.o
>>>>   obj-$(CONFIG_RISCV_SBI)              += sbi.o
>>>>   ifeq ($(CONFIG_RISCV_SBI), y)
>>>> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
>>>> new file mode 100644
>>>> index 000000000000..4a75ab628bfb
>>>> --- /dev/null
>>>> +++ b/arch/riscv/kernel/perf_event.c
>>>> @@ -0,0 +1,65 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +#include <linux/perf/riscv_pmu.h>
>>>> +#include <linux/sched_clock.h>
>>>> +
>>>> +void arch_perf_update_userpage(struct perf_event *event,
>>>> +                            struct perf_event_mmap_page *userpg, u64 now)
>>>> +{
>>>> +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
>>>> +     struct clock_read_data *rd;
>>>> +     unsigned int seq;
>>>> +     u64 ns;
>>>> +
>>>> +     userpg->cap_user_time = 0;
>>>> +     userpg->cap_user_time_zero = 0;
>>>> +     userpg->cap_user_time_short = 0;
>>>> +     userpg->cap_user_rdpmc =
>>>> +             !!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT);
>>>> +
>>>> +     /*
>>>> +      * The counters are 64-bit but the priv spec doesn't mandate all the
>>>> +      * bits to be implemented: that's why, counter width can vary based on
>>>> +      * the cpu vendor.
>>>> +      */
>>>> +     userpg->pmc_width = rvpmu->ctr_get_width(event->hw.idx) + 1;
>>>> +
>>>> +     do {
>>>> +             rd = sched_clock_read_begin(&seq);
>>>> +
>>>> +             userpg->time_mult = rd->mult;
>>>> +             userpg->time_shift = rd->shift;
>>>> +             userpg->time_zero = rd->epoch_ns;
>>>> +             userpg->time_cycles = rd->epoch_cyc;
>>>> +             userpg->time_mask = rd->sched_clock_mask;
>>>> +
>>>> +             /*
>>>> +              * Subtract the cycle base, such that software that
>>>> +              * doesn't know about cap_user_time_short still 'works'
>>>> +              * assuming no wraps.
>>>> +              */
>>>> +             ns = mul_u64_u32_shr(rd->epoch_cyc, rd->mult, rd->shift);
>>>> +             userpg->time_zero -= ns;
>>>> +
>>>> +     } while (sched_clock_read_retry(seq));
>>>> +
>>>> +     userpg->time_offset = userpg->time_zero - now;
>>>> +
>>>> +     /*
>>>> +      * time_shift is not expected to be greater than 31 due to
>>>> +      * the original published conversion algorithm shifting a
>>>> +      * 32-bit value (now specifies a 64-bit value) - refer
>>>> +      * perf_event_mmap_page documentation in perf_event.h.
>>>> +      */
>>>> +     if (userpg->time_shift == 32) {
>>>> +             userpg->time_shift = 31;
>>>> +             userpg->time_mult >>= 1;
>>>> +     }
>>>> +
>>>> +     /*
>>>> +      * Internal timekeeping for enabled/running/stopped times
>>>> +      * is always computed with the sched_clock.
>>>> +      */
>>>> +     userpg->cap_user_time = 1;
>>>> +     userpg->cap_user_time_zero = 1;
>>>> +     userpg->cap_user_time_short = 1;
>>>> +}
>>>> diff --git a/drivers/perf/riscv_pmu.c b/drivers/perf/riscv_pmu.c
>>>> index ebca5eab9c9b..12675ee1123c 100644
>>>> --- a/drivers/perf/riscv_pmu.c
>>>> +++ b/drivers/perf/riscv_pmu.c
>>>> @@ -171,6 +171,8 @@ int riscv_pmu_event_set_period(struct perf_event *event)
>>>>
>>>>        local64_set(&hwc->prev_count, (u64)-left);
>>>>
>>>> +     perf_event_update_userpage(event);
>>>> +
>>>>        return overflow;
>>>>   }
>>>>
>>>> @@ -283,6 +285,43 @@ static int riscv_pmu_event_init(struct perf_event *event)
>>>>        return 0;
>>>>   }
>>>>
>>>> +static int riscv_pmu_event_idx(struct perf_event *event)
>>>> +{
>>>> +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
>>>> +
>>>> +     if (!(event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT))
>>>> +             return 0;
>>>> +
>>>> +     /*
>>>> +      * cycle and instret can either be retrieved from their fixed counters
>>>> +      * or from programmable counters, the latter being the preferred way
>>>> +      * since cycle and instret counters do not support sampling.
>>>> +      */
>>>> +
>>>> +     return rvpmu->csr_index(event) + 1;
>>>> +}
>>>> +
>>>> +static void riscv_pmu_event_mapped(struct perf_event *event, struct mm_struct *mm)
>>>> +{
>>>> +     /*
>>>> +      * The user mmapped the event to directly access it: this is where
>>>> +      * we determine based on sysctl_perf_user_access if we grant userspace
>>>> +      * the direct access to this event. That means that within the same
>>>> +      * task, some events may be directly accessible and some other may not,
>>>> +      * if the user changes the value of sysctl_perf_user_access in the
>>>> +      * meantime.
>>>> +      */
>>>> +     struct riscv_pmu *rvpmu = to_riscv_pmu(event->pmu);
>>>> +
>>>> +     event->hw.flags |= rvpmu->event_flags(event);
>>>> +     perf_event_update_userpage(event);
>>>> +}
>>>> +
>>>> +static void riscv_pmu_event_unmapped(struct perf_event *event, struct mm_struct *mm)
>>>> +{
>>>> +     event->hw.flags &= ~PERF_EVENT_FLAG_USER_READ_CNT;
>>>> +}
>>>> +
>>>>   struct riscv_pmu *riscv_pmu_alloc(void)
>>>>   {
>>>>        struct riscv_pmu *pmu;
>>>> @@ -307,6 +346,9 @@ struct riscv_pmu *riscv_pmu_alloc(void)
>>>>        }
>>>>        pmu->pmu = (struct pmu) {
>>>>                .event_init     = riscv_pmu_event_init,
>>>> +             .event_mapped   = riscv_pmu_event_mapped,
>>>> +             .event_unmapped = riscv_pmu_event_unmapped,
>>>> +             .event_idx      = riscv_pmu_event_idx,
>>>>                .add            = riscv_pmu_add,
>>>>                .del            = riscv_pmu_del,
>>>>                .start          = riscv_pmu_start,
>>>> diff --git a/drivers/perf/riscv_pmu_legacy.c b/drivers/perf/riscv_pmu_legacy.c
>>>> index 0d8c9d8849ee..35c4c9097a0f 100644
>>>> --- a/drivers/perf/riscv_pmu_legacy.c
>>>> +++ b/drivers/perf/riscv_pmu_legacy.c
>>>> @@ -74,6 +74,21 @@ static void pmu_legacy_ctr_start(struct perf_event *event, u64 ival)
>>>>        local64_set(&hwc->prev_count, initial_val);
>>>>   }
>>>>
>>>> +static uint8_t pmu_legacy_csr_index(struct perf_event *event)
>>>> +{
>>>> +     return event->hw.idx;
>>>> +}
>>>> +
>>>> +static int pmu_legacy_event_flags(struct perf_event *event)
>>>> +{
>>>> +     /* In legacy mode, the first 3 CSRs are available. */
>>>> +     if (event->attr.config != PERF_COUNT_HW_CPU_CYCLES &&
>>>> +         event->attr.config != PERF_COUNT_HW_INSTRUCTIONS)
>>>> +             return 0;
>>>> +
>>>> +     return PERF_EVENT_FLAG_USER_READ_CNT;
>>>> +}
>>>> +
>>>>   /*
>>>>    * This is just a simple implementation to allow legacy implementations
>>>>    * compatible with new RISC-V PMU driver framework.
>>>> @@ -94,6 +109,8 @@ static void pmu_legacy_init(struct riscv_pmu *pmu)
>>>>        pmu->ctr_get_width = NULL;
>>>>        pmu->ctr_clear_idx = NULL;
>>>>        pmu->ctr_read = pmu_legacy_read_ctr;
>>>> +     pmu->event_flags = pmu_legacy_event_flags;
>>>> +     pmu->csr_index = pmu_legacy_csr_index;
>>>>
>>>>        perf_pmu_register(&pmu->pmu, "cpu", PERF_TYPE_RAW);
>>>>   }
>>>> diff --git a/drivers/perf/riscv_pmu_sbi.c b/drivers/perf/riscv_pmu_sbi.c
>>>> index 70cb50fd41c2..af7f3128b6b8 100644
>>>> --- a/drivers/perf/riscv_pmu_sbi.c
>>>> +++ b/drivers/perf/riscv_pmu_sbi.c
>>>> @@ -24,6 +24,10 @@
>>>>   #include <asm/sbi.h>
>>>>   #include <asm/hwcap.h>
>>>>
>>>> +#define SYSCTL_NO_USER_ACCESS        0
>>>> +#define SYSCTL_USER_ACCESS   1
>>>> +#define SYSCTL_LEGACY                2
>>>> +
>>>>   PMU_FORMAT_ATTR(event, "config:0-47");
>>>>   PMU_FORMAT_ATTR(firmware, "config:63");
>>>>
>>>> @@ -43,6 +47,9 @@ static const struct attribute_group *riscv_pmu_attr_groups[] = {
>>>>        NULL,
>>>>   };
>>>>
>>>> +/* Allow legacy access by default */
>>>> +static int sysctl_perf_user_access __read_mostly = SYSCTL_LEGACY;
>>>> +
>>>>   /*
>>>>    * RISC-V doesn't have heterogeneous harts yet. This need to be part of
>>>>    * per_cpu in case of harts with different pmu counters
>>>> @@ -301,6 +308,11 @@ int riscv_pmu_get_hpm_info(u32 *hw_ctr_width, u32 *num_hw_ctr)
>>>>   }
>>>>   EXPORT_SYMBOL_GPL(riscv_pmu_get_hpm_info);
>>>>
>>>> +static uint8_t pmu_sbi_csr_index(struct perf_event *event)
>>>> +{
>>>> +     return pmu_ctr_list[event->hw.idx].csr - CSR_CYCLE;
>>>> +}
>>>> +
>>>>   static unsigned long pmu_sbi_get_filter_flags(struct perf_event *event)
>>>>   {
>>>>        unsigned long cflags = 0;
>>>> @@ -329,18 +341,30 @@ static int pmu_sbi_ctr_get_idx(struct perf_event *event)
>>>>        struct cpu_hw_events *cpuc = this_cpu_ptr(rvpmu->hw_events);
>>>>        struct sbiret ret;
>>>>        int idx;
>>>> -     uint64_t cbase = 0;
>>>> +     uint64_t cbase = 0, cmask = rvpmu->cmask;
>>>>        unsigned long cflags = 0;
>>>>
>>>>        cflags = pmu_sbi_get_filter_flags(event);
>>>> +
>>>> +     /* In legacy mode, we have to force the fixed counters for those events */
>>>> +     if (hwc->flags & PERF_EVENT_FLAG_LEGACY) {
>>>> +             if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES) {
>>>> +                     cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
>>>> +                     cmask = 1;
>>>> +             } else if (event->attr.config == PERF_COUNT_HW_INSTRUCTIONS) {
>>>> +                     cflags |= SBI_PMU_CFG_FLAG_SKIP_MATCH;
>>>> +                     cmask = 1UL << (CSR_INSTRET - CSR_CYCLE);
>>>> +             }
>>>> +     }
>>>> +
>>>>        /* retrieve the available counter index */
>>>>   #if defined(CONFIG_32BIT)
>>>>        ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
>>>> -                     rvpmu->cmask, cflags, hwc->event_base, hwc->config,
>>>> +                     cmask, cflags, hwc->event_base, hwc->config,
>>>>                        hwc->config >> 32);
>>>>   #else
>>>>        ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_CFG_MATCH, cbase,
>>>> -                     rvpmu->cmask, cflags, hwc->event_base, hwc->config, 0);
>>>> +                     cmask, cflags, hwc->event_base, hwc->config, 0);
>>>>   #endif
>>>>        if (ret.error) {
>>>>                pr_debug("Not able to find a counter for event %lx config %llx\n",
>>>> @@ -490,6 +514,11 @@ static void pmu_sbi_ctr_start(struct perf_event *event, u64 ival)
>>>>        if (ret.error && (ret.error != SBI_ERR_ALREADY_STARTED))
>>>>                pr_err("Starting counter idx %d failed with error %d\n",
>>>>                        hwc->idx, sbi_err_map_linux_errno(ret.error));
>>>> +
>>>> +     if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
>>>> +         event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
>>>> +             csr_write(CSR_SCOUNTEREN,
>>>> +                       csr_read(CSR_SCOUNTEREN) | (1 << pmu_sbi_csr_index(event)));
>>>>   }
>>>>
>>>>   static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>>>> @@ -497,6 +526,11 @@ static void pmu_sbi_ctr_stop(struct perf_event *event, unsigned long flag)
>>>>        struct sbiret ret;
>>>>        struct hw_perf_event *hwc = &event->hw;
>>>>
>>>> +     if (!(event->hw.flags & PERF_EVENT_FLAG_LEGACY) &&
>>>> +         event->hw.flags & PERF_EVENT_FLAG_USER_READ_CNT)
>>>> +             csr_write(CSR_SCOUNTEREN,
>>>> +                       csr_read(CSR_SCOUNTEREN) & ~(1 << pmu_sbi_csr_index(event)));
>>>> +
>>>>        ret = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_STOP, hwc->idx, 1, flag, 0, 0, 0);
>>>>        if (ret.error && (ret.error != SBI_ERR_ALREADY_STOPPED) &&
>>>>                flag != SBI_PMU_STOP_FLAG_RESET)
>>>> @@ -704,10 +738,13 @@ static int pmu_sbi_starting_cpu(unsigned int cpu, struct hlist_node *node)
>>>>        struct cpu_hw_events *cpu_hw_evt = this_cpu_ptr(pmu->hw_events);
>>>>
>>>>        /*
>>>> -      * Enable the access for CYCLE, TIME, and INSTRET CSRs from userspace,
>>>> -      * as is necessary to maintain uABI compatibility.
>>>> +      * We keep enabling userspace access to CYCLE, TIME and INSTRET via the
>>>> +      * legacy option but that will be removed in the future.
>>> Will it? The documentation hunk didn't mention that value 2 was deprecated.
>> You're right, I'll add that to the documentation too, thanks.
>>
>>>>         */
>>>> -     csr_write(CSR_SCOUNTEREN, 0x7);
>>>> +     if (sysctl_perf_user_access == SYSCTL_LEGACY)
>>>> +             csr_write(CSR_SCOUNTEREN, 0x7);
>>>> +     else
>>>> +             csr_write(CSR_SCOUNTEREN, 0x2);
>>>>
>>>>        /* Stop all the counters so that they can be enabled from perf */
>>>>        pmu_sbi_stop_all(pmu);
>>>> @@ -851,6 +888,66 @@ static void riscv_pmu_destroy(struct riscv_pmu *pmu)
>>>>        cpuhp_state_remove_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>>>>   }
>>>>
>>>> +static int pmu_sbi_event_flags(struct perf_event *event)
>>>> +{
>>>> +     if (sysctl_perf_user_access == SYSCTL_NO_USER_ACCESS)
>>>> +             return 0;
>>>> +
>>>> +     /* In legacy mode, the first 3 CSRs are available. */
>>>> +     if (sysctl_perf_user_access == SYSCTL_LEGACY) {
>>>> +             int flags = PERF_EVENT_FLAG_LEGACY;
>>>> +
>>>> +             if (event->attr.config == PERF_COUNT_HW_CPU_CYCLES ||
>>>> +                 event->attr.config == PERF_COUNT_HW_INSTRUCTIONS)
>>>> +                     flags |= PERF_EVENT_FLAG_USER_READ_CNT;
>>>> +
>>>> +             return flags;
>>>> +     }
>>>> +
>>>> +     return PERF_EVENT_FLAG_USER_READ_CNT;
>>>> +}
>>>> +
>>>> +static void riscv_pmu_update_counter_access(void *info)
>>>> +{
>>>> +     if (sysctl_perf_user_access == SYSCTL_LEGACY)
>>>> +             csr_write(CSR_SCOUNTEREN, 0x7);
>>>> +     else
>>>> +             csr_write(CSR_SCOUNTEREN, 0x2);
>>>> +}
>>>> +
>>>> +static int riscv_pmu_proc_user_access_handler(struct ctl_table *table,
>>>> +                                           int write, void *buffer,
>>>> +                                           size_t *lenp, loff_t *ppos)
>>>> +{
>>>> +     int prev = sysctl_perf_user_access;
>>>> +     int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>>> +
>>>> +     /*
>>>> +      * Test against the previous value since we clear SCOUNTEREN when
>>>> +      * sysctl_perf_user_access is set to SYSCTL_USER_ACCESS, but we should
>>>> +      * not do that if that was already the case.
>>>> +      */
>>>> +     if (ret || !write || prev == sysctl_perf_user_access)
>>>> +             return ret;
>>>> +
>>>> +     on_each_cpu(riscv_pmu_update_counter_access, (void *)&prev, 1);
>>>> +
>>>> +     return 0;
>>>> +}
>>>> +
>>>> +static struct ctl_table sbi_pmu_sysctl_table[] = {
>>>> +     {
>>>> +             .procname       = "perf_user_access",
>>>> +             .data           = &sysctl_perf_user_access,
>>>> +             .maxlen         = sizeof(unsigned int),
>>>> +             .mode           = 0644,
>>>> +             .proc_handler   = riscv_pmu_proc_user_access_handler,
>>>> +             .extra1         = SYSCTL_ZERO,
>>>> +             .extra2         = SYSCTL_TWO,
>>>> +     },
>>>> +     { }
>>>> +};
>>>> +
>>>>   static int pmu_sbi_device_probe(struct platform_device *pdev)
>>>>   {
>>>>        struct riscv_pmu *pmu = NULL;
>>>> @@ -888,6 +985,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>>>>        pmu->ctr_get_width = pmu_sbi_ctr_get_width;
>>>>        pmu->ctr_clear_idx = pmu_sbi_ctr_clear_idx;
>>>>        pmu->ctr_read = pmu_sbi_ctr_read;
>>>> +     pmu->event_flags = pmu_sbi_event_flags;
>>>> +     pmu->csr_index = pmu_sbi_csr_index;
>>>>
>>>>        ret = cpuhp_state_add_instance(CPUHP_AP_PERF_RISCV_STARTING, &pmu->node);
>>>>        if (ret)
>>>> @@ -901,6 +1000,8 @@ static int pmu_sbi_device_probe(struct platform_device *pdev)
>>>>        if (ret)
>>>>                goto out_unregister;
>>>>
>>>> +     register_sysctl("kernel", sbi_pmu_sysctl_table);
>>>> +
>>>>        return 0;
>>>>
>>>>   out_unregister:
>>>> diff --git a/include/linux/perf/riscv_pmu.h b/include/linux/perf/riscv_pmu.h
>>>> index 9f70d94942e0..ba19634d815c 100644
>>>> --- a/include/linux/perf/riscv_pmu.h
>>>> +++ b/include/linux/perf/riscv_pmu.h
>>>> @@ -12,6 +12,7 @@
>>>>   #include <linux/perf_event.h>
>>>>   #include <linux/ptrace.h>
>>>>   #include <linux/interrupt.h>
>>>> +#include <asm/perf_event.h>
>>>>
>>>>   #ifdef CONFIG_RISCV_PMU
>>>>
>>>> @@ -55,6 +56,8 @@ struct riscv_pmu {
>>>>        void            (*ctr_start)(struct perf_event *event, u64 init_val);
>>>>        void            (*ctr_stop)(struct perf_event *event, unsigned long flag);
>>>>        int             (*event_map)(struct perf_event *event, u64 *config);
>>>> +     int             (*event_flags)(struct perf_event *event);
>>>> +     uint8_t         (*csr_index)(struct perf_event *event);
>>>>
>>>>        struct cpu_hw_events    __percpu *hw_events;
>>>>        struct hlist_node       node;
>>>> diff --git a/tools/lib/perf/mmap.c b/tools/lib/perf/mmap.c
>>>> index 0d1634cedf44..18f2abb1584a 100644
>>>> --- a/tools/lib/perf/mmap.c
>>>> +++ b/tools/lib/perf/mmap.c
>>>> @@ -392,6 +392,71 @@ static u64 read_perf_counter(unsigned int counter)
>>>>
>>>>   static u64 read_timestamp(void) { return read_sysreg(cntvct_el0); }
>>>>
>>>> +#elif defined(__riscv) && __riscv_xlen == 64
>>> It's enough to just check __riscv_xlen.
>> Right, thanks
>>
>>>> +
>>>> +#define CSR_CYCLE    0xc00
>>>> +#define CSR_TIME     0xc01
>>>> +#define CSR_CYCLEH   0xc80
>>>> +
>>>> +#define csr_read(csr)                                                \
>>>> +({                                                           \
>>>> +     register unsigned long __v;                             \
>>>> +             __asm__ __volatile__ ("csrr %0, " #csr          \
>>>> +              : "=r" (__v) :                                 \
>>>> +              : "memory");                                   \
>>>> +              __v;                                           \
>>>> +})
>>>> +
>>>> +static unsigned long csr_read_num(int csr_num)
>>>> +{
>>>> +#define switchcase_csr_read(__csr_num, __val)           {\
>>>> +     case __csr_num:                                 \
>>>> +             __val = csr_read(__csr_num);            \
>>>> +             break; }
>>>> +#define switchcase_csr_read_2(__csr_num, __val)         {\
>>>> +     switchcase_csr_read(__csr_num + 0, __val)        \
>>>> +     switchcase_csr_read(__csr_num + 1, __val)}
>>>> +#define switchcase_csr_read_4(__csr_num, __val)         {\
>>>> +     switchcase_csr_read_2(__csr_num + 0, __val)      \
>>>> +     switchcase_csr_read_2(__csr_num + 2, __val)}
>>>> +#define switchcase_csr_read_8(__csr_num, __val)         {\
>>>> +     switchcase_csr_read_4(__csr_num + 0, __val)      \
>>>> +     switchcase_csr_read_4(__csr_num + 4, __val)}
>>>> +#define switchcase_csr_read_16(__csr_num, __val)        {\
>>>> +     switchcase_csr_read_8(__csr_num + 0, __val)      \
>>>> +     switchcase_csr_read_8(__csr_num + 8, __val)}
>>>> +#define switchcase_csr_read_32(__csr_num, __val)        {\
>>>> +     switchcase_csr_read_16(__csr_num + 0, __val)     \
>>>> +     switchcase_csr_read_16(__csr_num + 16, __val)}
>>>> +
>>>> +     unsigned long ret = 0;
>>>> +
>>>> +     switch (csr_num) {
>>>> +     switchcase_csr_read_32(CSR_CYCLE, ret)
>>>> +     switchcase_csr_read_32(CSR_CYCLEH, ret)
>>>> +     default :
>>>                 ^ extra space
>>>
>> Thanks
>>
>>>> +             break;
>>>> +     }
>>>> +
>>>> +     return ret;
>>>> +#undef switchcase_csr_read_32
>>>> +#undef switchcase_csr_read_16
>>>> +#undef switchcase_csr_read_8
>>>> +#undef switchcase_csr_read_4
>>>> +#undef switchcase_csr_read_2
>>>> +#undef switchcase_csr_read
>>>> +}
>>>> +
>>>> +static u64 read_perf_counter(unsigned int counter)
>>>> +{
>>>> +     return csr_read_num(CSR_CYCLE + counter);
>>>> +}
>>>> +
>>>> +static u64 read_timestamp(void)
>>>> +{
>>>> +     return csr_read_num(CSR_TIME);
>>>> +}
>>>> +
>>>>   #else
>>>>   static u64 read_perf_counter(unsigned int counter __maybe_unused) { return 0; }
>>>>   static u64 read_timestamp(void) { return 0; }
>>>> --
>>>> 2.37.2
>>>>
>>> A lot going on this patch. It'd be easier to review if it was broken up a
>>> bit. E.g. import of arm code, the tools/lib/perf/mmap.c hunk, and whatever
>>> else makes sense.
>> Ok, will do that in v2!
>>
>>> Thanks,
>>> drew
>> Thanks,
>>
>> Alex
>>
>> _______________________________________________
>> linux-riscv mailing list
>> linux-riscv@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-riscv


* Re: [PATCH 4/4] riscv: Enable perf counters user access only through perf
  2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
                     ` (2 preceding siblings ...)
  2023-04-26 12:57   ` Andrew Jones
@ 2023-05-01  2:09   ` Bagas Sanjaya
  3 siblings, 0 replies; 26+ messages in thread
From: Bagas Sanjaya @ 2023-05-01  2:09 UTC (permalink / raw)
  To: Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Atish Patra, Anup Patel, Will Deacon,
	Rob Herring, linux-doc, linux-kernel, linux-perf-users,
	linux-riscv, linux-arm-kernel

On 4/13/23 23:17, Alexandre Ghiti wrote:
> +The default value is 2, it enables the legacy mode, that is user space has
> +direct access to cycle, time and instret CSRs only.
>  

"The default value is 2, which enables legacy mode (user space has direct
access to cycle, time, and instret CSRs only)."

-- 
An old man doll... just what I always wanted! - Clara



* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
                   ` (3 preceding siblings ...)
  2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
@ 2023-04-13 16:36 ` Ian Rogers
  2023-04-13 19:17 ` Atish Patra
  5 siblings, 0 replies; 26+ messages in thread
From: Ian Rogers @ 2023-04-13 16:36 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Atish Patra, Anup Patel, Will Deacon, Rob Herring, linux-doc,
	linux-kernel, linux-perf-users, linux-riscv, linux-arm-kernel,
	paranlee

On Thu, Apr 13, 2023 at 9:17 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> riscv used to allow direct access to cycle/time/instret counters,
> bypassing the perf framework, this patchset intends to allow the user to
> mmap any counter when accessed through perf. But we can't break the
> existing behaviour so we introduce a sysctl perf_user_access like arm64
> does, which defaults to the legacy mode described above.
>
> The core of this patchset lies in patch 4, the first 3 patches are
> simple fixes.
>
> base-commit-tag: v6.3-rc1
>
> Alexandre Ghiti (4):
>   perf: Fix wrong comment about default event_idx
>   include: riscv: Fix wrong include guard in riscv_pmu.h
>   riscv: Make legacy counter enum match the HW numbering
>   riscv: Enable perf counters user access only through perf

Presumably the test also needs patching:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/tests/mmap-basic.c?h=perf-tools-next#n287

Thanks,
Ian


>  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
>  arch/riscv/include/asm/perf_event.h         |   3 +
>  arch/riscv/kernel/Makefile                  |   2 +-
>  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
>  drivers/perf/riscv_pmu.c                    |  42 ++++++++
>  drivers/perf/riscv_pmu_legacy.c             |  24 ++++-
>  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
>  include/linux/perf/riscv_pmu.h              |   9 +-
>  include/linux/perf_event.h                  |   3 +-
>  tools/lib/perf/mmap.c                       |  65 +++++++++++
>  10 files changed, 332 insertions(+), 17 deletions(-)
>  create mode 100644 arch/riscv/kernel/perf_event.c
>
> --
> 2.37.2
>


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
                   ` (4 preceding siblings ...)
  2023-04-13 16:36 ` [PATCH 0/4] riscv: Allow userspace to directly access perf counters Ian Rogers
@ 2023-04-13 19:17 ` Atish Patra
  2023-04-13 21:10   ` David Laight
  5 siblings, 1 reply; 26+ messages in thread
From: Atish Patra @ 2023-04-13 19:17 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc, linux-kernel, linux-perf-users, linux-riscv,
	linux-arm-kernel

On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> riscv used to allow direct access to cycle/time/instret counters,
> bypassing the perf framework, this patchset intends to allow the user to
> mmap any counter when accessed through perf. But we can't break the
> existing behaviour so we introduce a sysctl perf_user_access like arm64
> does, which defaults to the legacy mode described above.
>

It would be good to provide additional direction for user space packages:

The legacy behavior is supported for now in order to avoid breaking
existing software. However, reading counters directly without going
through perf may return incorrect values, which userspace software must
avoid. We hope that the user space packages which read cycle/instret
directly will eventually move to the proper interface if they actually
need those counters. Most users are supposed to read "time" instead of
"cycle" if they intend to read timestamps.

The legacy sysctl option will be removed in the future. The plan is
that distros will switch the default to SYSCTL_USER_ACCESS, which
enables user access to counters only through perf, sooner than that
(as soon as they have made sure that the packages built for that distro
don't read cycle/instret directly).
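
For packages wondering what "the proper interface" looks like, the read
sequence documented for the perf mmap page in perf_event.h can be sketched
as below. This is an illustration under stated assumptions, not code from
the series: struct userpage_view is a hypothetical subset of
struct perf_event_mmap_page, and fake_counter() stands in for the csrr
instruction so that the sketch can be compiled on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct perf_event_mmap_page, for illustration. */
struct userpage_view {
	uint32_t lock;      /* sequence count, bumped around kernel updates */
	uint32_t index;     /* 0: no direct access, else counter index + 1 */
	int64_t  offset;    /* value to add to the raw hardware count */
	uint16_t pmc_width; /* number of implemented counter bits */
};

/* Stand-in for the CSR read; on riscv this would be a csrr instruction. */
typedef uint64_t (*read_counter_fn)(unsigned int counter);

/* Stub used when running off-target: returns a recognisable value. */
static uint64_t fake_counter(unsigned int counter)
{
	return 100 + counter;
}

/*
 * Returns 0 and stores the event count in *count when direct access is
 * granted, or -1 when the caller must fall back to read(2) on the fd.
 */
static int userpage_read(const volatile struct userpage_view *pg,
			 read_counter_fn read_counter, uint64_t *count)
{
	uint32_t seq, idx;
	uint16_t width;
	int64_t offset, pmc;
	uint64_t raw = 0;

	/* Retry if the kernel updated the page while we were reading it. */
	do {
		seq = pg->lock;
		__sync_synchronize();
		idx = pg->index;
		offset = pg->offset;
		width = pg->pmc_width;
		if (idx)
			raw = read_counter(idx - 1);
		__sync_synchronize();
	} while (pg->lock != seq);

	if (!idx)
		return -1;

	assert(width >= 1 && width <= 64);
	/*
	 * Sign-extend the raw value from pmc_width bits. This relies on
	 * arithmetic right shift of signed values, as gcc and clang provide.
	 */
	pmc = (int64_t)(raw << (64 - width)) >> (64 - width);
	*count = (uint64_t)offset + (uint64_t)pmc;
	return 0;
}
```

An index of 0 is how the kernel tells user space that direct access was not
granted for this event, so callers should always keep the read(2) fallback.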

> The core of this patchset lies in patch 4, the first 3 patches are
> simple fixes.
>
> base-commit-tag: v6.3-rc1
>
> Alexandre Ghiti (4):
>   perf: Fix wrong comment about default event_idx
>   include: riscv: Fix wrong include guard in riscv_pmu.h
>   riscv: Make legacy counter enum match the HW numbering
>   riscv: Enable perf counters user access only through perf
>
>  Documentation/admin-guide/sysctl/kernel.rst |  23 +++-
>  arch/riscv/include/asm/perf_event.h         |   3 +
>  arch/riscv/kernel/Makefile                  |   2 +-
>  arch/riscv/kernel/perf_event.c              |  65 +++++++++++
>  drivers/perf/riscv_pmu.c                    |  42 ++++++++
>  drivers/perf/riscv_pmu_legacy.c             |  24 ++++-
>  drivers/perf/riscv_pmu_sbi.c                | 113 ++++++++++++++++++--
>  include/linux/perf/riscv_pmu.h              |   9 +-
>  include/linux/perf_event.h                  |   3 +-
>  tools/lib/perf/mmap.c                       |  65 +++++++++++
>  10 files changed, 332 insertions(+), 17 deletions(-)
>  create mode 100644 arch/riscv/kernel/perf_event.c
>
> --
> 2.37.2
>


-- 
Regards,
Atish


* RE: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-13 19:17 ` Atish Patra
@ 2023-04-13 21:10   ` David Laight
  2023-04-18 16:43     ` Atish Patra
  0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2023-04-13 21:10 UTC (permalink / raw)
  To: 'Atish Patra', Alexandre Ghiti
  Cc: Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org

From: Atish Patra
> Sent: 13 April 2023 20:18
> 
> On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > riscv used to allow direct access to cycle/time/instret counters,
> > bypassing the perf framework, this patchset intends to allow the user to
> > mmap any counter when accessed through perf. But we can't break the
> > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > does, which defaults to the legacy mode described above.
> >
> 
> It would be good provide additional direction for user space packages:
> 
> The legacy behavior is supported for now in order to avoid breaking
> existing software.
> However, reading counters directly without perf interaction may
> provide incorrect values which
> the userspace software must avoid. We are hoping that the user space
> packages which
> read the cycle/instret directly, will move to the proper interface
> eventually if they actually need it.
> Most of the users are supposed to read "time" instead of "cycle" if
> they intend to read timestamps.

If you are trying to measure the performance of short code
fragments then you need pretty much raw access directly to
the cycle/clock count register.

I've done this on x86 to compare the actual cycle times
of different implementations of the IP checksum loop
(and compare them to the theoretical limit).
The perf framework just added far too much latency;
only directly reading the CPU registers gave anything
like reliable (and consistent) answers.

Clearly process switches (especially cpu migrations) cause
problems, but they are obviously invalid values and can
be ignored.

So while a lot of uses may be 'happy' with the values the
perf framework gives, sometimes you do need to directly
read the relevant registers.
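
As a rough illustration of the direct-read approach described above
(this is not code from the thread), the sketch below times a fragment
several times and keeps the smallest delta, so that runs disturbed by a
context switch or a CPU migration drop out. Taking the minimum is an
assumed filtering choice, and read_counter(), min_sample(),
measure_min() and example_fragment() are hypothetical names. The x86
branch reads the TSC rather than a true core-cycle counter, and the
RISC-V branch only works while the kernel still permits user reads of
"cycle".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L	/* for clock_gettime() in the fallback */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define MAX_RUNS 64	/* hypothetical cap on the number of samples */

/* Read a free-running counter; the per-architecture choices are assumptions. */
static inline uint64_t read_counter(void)
{
#if defined(__riscv) && (__riscv_xlen == 64)
	uint64_t v;

	/* Traps unless the kernel allows user access to "cycle". */
	__asm__ volatile("rdcycle %0" : "=r"(v));
	return v;
#elif defined(__x86_64__) || defined(__i386__)
	return __builtin_ia32_rdtsc();	/* TSC ticks, not core cycles */
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Smallest sample, or UINT64_MAX when there are no samples. */
static uint64_t min_sample(const uint64_t *samples, size_t n)
{
	uint64_t best = UINT64_MAX;

	for (size_t i = 0; i < n; i++)
		if (samples[i] < best)
			best = samples[i];
	return best;
}

/* Time fragment() up to MAX_RUNS times and return the smallest delta. */
static uint64_t measure_min(void (*fragment)(void), size_t runs)
{
	uint64_t samples[MAX_RUNS];

	assert(fragment != NULL);
	if (runs > MAX_RUNS)
		runs = MAX_RUNS;

	for (size_t i = 0; i < runs; i++) {
		uint64_t start = read_counter();

		fragment();
		/* Unsigned subtraction: a backwards jump wraps to a huge value. */
		samples[i] = read_counter() - start;
	}
	return min_sample(samples, runs);
}

/* Stand-in for the code under test. */
static void example_fragment(void)
{
	volatile uint32_t sum = 0;

	for (uint32_t i = 0; i < 1000; i++)
		sum += i;
}
```

Calling measure_min(example_fragment, 16) returns the best of 16 runs
in counter ticks. The minimum only rejects samples that were inflated;
a migration that happens to produce a small positive delta would still
need a separate sanity check.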

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-13 21:10   ` David Laight
@ 2023-04-18 16:43     ` Atish Patra
  2023-04-18 18:15       ` Ian Rogers
  0 siblings, 1 reply; 26+ messages in thread
From: Atish Patra @ 2023-04-18 16:43 UTC (permalink / raw)
  To: David Laight
  Cc: Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org

On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Atish Patra
> > Sent: 13 April 2023 20:18
> >
> > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > riscv used to allow direct access to cycle/time/instret counters,
> > > bypassing the perf framework, this patchset intends to allow the user to
> > > mmap any counter when accessed through perf. But we can't break the
> > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > does, which defaults to the legacy mode described above.
> > >
> >
> > It would be good provide additional direction for user space packages:
> >
> > The legacy behavior is supported for now in order to avoid breaking
> > existing software.
> > However, reading counters directly without perf interaction may
> > provide incorrect values which
> > the userspace software must avoid. We are hoping that the user space
> > packages which
> > read the cycle/instret directly, will move to the proper interface
> > eventually if they actually need it.
> > Most of the users are supposed to read "time" instead of "cycle" if
> > they intend to read timestamps.
>
> If you are trying to measure the performance of short code
> fragments then you need pretty much raw access directly to
> the cycle/clock count register.
>
> I've done this on x86 to compare the actual cycle times
> of different implementations of the IP checksum loop
> (and compare them to the theoretical limit).
> The perf framework just added far too much latency,
> only directly reading the cpu registers gave anything
> like reliable (and consistent) answers.
>

This series allows direct access to the counters once they have been
configured through perf.
Previously, the cycle/instret counters were exposed directly to
userspace without the kernel/perf framework knowing when, or which,
userspace application was reading them. That has security implications.

With this series applied, the userspace application just needs to
configure the event (cycle/instret) through the perf syscall.
Once configured, the application can find the counter information in
the mmap'ed page and read the counter directly. There is no added
latency when reading the counters.

This mechanism allows the kernel to stop/clear the counters when the
requesting task is not running. It also takes care of context
switches, which may result in invalid values as you mention below.
This is nothing new: other architectures (x86, ARM64) allow userspace
counter reads through the same mechanism.
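
A minimal sketch of the read side of that mechanism, following the
sequence-lock loop documented in the comments of
include/uapi/linux/perf_event.h. It assumes the page was obtained by
mmap()ing a perf_event_open() file descriptor (setup not shown).
read_user_counter(), hw_read_fn and stub_hw_read() are hypothetical
names; the stub stands in for the architecture-specific read (rdpmc on
x86, a counter CSR read on RISC-V).

```c
#include <assert.h>
#include <stdint.h>
#include <linux/perf_event.h>

/* Reads hardware counter number 'counter' (architecture-specific). */
typedef uint64_t (*hw_read_fn)(uint32_t counter);

/* Hypothetical stand-in so the sketch runs on any architecture. */
static uint64_t stub_hw_read(uint32_t counter)
{
	(void)counter;
	return 42;
}

/*
 * Returns 0 and stores the event count in *value when the kernel has
 * granted direct access, or -1 when the caller should fall back to
 * read(2) on the perf file descriptor.
 */
static int read_user_counter(const volatile struct perf_event_mmap_page *pc,
			     hw_read_fn hw_read, uint64_t *value)
{
	uint32_t seq, idx;
	uint16_t width;
	int64_t count;
	uint64_t raw;
	int direct;

	assert(pc && hw_read && value);

	do {
		seq = pc->lock;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);

		idx = pc->index;	/* 0 means "no counter assigned" */
		count = pc->offset;	/* value accumulated by the kernel */
		width = pc->pmc_width;
		direct = pc->cap_user_rdpmc && idx != 0;
		raw = direct ? hw_read(idx - 1) : 0;

		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while (pc->lock != seq);	/* retry if the kernel updated the page */

	if (!direct)
		return -1;

	if (width > 0 && width < 64) {
		/* Sign-extend the hardware value from 'width' bits. */
		int64_t pmc = (int64_t)(raw << (64 - width)) >> (64 - width);

		count += pmc;
	} else {
		count += (int64_t)raw;
	}

	*value = (uint64_t)count;
	return 0;
}
```

The retry loop is what handles the context-switch case: if the kernel
reschedules the event while the page is being read, pc->lock changes
and the read is repeated with the new index and offset.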

Here is the relevant upstream discussion:
https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/

ARM64:
https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access

example usage in x86:
https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c

> Clearly process switches (especially cpu migrations) cause
> problems, but they are obviously invalid values and can
> be ignored.
>
> So while a lot of uses may be 'happy' with the values the
> perf framework gives, sometimes you do need to directly
> read the relevant registers.
>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)



-- 
Regards,
Atish


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-18 16:43     ` Atish Patra
@ 2023-04-18 18:15       ` Ian Rogers
  2023-04-18 20:30         ` Atish Patra
  0 siblings, 1 reply; 26+ messages in thread
From: Ian Rogers @ 2023-04-18 18:15 UTC (permalink / raw)
  To: Atish Patra
  Cc: David Laight, Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org, paranlee

On Tue, Apr 18, 2023 at 9:43 AM Atish Patra <atishp@atishpatra.org> wrote:
>
> On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Atish Patra
> > > Sent: 13 April 2023 20:18
> > >
> > > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > >
> > > > riscv used to allow direct access to cycle/time/instret counters,
> > > > bypassing the perf framework, this patchset intends to allow the user to
> > > > mmap any counter when accessed through perf. But we can't break the
> > > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > > does, which defaults to the legacy mode described above.
> > > >
> > >
> > > It would be good provide additional direction for user space packages:
> > >
> > > The legacy behavior is supported for now in order to avoid breaking
> > > existing software.
> > > However, reading counters directly without perf interaction may
> > > provide incorrect values which
> > > the userspace software must avoid. We are hoping that the user space
> > > packages which
> > > read the cycle/instret directly, will move to the proper interface
> > > eventually if they actually need it.
> > > Most of the users are supposed to read "time" instead of "cycle" if
> > > they intend to read timestamps.
> >
> > If you are trying to measure the performance of short code
> > fragments then you need pretty much raw access directly to
> > the cycle/clock count register.
> >
> > I've done this on x86 to compare the actual cycle times
> > of different implementations of the IP checksum loop
> > (and compare them to the theoretical limit).
> > The perf framework just added far too much latency,
> > only directly reading the cpu registers gave anything
> > like reliable (and consistent) answers.
> >
>
> This series allows direct access to the counters once configured
> through the perf.
> Earlier the cycle/instret counters are directly exposed to the
> userspace without kernel/perf frameworking knowing
> when/which user space application is reading it. That has security implications.
>
> With this series applied, the user space application just needs to
> configure the event (cycle/instret) through perf syscall.
> Once configured, the userspace application can find out the counter
> information from the mmap & directly
> read the counter. There is no latency while reading the counters.
>
> This mechanism allows stop/clear the counters when the requesting task
> is not running. It also takes care of context switching
> which may result in invalid values as you mentioned below. This is
> nothing new and all other arch (x86, ARM64) allow user space
> counter read through the same mechanism.
>
> Here is the relevant upstream discussion:
> https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/
>
> ARM64:
> https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access
>
> example usage in x86:
> https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c

The canonical implementation of this should be:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n400
which is updated in these patches but the tests are not:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/tests/mmap-basic.c#n287
Which appears to be an oversight. The tests distinguish between x86
and aarch64, which are assumed to have userspace hardware counter
access, and everything else, which is assumed not to.

Thanks,
Ian

> > Clearly process switches (especially cpu migrations) cause
> > problems, but they are obviously invalid values and can
> > be ignored.
> >
> > So while a lot of uses may be 'happy' with the values the
> > perf framework gives, sometimes you do need to directly
> > read the relevant registers.
> >
> >         David
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
>
>
>
> --
> Regards,
> Atish


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-18 18:15       ` Ian Rogers
@ 2023-04-18 20:30         ` Atish Patra
  2023-04-19  9:21           ` Alexandre Ghiti
  0 siblings, 1 reply; 26+ messages in thread
From: Atish Patra @ 2023-04-18 20:30 UTC (permalink / raw)
  To: Ian Rogers
  Cc: David Laight, Alexandre Ghiti, Jonathan Corbet, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org, paranlee

On Tue, Apr 18, 2023 at 11:46 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Apr 18, 2023 at 9:43 AM Atish Patra <atishp@atishpatra.org> wrote:
> >
> > On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Atish Patra
> > > > Sent: 13 April 2023 20:18
> > > >
> > > > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > riscv used to allow direct access to cycle/time/instret counters,
> > > > > bypassing the perf framework, this patchset intends to allow the user to
> > > > > mmap any counter when accessed through perf. But we can't break the
> > > > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > > > does, which defaults to the legacy mode described above.
> > > > >
> > > >
> > > > It would be good provide additional direction for user space packages:
> > > >
> > > > The legacy behavior is supported for now in order to avoid breaking
> > > > existing software.
> > > > However, reading counters directly without perf interaction may
> > > > provide incorrect values which
> > > > the userspace software must avoid. We are hoping that the user space
> > > > packages which
> > > > read the cycle/instret directly, will move to the proper interface
> > > > eventually if they actually need it.
> > > > Most of the users are supposed to read "time" instead of "cycle" if
> > > > they intend to read timestamps.
> > >
> > > If you are trying to measure the performance of short code
> > > fragments then you need pretty much raw access directly to
> > > the cycle/clock count register.
> > >
> > > I've done this on x86 to compare the actual cycle times
> > > of different implementations of the IP checksum loop
> > > (and compare them to the theoretical limit).
> > > The perf framework just added far too much latency,
> > > only directly reading the cpu registers gave anything
> > > like reliable (and consistent) answers.
> > >
> >
> > This series allows direct access to the counters once configured
> > through the perf.
> > Earlier the cycle/instret counters are directly exposed to the
> > userspace without kernel/perf frameworking knowing
> > when/which user space application is reading it. That has security implications.
> >
> > With this series applied, the user space application just needs to
> > configure the event (cycle/instret) through perf syscall.
> > Once configured, the userspace application can find out the counter
> > information from the mmap & directly
> > read the counter. There is no latency while reading the counters.
> >
> > This mechanism allows stop/clear the counters when the requesting task
> > is not running. It also takes care of context switching
> > which may result in invalid values as you mentioned below. This is
> > nothing new and all other arch (x86, ARM64) allow user space
> > counter read through the same mechanism.
> >
> > Here is the relevant upstream discussion:
> > https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/
> >
> > ARM64:
> > https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access
> >
> > example usage in x86:
> > https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c
>
> The canonical implementation of this should be:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n400

Thanks for sharing the libperf implementation.

> which is updated in these patches but the tests are not:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/tests/mmap-basic.c#n287
> Which appears to be an oversight. The tests display some differences

Yes. It's an oversight. We should make sure that perf mmap tests pass
for RISC-V as well.


> between x86 and aarch64 that have assumed userspace hardware counter
> access, and everything else that it is assumed don't.
>
> Thanks,
> Ian
>
> > > Clearly process switches (especially cpu migrations) cause
> > > problems, but they are obviously invalid values and can
> > > be ignored.
> > >
> > > So while a lot of uses may be 'happy' with the values the
> > > perf framework gives, sometimes you do need to directly
> > > read the relevant registers.
> > >
> > >         David
> > >
> > > -
> > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > Registration No: 1397386 (Wales)
> >
> >
> >
> > --
> > Regards,
> > Atish



--
Regards,
Atish


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-18 20:30         ` Atish Patra
@ 2023-04-19  9:21           ` Alexandre Ghiti
  2023-04-19 17:42             ` Ian Rogers
  0 siblings, 1 reply; 26+ messages in thread
From: Alexandre Ghiti @ 2023-04-19  9:21 UTC (permalink / raw)
  To: Atish Patra
  Cc: Ian Rogers, David Laight, Jonathan Corbet, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org, paranlee

Hi Ian,

On Tue, Apr 18, 2023 at 10:30 PM Atish Patra <atishp@atishpatra.org> wrote:
>
> On Tue, Apr 18, 2023 at 11:46 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Tue, Apr 18, 2023 at 9:43 AM Atish Patra <atishp@atishpatra.org> wrote:
> > >
> > > On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
> > > >
> > > > From: Atish Patra
> > > > > Sent: 13 April 2023 20:18
> > > > >
> > > > > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > >
> > > > > > riscv used to allow direct access to cycle/time/instret counters,
> > > > > > bypassing the perf framework, this patchset intends to allow the user to
> > > > > > mmap any counter when accessed through perf. But we can't break the
> > > > > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > > > > does, which defaults to the legacy mode described above.
> > > > > >
> > > > >
> > > > > It would be good provide additional direction for user space packages:
> > > > >
> > > > > The legacy behavior is supported for now in order to avoid breaking
> > > > > existing software.
> > > > > However, reading counters directly without perf interaction may
> > > > > provide incorrect values which
> > > > > the userspace software must avoid. We are hoping that the user space
> > > > > packages which
> > > > > read the cycle/instret directly, will move to the proper interface
> > > > > eventually if they actually need it.
> > > > > Most of the users are supposed to read "time" instead of "cycle" if
> > > > > they intend to read timestamps.
> > > >
> > > > If you are trying to measure the performance of short code
> > > > fragments then you need pretty much raw access directly to
> > > > the cycle/clock count register.
> > > >
> > > > I've done this on x86 to compare the actual cycle times
> > > > of different implementations of the IP checksum loop
> > > > (and compare them to the theoretical limit).
> > > > The perf framework just added far too much latency,
> > > > only directly reading the cpu registers gave anything
> > > > like reliable (and consistent) answers.
> > > >
> > >
> > > This series allows direct access to the counters once configured
> > > through the perf.
> > > Earlier the cycle/instret counters are directly exposed to the
> > > userspace without kernel/perf frameworking knowing
> > > when/which user space application is reading it. That has security implications.
> > >
> > > With this series applied, the user space application just needs to
> > > configure the event (cycle/instret) through perf syscall.
> > > Once configured, the userspace application can find out the counter
> > > information from the mmap & directly
> > > read the counter. There is no latency while reading the counters.
> > >
> > > This mechanism allows stop/clear the counters when the requesting task
> > > is not running. It also takes care of context switching
> > > which may result in invalid values as you mentioned below. This is
> > > nothing new and all other arch (x86, ARM64) allow user space
> > > counter read through the same mechanism.
> > >
> > > Here is the relevant upstream discussion:
> > > https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/
> > >
> > > ARM64:
> > > https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access
> > >
> > > example usage in x86:
> > > https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c
> >
> > The canonical implementation of this should be:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n400
>
> Thanks for sharing the libperf implementation.
>
> > which is updated in these patches but the tests are not:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/tests/mmap-basic.c#n287
> > Which appears to be an oversight. The tests display some differences
>
> Yes. It's an oversight. We should make sure that perf mmap tests pass
> for RISC-V as well.

Yes, that's an oversight. I had a local test adapted from this one but
forgot to update it afterwards; I'll do that in the next version.

Thanks for your quick feedback, and sorry for the late reply,

Alex


>
>
> > between x86 and aarch64 that have assumed userspace hardware counter
> > access, and everything else that it is assumed don't.
> >
> > Thanks,
> > Ian
> >
> > > > Clearly process switches (especially cpu migrations) cause
> > > > problems, but they are obviously invalid values and can
> > > > be ignored.
> > > >
> > > > So while a lot of uses may be 'happy' with the values the
> > > > perf framework gives, sometimes you do need to directly
> > > > read the relevant registers.
> > > >
> > > >         David
> > > >
> > > > -
> > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > Registration No: 1397386 (Wales)
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Atish
>
>
>
> --
> Regards,
> Atish


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-19  9:21           ` Alexandre Ghiti
@ 2023-04-19 17:42             ` Ian Rogers
  2023-04-19 23:21               ` Atish Patra
  0 siblings, 1 reply; 26+ messages in thread
From: Ian Rogers @ 2023-04-19 17:42 UTC (permalink / raw)
  To: Alexandre Ghiti, paranlee
  Cc: Atish Patra, David Laight, Jonathan Corbet, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon, Rob Herring,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org

On Wed, Apr 19, 2023 at 2:21 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Ian,
>
> On Tue, Apr 18, 2023 at 10:30 PM Atish Patra <atishp@atishpatra.org> wrote:
> >
> > On Tue, Apr 18, 2023 at 11:46 PM Ian Rogers <irogers@google.com> wrote:
> > >
> > > On Tue, Apr 18, 2023 at 9:43 AM Atish Patra <atishp@atishpatra.org> wrote:
> > > >
> > > > On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
> > > > >
> > > > > From: Atish Patra
> > > > > > Sent: 13 April 2023 20:18
> > > > > >
> > > > > > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > > >
> > > > > > > riscv used to allow direct access to cycle/time/instret counters,
> > > > > > > bypassing the perf framework, this patchset intends to allow the user to
> > > > > > > mmap any counter when accessed through perf. But we can't break the
> > > > > > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > > > > > does, which defaults to the legacy mode described above.
> > > > > > >
> > > > > >
> > > > > > It would be good provide additional direction for user space packages:
> > > > > >
> > > > > > The legacy behavior is supported for now in order to avoid breaking
> > > > > > existing software.
> > > > > > However, reading counters directly without perf interaction may
> > > > > > provide incorrect values which
> > > > > > the userspace software must avoid. We are hoping that the user space
> > > > > > packages which
> > > > > > read the cycle/instret directly, will move to the proper interface
> > > > > > eventually if they actually need it.
> > > > > > Most of the users are supposed to read "time" instead of "cycle" if
> > > > > > they intend to read timestamps.
> > > > >
> > > > > If you are trying to measure the performance of short code
> > > > > fragments then you need pretty much raw access directly to
> > > > > the cycle/clock count register.
> > > > >
> > > > > I've done this on x86 to compare the actual cycle times
> > > > > of different implementations of the IP checksum loop
> > > > > (and compare them to the theoretical limit).
> > > > > The perf framework just added far too much latency,
> > > > > only directly reading the cpu registers gave anything
> > > > > like reliable (and consistent) answers.
> > > > >
> > > >
> > > > This series allows direct access to the counters once configured
> > > > through the perf.
> > > > Earlier the cycle/instret counters are directly exposed to the
> > > > userspace without kernel/perf frameworking knowing
> > > > when/which user space application is reading it. That has security implications.
> > > >
> > > > With this series applied, the user space application just needs to
> > > > configure the event (cycle/instret) through perf syscall.
> > > > Once configured, the userspace application can find out the counter
> > > > information from the mmap & directly
> > > > read the counter. There is no latency while reading the counters.
> > > >
> > > > This mechanism allows stop/clear the counters when the requesting task
> > > > is not running. It also takes care of context switching
> > > > which may result in invalid values as you mentioned below. This is
> > > > nothing new and all other arch (x86, ARM64) allow user space
> > > > counter read through the same mechanism.
> > > >
> > > > Here is the relevant upstream discussion:
> > > > https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/
> > > >
> > > > ARM64:
> > > > https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access
> > > >
> > > > example usage in x86:
> > > > https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c
> > >
> > > The canonical implementation of this should be:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n400
> >
> > Thanks for sharing the libperf implementation.
> >
> > > which is updated in these patches but the tests are not:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/tests/mmap-basic.c#n287
> > > Which appears to be an oversight. The tests display some differences
> >
> > Yes. It's an oversight. We should make sure that perf mmap tests pass
> > for RISC-V as well.
>
> Yes, that's an oversight, I had a local test adapted from this one but
> forgot to update it afterwards, I'll do that in the next version.
>
> Thanks for your quick feedbacks and sorry for being late,
>
> Alex

Thanks Alex, there was an equally likely chance that I wasn't
understanding things :-) Is there any information on RISC-V PMU
testing? I know ParanLee is interested. It'd be awesome to have
something, say on:
https://perf.wiki.kernel.org/index.php/Main_Page
on how to run the tests, perhaps on QEMU or on boards known to work.
We can also just drop a link on there if there is information. We can also
add the RISC-V PMU information to the links here:
https://perf.wiki.kernel.org/index.php/Useful_Links

Thanks,
Ian

>
> >
> >
> > > between x86 and aarch64 that have assumed userspace hardware counter
> > > access, and everything else that it is assumed don't.
> > >
> > > Thanks,
> > > Ian
> > >
> > > > > Clearly process switches (especially cpu migrations) cause
> > > > > problems, but they are obviously invalid values and can
> > > > > be ignored.
> > > > >
> > > > > So while a lot of uses may be 'happy' with the values the
> > > > > perf framework gives, sometimes you do need to directly
> > > > > read the relevant registers.
> > > > >
> > > > >         David
> > > > >
> > > > > -
> > > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > > Registration No: 1397386 (Wales)
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Atish
> >
> >
> >
> > --
> > Regards,
> > Atish


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-19 17:42             ` Ian Rogers
@ 2023-04-19 23:21               ` Atish Patra
  2023-04-20  0:31                 ` Ian Rogers
  0 siblings, 1 reply; 26+ messages in thread
From: Atish Patra @ 2023-04-19 23:21 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Alexandre Ghiti, paranlee, David Laight, Jonathan Corbet,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon,
	Rob Herring, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org

On Wed, Apr 19, 2023 at 11:13 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Apr 19, 2023 at 2:21 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> >
> > Hi Ian,
> >
> > On Tue, Apr 18, 2023 at 10:30 PM Atish Patra <atishp@atishpatra.org> wrote:
> > >
> > > On Tue, Apr 18, 2023 at 11:46 PM Ian Rogers <irogers@google.com> wrote:
> > > >
> > > > On Tue, Apr 18, 2023 at 9:43 AM Atish Patra <atishp@atishpatra.org> wrote:
> > > > >
> > > > > On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
> > > > > >
> > > > > > From: Atish Patra
> > > > > > > Sent: 13 April 2023 20:18
> > > > > > >
> > > > > > > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > > > >
> > > > > > > > riscv used to allow direct access to cycle/time/instret counters,
> > > > > > > > bypassing the perf framework, this patchset intends to allow the user to
> > > > > > > > mmap any counter when accessed through perf. But we can't break the
> > > > > > > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > > > > > > does, which defaults to the legacy mode described above.
> > > > > > > >
> > > > > > >
> > > > > > > It would be good provide additional direction for user space packages:
> > > > > > >
> > > > > > > The legacy behavior is supported for now in order to avoid breaking
> > > > > > > existing software.
> > > > > > > However, reading counters directly without perf interaction may
> > > > > > > provide incorrect values which
> > > > > > > the userspace software must avoid. We are hoping that the user space
> > > > > > > packages which
> > > > > > > read the cycle/instret directly, will move to the proper interface
> > > > > > > eventually if they actually need it.
> > > > > > > Most of the users are supposed to read "time" instead of "cycle" if
> > > > > > > they intend to read timestamps.
> > > > > >
> > > > > > If you are trying to measure the performance of short code
> > > > > > fragments then you need pretty much raw access directly to
> > > > > > the cycle/clock count register.
> > > > > >
> > > > > > I've done this on x86 to compare the actual cycle times
> > > > > > of different implementations of the IP checksum loop
> > > > > > (and compare them to the theoretical limit).
> > > > > > The perf framework just added far too much latency,
> > > > > > only directly reading the cpu registers gave anything
> > > > > > like reliable (and consistent) answers.
> > > > > >
> > > > >
> > > > > This series allows direct access to the counters once configured
> > > > > through the perf.
> > > > > Earlier the cycle/instret counters are directly exposed to the
> > > > > userspace without kernel/perf frameworking knowing
> > > > > when/which user space application is reading it. That has security implications.
> > > > >
> > > > > With this series applied, the user space application just needs to
> > > > > configure the event (cycle/instret) through perf syscall.
> > > > > Once configured, the userspace application can find out the counter
> > > > > information from the mmap & directly
> > > > > read the counter. There is no latency while reading the counters.
> > > > >
> > > > > This mechanism allows stop/clear the counters when the requesting task
> > > > > is not running. It also takes care of context switching
> > > > > which may result in invalid values as you mentioned below. This is
> > > > > nothing new and all other arch (x86, ARM64) allow user space
> > > > > counter read through the same mechanism.
> > > > >
> > > > > Here is the relevant upstream discussion:
> > > > > https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/
> > > > >
> > > > > ARM64:
> > > > > https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access
> > > > >
> > > > > example usage in x86:
> > > > > https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c
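
For readers following the thread, the read sequence described in the
quoted text above (configure the event through perf, mmap the event fd,
then read the counter directly) can be sketched roughly as below. This is
only an illustration, assuming the seqlock pattern documented in the
comments of linux/perf_event.h. The names read_user_counter, hw_read_fn and
demo_read_hw are hypothetical and not part of this series; on real hardware
the callback would execute the architecture's counter-read instruction
(rdpmc on x86, a counter CSR read on RISC-V), and pc would point at the
page returned by mmap() on a perf_event_open() file descriptor.

```c
#include <stdint.h>
#include <linux/perf_event.h>   /* struct perf_event_mmap_page */

/* Compiler barrier: keeps the page reads inside the lock/seq window. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical callback standing in for the arch-specific counter read. */
typedef uint64_t (*hw_read_fn)(uint32_t hw_index);

/* Illustration-only stand-in: always reports a raw counter value of 42. */
static uint64_t demo_read_hw(uint32_t hw_index)
{
    (void)hw_index;
    return 42;
}

/*
 * Read the current event count from user space.
 * Returns 0 and stores the count in *out on success, or -1 when the
 * kernel has not granted direct access (the caller should then fall
 * back to read() on the perf event fd).
 */
static int read_user_counter(const volatile struct perf_event_mmap_page *pc,
                             hw_read_fn read_hw, uint64_t *out)
{
    uint32_t seq, index;
    uint16_t width;
    int64_t count;

    do {
        /* The kernel bumps 'lock' whenever it rewrites the page. */
        seq = pc->lock;
        barrier();

        /* index == 0 means the event is not on a hardware counter now. */
        index = pc->index;
        if (!pc->cap_user_rdpmc || index == 0)
            return -1;

        width = pc->pmc_width;
        if (width == 0 || width > 64)
            return -1;

        /* offset is the kernel-maintained base; add the live delta. */
        count = pc->offset;
        uint64_t raw = read_hw(index - 1);
        /* Sign-extend the raw value from 'width' bits to 64 bits. */
        count += (int64_t)(raw << (64 - width)) >> (64 - width);

        barrier();
    } while (pc->lock != seq);   /* retry if the page changed under us */

    *out = (uint64_t)count;
    return 0;
}
```

The retry loop is what handles the context-switch case discussed above: if
the kernel reschedules the event while the page is being read, 'lock'
changes and the read is repeated instead of returning a mixed-up value.
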
> > > >
> > > > The canonical implementation of this should be:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n400
> > >
> > > Thanks for sharing the libperf implementation.
> > >
> > > > which is updated in these patches but the tests are not:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/tests/mmap-basic.c#n287
> > > > Which appears to be an oversight. The tests display some differences
> > >
> > > Yes. It's an oversight. We should make sure that perf mmap tests pass
> > > for RISC-V as well.
> >
> > Yes, that's an oversight, I had a local test adapted from this one but
> > forgot to update it afterwards, I'll do that in the next version.
> >
> > Thanks for your quick feedbacks and sorry for being late,
> >
> > Alex
>
> Thanks Alex, there was an equally likely chance that I wasn't
> understanding things :-) Is there any information on RISC-V PMU
> testing? I know ParanLee is interested. It'd be awesome to have

Are you looking for something specific to RISC-V in general or to perf
on RISC-V?
All the RISC-V PMU patches have been upstream for a while (in both
QEMU and the Linux kernel).
Perf should work out of the box when you boot the latest kernel in the
latest version of QEMU.

Initial KVM support [1] was merged during the last merge window. It
doesn't support event sampling yet; we are working on that.

[1] https://lore.kernel.org/lkml/20230207095529.1787260-1-atishp@rivosinc.com/

> something say on:
> https://perf.wiki.kernel.org/index.php/Main_Page
> on how to run tests, perhaps on QEMU or known to work boards. We can
> also just drop a link on there if there is information. We can also
> add the RISC-V PMU information to the links here:
> https://perf.wiki.kernel.org/index.php/Useful_Links
>

I did not see any arch-specific information there. Let us know what
would be good to add, and we would be happy to add it.

> Thanks,
> Ian
>
> >
> > >
> > >
> > > > between x86 and aarch64 that have assumed userspace hardware counter
> > > > access, and everything else that it is assumed don't.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > > > Clearly process switches (especially cpu migrations) cause
> > > > > > problems, but they are obviously invalid values and can
> > > > > > be ignored.
> > > > > >
> > > > > > So while a lot of uses may be 'happy' with the values the
> > > > > > perf framework gives, sometimes you do need to directly
> > > > > > read the relevant registers.
> > > > > >
> > > > > >         David
> > > > > >
> > > > > > -
> > > > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > > > Registration No: 1397386 (Wales)
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Atish
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Atish



-- 
Regards,
Atish


* Re: [PATCH 0/4] riscv: Allow userspace to directly access perf counters
  2023-04-19 23:21               ` Atish Patra
@ 2023-04-20  0:31                 ` Ian Rogers
  0 siblings, 0 replies; 26+ messages in thread
From: Ian Rogers @ 2023-04-20  0:31 UTC (permalink / raw)
  To: Atish Patra
  Cc: Alexandre Ghiti, paranlee, David Laight, Jonathan Corbet,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Will Deacon,
	Rob Herring, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	linux-riscv@lists.infradead.org,
	linux-arm-kernel@lists.infradead.org

On Wed, Apr 19, 2023 at 4:22 PM Atish Patra <atishp@atishpatra.org> wrote:
>
> On Wed, Apr 19, 2023 at 11:13 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Wed, Apr 19, 2023 at 2:21 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > Hi Ian,
> > >
> > > On Tue, Apr 18, 2023 at 10:30 PM Atish Patra <atishp@atishpatra.org> wrote:
> > > >
> > > > On Tue, Apr 18, 2023 at 11:46 PM Ian Rogers <irogers@google.com> wrote:
> > > > >
> > > > > On Tue, Apr 18, 2023 at 9:43 AM Atish Patra <atishp@atishpatra.org> wrote:
> > > > > >
> > > > > > On Fri, Apr 14, 2023 at 2:40 AM David Laight <David.Laight@aculab.com> wrote:
> > > > > > >
> > > > > > > From: Atish Patra
> > > > > > > > Sent: 13 April 2023 20:18
> > > > > > > >
> > > > > > > > On Thu, Apr 13, 2023 at 9:47 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > > > > > >
> > > > > > > > > riscv used to allow direct access to cycle/time/instret counters,
> > > > > > > > > bypassing the perf framework, this patchset intends to allow the user to
> > > > > > > > > mmap any counter when accessed through perf. But we can't break the
> > > > > > > > > existing behaviour so we introduce a sysctl perf_user_access like arm64
> > > > > > > > > does, which defaults to the legacy mode described above.
> > > > > > > > >
> > > > > > > >
> > > > > > > > It would be good provide additional direction for user space packages:
> > > > > > > >
> > > > > > > > The legacy behavior is supported for now in order to avoid breaking
> > > > > > > > existing software.
> > > > > > > > However, reading counters directly without perf interaction may
> > > > > > > > provide incorrect values which
> > > > > > > > the userspace software must avoid. We are hoping that the user space
> > > > > > > > packages which
> > > > > > > > read the cycle/instret directly, will move to the proper interface
> > > > > > > > eventually if they actually need it.
> > > > > > > > Most of the users are supposed to read "time" instead of "cycle" if
> > > > > > > > they intend to read timestamps.
> > > > > > >
> > > > > > > If you are trying to measure the performance of short code
> > > > > > > fragments then you need pretty much raw access directly to
> > > > > > > the cycle/clock count register.
> > > > > > >
> > > > > > > I've done this on x86 to compare the actual cycle times
> > > > > > > of different implementations of the IP checksum loop
> > > > > > > (and compare them to the theoretical limit).
> > > > > > > The perf framework just added far too much latency,
> > > > > > > only directly reading the cpu registers gave anything
> > > > > > > like reliable (and consistent) answers.
> > > > > > >
> > > > > >
> > > > > > This series allows direct access to the counters once configured
> > > > > > through the perf.
> > > > > > Earlier the cycle/instret counters are directly exposed to the
> > > > > > userspace without kernel/perf frameworking knowing
> > > > > > when/which user space application is reading it. That has security implications.
> > > > > >
> > > > > > With this series applied, the user space application just needs to
> > > > > > configure the event (cycle/instret) through perf syscall.
> > > > > > Once configured, the userspace application can find out the counter
> > > > > > information from the mmap & directly
> > > > > > read the counter. There is no latency while reading the counters.
> > > > > >
> > > > > > This mechanism allows stop/clear the counters when the requesting task
> > > > > > is not running. It also takes care of context switching
> > > > > > which may result in invalid values as you mentioned below. This is
> > > > > > nothing new and all other arch (x86, ARM64) allow user space
> > > > > > counter read through the same mechanism.
> > > > > >
> > > > > > Here is the relevant upstream discussion:
> > > > > > https://lore.kernel.org/lkml/Y7wLa7I2hlz3rKw%2F@hirez.programming.kicks-ass.net/T/
> > > > > >
> > > > > > ARM64:
> > > > > > https://docs.kernel.org/arm64/perf.html?highlight=perf_user_access#perf-userspace-pmu-hardware-counter-access
> > > > > >
> > > > > > example usage in x86:
> > > > > > https://github.com/andikleen/pmu-tools/blob/master/jevents/rdpmc.c
> > > > >
> > > > > The canonical implementation of this should be:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/mmap.c#n400
> > > >
> > > > Thanks for sharing the libperf implementation.
> > > >
> > > > > which is updated in these patches but the tests are not:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/tests/mmap-basic.c#n287
> > > > > Which appears to be an oversight. The tests display some differences
> > > >
> > > > Yes. It's an oversight. We should make sure that perf mmap tests pass
> > > > for RISC-V as well.
> > >
> > > Yes, that's an oversight, I had a local test adapted from this one but
> > > forgot to update it afterwards, I'll do that in the next version.
> > >
> > > Thanks for your quick feedbacks and sorry for being late,
> > >
> > > Alex
> >
> > Thanks Alex, there was an equally likely chance that I wasn't
> > understanding things :-) Is there any information on RISC-V PMU
> > testing? I know ParanLee is interested. It'd be awesome to have
>
> Are you looking for something specific to RISC-V in general or to perf
> on RISC-V?
> All the RISC-V PMU patches have been upstream for a while (in both
> QEMU and the Linux kernel).
> Perf should work out of the box when you boot the latest kernel in the
> latest version of QEMU.
>
> Initial KVM support [1] was merged during the last merge window. It
> doesn't support event sampling yet; we are working on that.
>
> [1] https://lore.kernel.org/lkml/20230207095529.1787260-1-atishp@rivosinc.com/

Cool, it'd be nice to have a recipe for this from x86 Linux :-)

> > something say on:
> > https://perf.wiki.kernel.org/index.php/Main_Page
> > on how to run tests, perhaps on QEMU or known to work boards. We can
> > also just drop a link on there if there is information. We can also
> > add the RISC-V PMU information to the links here:
> > https://perf.wiki.kernel.org/index.php/Useful_Links
> >
>
> I did not see any arch-specific information there. Let us know what
> would be good to add, and we would be happy to add it.

I was specifically thinking of the Manuals entry, where the Intel, AMD
and ARM manuals are listed; links to the RISC-V documentation could be
added there.

Thanks,
Ian

> > Thanks,
> > Ian
> >
> > >
> > > >
> > > >
> > > > > between x86 and aarch64 that have assumed userspace hardware counter
> > > > > access, and everything else that it is assumed don't.
> > > > >
> > > > > Thanks,
> > > > > Ian
> > > > >
> > > > > > > Clearly process switches (especially cpu migrations) cause
> > > > > > > problems, but they are obviously invalid values and can
> > > > > > > be ignored.
> > > > > > >
> > > > > > > So while a lot of uses may be 'happy' with the values the
> > > > > > > perf framework gives, sometimes you do need to directly
> > > > > > > read the relevant registers.
> > > > > > >
> > > > > > >         David
> > > > > > >
> > > > > > > -
> > > > > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > > > > > > Registration No: 1397386 (Wales)
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Atish
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Atish
>
>
>
> --
> Regards,
> Atish


end of thread (last message: ~2023-05-09 13:40 UTC)

Thread overview: 26+ messages
2023-04-13 16:17 [PATCH 0/4] riscv: Allow userspace to directly access perf counters Alexandre Ghiti
2023-04-13 16:17 ` [PATCH 1/4] perf: Fix wrong comment about default event_idx Alexandre Ghiti
2023-04-13 16:17 ` [PATCH 2/4] include: riscv: Fix wrong include guard in riscv_pmu.h Alexandre Ghiti
2023-04-18 18:26   ` Conor Dooley
2023-04-13 16:17 ` [PATCH 3/4] riscv: Make legacy counter enum match the HW numbering Alexandre Ghiti
2023-04-13 16:17 ` [PATCH 4/4] riscv: Enable perf counters user access only through perf Alexandre Ghiti
2023-04-13 21:20   ` kernel test robot
2023-04-14  2:09   ` kernel test robot
2023-04-26 12:57   ` Andrew Jones
2023-04-26 13:17     ` Alexandre Ghiti
2023-04-26 13:25       ` Andrew Jones
2023-04-29  6:19         ` Atish Patra
2023-04-29  6:50           ` Atish Patra
2023-05-09 12:24       ` Emil Renner Berthing
2023-05-09 13:40         ` Alexandre Ghiti
2023-05-01  2:09   ` Bagas Sanjaya
2023-04-13 16:36 ` [PATCH 0/4] riscv: Allow userspace to directly access perf counters Ian Rogers
2023-04-13 19:17 ` Atish Patra
2023-04-13 21:10   ` David Laight
2023-04-18 16:43     ` Atish Patra
2023-04-18 18:15       ` Ian Rogers
2023-04-18 20:30         ` Atish Patra
2023-04-19  9:21           ` Alexandre Ghiti
2023-04-19 17:42             ` Ian Rogers
2023-04-19 23:21               ` Atish Patra
2023-04-20  0:31                 ` Ian Rogers
