public inbox for linux-perf-users@vger.kernel.org
* [PATCH v3 0/3] Enable perf tracing for unprivileged users
@ 2026-04-23 15:17 Anubhav Shelat
  2026-04-23 15:17 ` [PATCH v3 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints Anubhav Shelat
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Anubhav Shelat @ 2026-04-23 15:17 UTC
  To: peterz, mingo, mhiramat, rostedt, acme, namhyung
  Cc: mathieu.desnoyers, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, james.clark, linux-kernel,
	linux-trace-kernel, linux-perf-users, Anubhav Shelat

Allow unprivileged users to run perf trace on their own processes, much
like strace but without the overhead of ptrace(). Users still cannot
access other users' or system-wide tracing data.

Changes in v3:
- Don't set PERF_SAMPLE_IP for unprivileged tracepoints. This allows us
  to exclude PERF_SAMPLE_IP from kaddr_leak without weakening KASLR.
- Mount tracefs as world-traversable so users can access eventfs
  directories.

v2: https://lore.kernel.org/lkml/20260410133529.21947-1-ashelat@redhat.com/

Anubhav Shelat (3):
  perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints
  perf: enable unprivileged syscall tracing with perf trace
  tracefs: make root directory world-traversable

 fs/tracefs/inode.c              |  2 +-
 kernel/events/core.c            | 23 ++++++++++++++++++++---
 kernel/trace/trace_event_perf.c | 12 +++++++++++-
 kernel/trace/trace_events.c     |  8 ++++++--
 tools/perf/util/evsel.c         |  4 +++-
 5 files changed, 41 insertions(+), 8 deletions(-)

-- 
2.53.0



* [PATCH v3 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints
  2026-04-23 15:17 [PATCH v3 0/3] Enable perf tracing for unprivileged users Anubhav Shelat
@ 2026-04-23 15:17 ` Anubhav Shelat
  2026-04-23 22:14   ` sashiko-bot
  2026-04-23 15:17 ` [PATCH v3 2/3] perf: enable unprivileged syscall tracing with perf trace Anubhav Shelat
  2026-04-23 15:17 ` [PATCH v3 3/3] tracefs: make root directory world-traversable Anubhav Shelat
  2 siblings, 1 reply; 6+ messages in thread
From: Anubhav Shelat @ 2026-04-23 15:17 UTC
  To: peterz, mingo, mhiramat, rostedt, acme, namhyung
  Cc: mathieu.desnoyers, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, james.clark, linux-kernel,
	linux-trace-kernel, linux-perf-users, Anubhav Shelat

For tracepoint events, the sampled IP is a static kernel address: it
does not vary between samples and gives unprivileged users no useful
information. Not setting PERF_SAMPLE_IP for unprivileged tracepoints
avoids exposing a kernel address that reveals the KASLR base offset,
and slightly reduces the sample record size.

Assisted-by: Claude:claude-sonnet-4.5
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 tools/perf/util/evsel.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index f59228c1a39e..a1091d937ff9 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1503,7 +1503,9 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 	attr->write_backward = opts->overwrite ? 1 : 0;
 	attr->read_format   = PERF_FORMAT_LOST;
 
-	evsel__set_sample_bit(evsel, IP);
+	if (attr->type != PERF_TYPE_TRACEPOINT || perf_event_paranoid_check(1))
+		evsel__set_sample_bit(evsel, IP);
+
 	evsel__set_sample_bit(evsel, TID);
 
 	if (evsel->sample_read) {
-- 
2.53.0



* [PATCH v3 2/3] perf: enable unprivileged syscall tracing with perf trace
  2026-04-23 15:17 [PATCH v3 0/3] Enable perf tracing for unprivileged users Anubhav Shelat
  2026-04-23 15:17 ` [PATCH v3 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints Anubhav Shelat
@ 2026-04-23 15:17 ` Anubhav Shelat
  2026-04-23 23:10   ` sashiko-bot
  2026-04-23 15:17 ` [PATCH v3 3/3] tracefs: make root directory world-traversable Anubhav Shelat
  2 siblings, 1 reply; 6+ messages in thread
From: Anubhav Shelat @ 2026-04-23 15:17 UTC
  To: peterz, mingo, mhiramat, rostedt, acme, namhyung
  Cc: mathieu.desnoyers, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, james.clark, linux-kernel,
	linux-trace-kernel, linux-perf-users, Anubhav Shelat

Allow unprivileged users to trace their own processes' syscalls using
perf trace, similar to strace without the intrusive overhead of ptrace().

Currently, perf trace requires CAP_PERFMON or perf_event_paranoid <= 1,
even though the kernel already has infrastructure
(TRACE_EVENT_FL_CAP_ANY) designed specifically to mark syscall
tracepoints as safe for unprivileged access. To fix this:

1. Loosen the condition in perf_event_open() which requires privileges
for all events with exclude_kernel=0. This allows perf_event_open() to
bypass the paranoid check for task-attached tracepoint events. Ensure
that sample types which can expose kernel addresses to unprivileged
users are blocked.

2. Make the format and id tracefs files world-readable only for tracepoints
with TRACE_EVENT_FL_CAP_ANY, allowing unprivileged users to see syscall
tracepoint ids without exposing sensitive information.

Also add a check to perf_trace_event_perm() to ensure only
TRACE_EVENT_FL_CAP_ANY events can be traced.

Example usage after this change:
  $ perf trace ls          # works as unprivileged user
  $ perf trace             # system-wide, still requires privileges
  $ perf trace -p 1234     # requires ptrace permission on pid 1234

Assisted-by: Claude:claude-sonnet-4.5
Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 kernel/events/core.c            | 24 +++++++++++++++++++++---
 kernel/trace/trace_event_perf.c | 12 +++++++++++-
 kernel/trace/trace_events.c     |  8 ++++++--
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d1f8bad7e1c..e9c53758574d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13833,9 +13833,27 @@ SYSCALL_DEFINE5(perf_event_open,
 		return err;
 
 	if (!attr.exclude_kernel) {
-		err = perf_allow_kernel();
-		if (err)
-			return err;
+		bool tp_bypass = false;
+
+		if (attr.type == PERF_TYPE_TRACEPOINT && pid != -1) {
+			/*
+			 * Block sample types that expose kernel addresses to
+			 * prevent KASLR bypass
+			 */
+			u64 kaddr_leak = PERF_SAMPLE_CALLCHAIN |
+					 PERF_SAMPLE_BRANCH_STACK |
+					 PERF_SAMPLE_ADDR |
+					 PERF_SAMPLE_REGS_INTR |
+					 PERF_SAMPLE_IP;
+
+			tp_bypass = !(attr.sample_type & kaddr_leak);
+		}
+
+		if (!tp_bypass) {
+			err = perf_allow_kernel();
+			if (err)
+				return err;
+		}
 	}
 
 	if (attr.namespaces) {
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index a6bb7577e8c5..e8347df7ede5 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -73,8 +73,18 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
 	}
 
 	/* No tracing, just counting, so no obvious leak */
-	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW))
+	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW)) {
+		/*
+		 * Only allow CAP_ANY tracepoints for unprivileged
+		 * task-attached events in case kernel context is exposed.
+		 */
+		if (!p_event->attr.exclude_kernel && !perfmon_capable()) {
+			if (!(p_event->attach_state == PERF_ATTACH_TASK &&
+			      (tp_event->flags & TRACE_EVENT_FL_CAP_ANY)))
+				return -EACCES;
+		}
 		return 0;
+	}
 
 	/* Some events are ok to be traced by non-root users... */
 	if (p_event->attach_state == PERF_ATTACH_TASK) {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index aa422dc80ae8..69be5561d0b8 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -3054,7 +3054,9 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	struct trace_event_call *call = file->event_call;
 
 	if (strcmp(name, "format") == 0) {
-		*mode = TRACE_MODE_READ;
+		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+			(TRACE_MODE_READ | 0004) :
+			TRACE_MODE_READ;
 		*fops = &ftrace_event_format_fops;
 		return 1;
 	}
@@ -3090,7 +3092,9 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 #ifdef CONFIG_PERF_EVENTS
 	if (call->event.type && call->class->reg &&
 	    strcmp(name, "id") == 0) {
-		*mode = TRACE_MODE_READ;
+		*mode = (call->flags & TRACE_EVENT_FL_CAP_ANY) ?
+			(TRACE_MODE_READ | 0004) :
+			TRACE_MODE_READ;
 		*data = (void *)(long)call->event.type;
 		*fops = &ftrace_event_id_fops;
 		return 1;
-- 
2.53.0



* [PATCH v3 3/3] tracefs: make root directory world-traversable
  2026-04-23 15:17 [PATCH v3 0/3] Enable perf tracing for unprivileged users Anubhav Shelat
  2026-04-23 15:17 ` [PATCH v3 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints Anubhav Shelat
  2026-04-23 15:17 ` [PATCH v3 2/3] perf: enable unprivileged syscall tracing with perf trace Anubhav Shelat
@ 2026-04-23 15:17 ` Anubhav Shelat
  2 siblings, 0 replies; 6+ messages in thread
From: Anubhav Shelat @ 2026-04-23 15:17 UTC
  To: peterz, mingo, mhiramat, rostedt, acme, namhyung
  Cc: mathieu.desnoyers, mark.rutland, alexander.shishkin, jolsa,
	irogers, adrian.hunter, james.clark, linux-kernel,
	linux-trace-kernel, linux-perf-users, Anubhav Shelat

Change the default tracefs mount mode from 0700 to 0755. This allows
unprivileged users to traverse into the eventfs directories underneath,
which already use 0755.

This does not expose any tracing data since access to the files
themselves is controlled by individual permissions.

Signed-off-by: Anubhav Shelat <ashelat@redhat.com>
---
 fs/tracefs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 03f768536fd5..9506450fbc91 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -23,7 +23,7 @@
 #include <linux/slab.h>
 #include "internal.h"
 
-#define TRACEFS_DEFAULT_MODE	0700
+#define TRACEFS_DEFAULT_MODE	0755
 static struct kmem_cache *tracefs_inode_cachep __ro_after_init;
 
 static struct vfsmount *tracefs_mount;
-- 
2.53.0



* Re: [PATCH v3 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints
  2026-04-23 15:17 ` [PATCH v3 1/3] perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints Anubhav Shelat
@ 2026-04-23 22:14   ` sashiko-bot
  0 siblings, 0 replies; 6+ messages in thread
From: sashiko-bot @ 2026-04-23 22:14 UTC
  To: Anubhav Shelat; +Cc: linux-perf-users

Sashiko AI review found 1 potential issue(s):
- [High] Unconditionally dropping PERF_SAMPLE_IP for unprivileged tracepoints breaks user-space dynamic tracing (uprobes and USDTs).
--

commit 06030eb0792514412c01e8660eaafa6292e09335
Author: Anubhav Shelat <ashelat@redhat.com>

perf evsel: don't set PERF_SAMPLE_IP for unprivileged tracepoints

This commit conditionally drops the PERF_SAMPLE_IP bit for tracepoint events
configured by unprivileged users. It aims to prevent exposing kernel addresses
that might reveal the KASLR base offset.

> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index f59228c1a39eb..a1091d937ff92 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
[ ... ]
> @@ -1503,7 +1503,9 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>  	attr->write_backward = opts->overwrite ? 1 : 0;
>  	attr->read_format   = PERF_FORMAT_LOST;
>  
> -	evsel__set_sample_bit(evsel, IP);
> +	if (attr->type != PERF_TYPE_TRACEPOINT || perf_event_paranoid_check(1))
> +		evsel__set_sample_bit(evsel, IP);
> +

Uprobes and USDTs are also registered as PERF_TYPE_TRACEPOINT events, but their
instruction pointer is a safe user-space address indicating where the breakpoint
fired, rather than a kernel address that could leak KASLR.

By unconditionally skipping the IP bit for all tracepoints when executed by
an unprivileged user, downstream tools might fail to resolve symbols
or attribute hits to specific instructions for user-space events.

Does this inadvertently break user-space dynamic tracing?

>  	evsel__set_sample_bit(evsel, TID);
>  
>  	if (evsel->sample_read) {

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260423151746.16258-1-ashelat@redhat.com?part=1


* Re: [PATCH v3 2/3] perf: enable unprivileged syscall tracing with perf trace
  2026-04-23 15:17 ` [PATCH v3 2/3] perf: enable unprivileged syscall tracing with perf trace Anubhav Shelat
@ 2026-04-23 23:10   ` sashiko-bot
  0 siblings, 0 replies; 6+ messages in thread
From: sashiko-bot @ 2026-04-23 23:10 UTC
  To: Anubhav Shelat; +Cc: linux-perf-users

Sashiko AI review found 1 potential issue(s):
- [High] Unconditional rejection of unprivileged tracepoint counting breaks perf_event_paranoid functionality and creates an inverted security policy.
--

commit 741720c90ed77115c5603e05c7d2cdfc3f218082
Author: Anubhav Shelat <ashelat@redhat.com>

perf: enable unprivileged syscall tracing with perf trace

This commit allows unprivileged users to trace their own processes' syscalls
using perf trace without requiring CAP_PERFMON. It loosens the condition in
perf_event_open() for task-attached events that do not expose kernel addresses
and makes format and id files readable for TRACE_EVENT_FL_CAP_ANY tracepoints.

> diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> index a6bb7577e8c596..e8347df7ede5c2 100644
> --- a/kernel/trace/trace_event_perf.c
> +++ b/kernel/trace/trace_event_perf.c
> @@ -73,8 +73,18 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
>  	}
>  
>  	/* No tracing, just counting, so no obvious leak */
> -	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW))
> +	if (!(p_event->attr.sample_type & PERF_SAMPLE_RAW)) {
> +		/*
> +		 * Only allow CAP_ANY tracepoints for unprivileged
> +		 * task-attached events in case kernel context is exposed.
> +		 */
> +		if (!p_event->attr.exclude_kernel && !perfmon_capable()) {
> +			if (!(p_event->attach_state == PERF_ATTACH_TASK &&
> +			      (tp_event->flags & TRACE_EVENT_FL_CAP_ANY)))
> +				return -EACCES;
> +		}
>  		return 0;
> +	}

Does this new check unconditionally block unprivileged tracepoint counting
even when sysctl_perf_event_paranoid is set to allow it?

Before this patch, if sysctl_perf_event_paranoid <= 1, unprivileged users
could count any tracepoint. It appears this check now returns -EACCES for
counting events without exclude_kernel, breaking tools like perf stat for
unprivileged users.

Additionally, if sysctl_perf_event_paranoid == -1, doesn't this create an
inverted security policy?

An unprivileged user requesting simple counting would be rejected here,
but if they request full raw event data via PERF_SAMPLE_RAW, execution
bypasses this block entirely and reaches perf_allow_tracepoint(), which
would succeed.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260423151746.16258-1-ashelat@redhat.com?part=2
