linux-kernel.vger.kernel.org archive mirror
* [patch 0/6] perf: Convert mmap() related reference counts to refcount_t
@ 2025-08-06 20:12 Thomas Gleixner
  2025-08-06 20:12 ` [patch 1/6] perf/core: Remove redundant condition for AUX buffer size Thomas Gleixner
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:12 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

The recently fixed reference count leaks could have been detected by using
refcount_t, and refcount_t would at least have mitigated the potential
overflow.

It turned out that converting the code as-is does not work: the allocation
code ends up doing a refcount_inc() for the first allocation, which causes
the refcount_t sanity checks to emit a use-after-free warning.
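
For illustration, a minimal sketch (not part of the series) of why a plain
conversion trips the sanity checks:

	#include <linux/refcount.h>

	/* Minimal sketch, not part of the series: the refcount_t sanity
	 * rules which break a one-to-one atomic_t conversion.
	 */
	static void refcount_conversion_sketch(void)
	{
		refcount_t r = REFCOUNT_INIT(0);

		/* atomic_inc() silently goes 0 -> 1; refcount_inc() treats
		 * an increment from zero as a potential use-after-free and
		 * WARNs instead. It also saturates instead of wrapping on
		 * overflow, which is the mitigation referred to above.
		 */
		refcount_inc(&r);

		/* The first reference has to be published explicitly: */
		refcount_set(&r, 1);
		refcount_inc(&r);	/* 1 -> 2, fine */
	}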

The reason is that the code shares functionality at the wrong level and
ends up being overly complicated for no reason. That's what inevitably led
to the refcount leak problems.

Address this by splitting the ringbuffer and the AUX buffer mapping and
allocation parts out into separate functions, which handle the reference
counts in a sane way.

That not only simplifies the code and makes it halfway comprehensible, but
also allows converting the mmap() related reference counts to refcount_t.

It survives lightweight testing with perf and passes the perf/mmap
selftest.

The series applies on top of Linus' tree and is also available from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git perf/refcounts

Thanks,

	tglx
---
 include/linux/perf_event.h  |    2 
 kernel/events/core.c        |  361 ++++++++++++++++++++++----------------------
 kernel/events/internal.h    |    4 
 kernel/events/ring_buffer.c |    2 
 4 files changed, 185 insertions(+), 184 deletions(-)


* [patch 1/6] perf/core: Remove redundant condition for AUX buffer size
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
@ 2025-08-06 20:12 ` Thomas Gleixner
  2025-08-07 13:30   ` Lorenzo Stoakes
  2025-08-06 20:12 ` [patch 2/6] perf/core: Split out mlock limit handling Thomas Gleixner
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:12 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

It is already checked whether the VMA size is the same as
nr_pages * PAGE_SIZE, so later checking both:

      aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE

is redundant. Remove the vma_size check as nr_pages is what is actually
used in the allocation function. That prepares for splitting out the buffer
allocation into separate functions, so that only nr_pages needs to be
handed in.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7043,7 +7043,7 @@ static int perf_mmap(struct file *file,
 		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
 			goto aux_unlock;
 
-		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
+		if (aux_size != nr_pages * PAGE_SIZE)
 			goto aux_unlock;
 
 		/* already mapped with a different size */



* [patch 2/6] perf/core: Split out mlock limit handling
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
  2025-08-06 20:12 ` [patch 1/6] perf/core: Remove redundant condition for AUX buffer size Thomas Gleixner
@ 2025-08-06 20:12 ` Thomas Gleixner
  2025-08-07 14:14   ` Lorenzo Stoakes
  2025-08-06 20:12 ` [patch 3/6] perf/core: Split out VM accounting Thomas Gleixner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:12 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

To prepare for splitting the buffer allocation out into separate functions
for the ring buffer and the AUX buffer, split out mlock limit handling into
a helper function, which can be called from both.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |   75 +++++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 37 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6927,17 +6927,49 @@ static int map_range(struct perf_buffer
 	return err;
 }
 
+static bool perf_mmap_calc_limits(struct vm_area_struct *vma, long *user_extra, long *extra)
+{
+	unsigned long user_locked, user_lock_limit, locked, lock_limit;
+	struct user_struct *user = current_user();
+
+	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
+	/* Increase the limit linearly with more CPUs */
+	user_lock_limit *= num_online_cpus();
+
+	user_locked = atomic_long_read(&user->locked_vm);
+
+	/*
+	 * sysctl_perf_event_mlock may have changed, so that
+	 *     user->locked_vm > user_lock_limit
+	 */
+	if (user_locked > user_lock_limit)
+		user_locked = user_lock_limit;
+	user_locked += *user_extra;
+
+	if (user_locked > user_lock_limit) {
+		/*
+		 * charge locked_vm until it hits user_lock_limit;
+		 * charge the rest from pinned_vm
+		 */
+		*extra = user_locked - user_lock_limit;
+		*user_extra -= *extra;
+	}
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK);
+	lock_limit >>= PAGE_SHIFT;
+	locked = atomic64_read(&vma->vm_mm->pinned_vm) + *extra;
+
+	return locked <= lock_limit || !perf_is_paranoid() || capable(CAP_IPC_LOCK);
+}
+
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
-	unsigned long user_locked, user_lock_limit;
 	struct user_struct *user = current_user();
+	unsigned long vma_size, nr_pages;
+	long user_extra = 0, extra = 0;
 	struct mutex *aux_mutex = NULL;
 	struct perf_buffer *rb = NULL;
-	unsigned long locked, lock_limit;
-	unsigned long vma_size;
-	unsigned long nr_pages;
-	long user_extra = 0, extra = 0;
 	int ret, flags = 0;
 	mapped_f mapped;
 
@@ -7063,38 +7095,7 @@ static int perf_mmap(struct file *file,
 		}
 	}
 
-	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
-
-	/*
-	 * Increase the limit linearly with more CPUs:
-	 */
-	user_lock_limit *= num_online_cpus();
-
-	user_locked = atomic_long_read(&user->locked_vm);
-
-	/*
-	 * sysctl_perf_event_mlock may have changed, so that
-	 *     user->locked_vm > user_lock_limit
-	 */
-	if (user_locked > user_lock_limit)
-		user_locked = user_lock_limit;
-	user_locked += user_extra;
-
-	if (user_locked > user_lock_limit) {
-		/*
-		 * charge locked_vm until it hits user_lock_limit;
-		 * charge the rest from pinned_vm
-		 */
-		extra = user_locked - user_lock_limit;
-		user_extra -= extra;
-	}
-
-	lock_limit = rlimit(RLIMIT_MEMLOCK);
-	lock_limit >>= PAGE_SHIFT;
-	locked = atomic64_read(&vma->vm_mm->pinned_vm) + extra;
-
-	if ((locked > lock_limit) && perf_is_paranoid() &&
-		!capable(CAP_IPC_LOCK)) {
+	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
 		ret = -EPERM;
 		goto unlock;
 	}



* [patch 3/6] perf/core: Split out VM accounting
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
  2025-08-06 20:12 ` [patch 1/6] perf/core: Remove redundant condition for AUX buffer size Thomas Gleixner
  2025-08-06 20:12 ` [patch 2/6] perf/core: Split out mlock limit handling Thomas Gleixner
@ 2025-08-06 20:12 ` Thomas Gleixner
  2025-08-07 14:25   ` Lorenzo Stoakes
  2025-08-06 20:12 ` [patch 4/6] perf/core: Split out ringbuffer allocation Thomas Gleixner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:12 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

Similarly to the mlock limit calculation, the VM accounting is required for
both the ringbuffer and the AUX buffer allocations.

To prepare for splitting them out into separate functions, move the
accounting into a helper function.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6962,10 +6962,17 @@ static bool perf_mmap_calc_limits(struct
 	return locked <= lock_limit || !perf_is_paranoid() || capable(CAP_IPC_LOCK);
 }
 
+static void perf_mmap_account(struct vm_area_struct *vma, long user_extra, long extra)
+{
+	struct user_struct *user = current_user();
+
+	atomic_long_add(user_extra, &user->locked_vm);
+	atomic64_add(extra, &vma->vm_mm->pinned_vm);
+}
+
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
-	struct user_struct *user = current_user();
 	unsigned long vma_size, nr_pages;
 	long user_extra = 0, extra = 0;
 	struct mutex *aux_mutex = NULL;
@@ -7136,9 +7143,7 @@ static int perf_mmap(struct file *file,
 
 unlock:
 	if (!ret) {
-		atomic_long_add(user_extra, &user->locked_vm);
-		atomic64_add(extra, &vma->vm_mm->pinned_vm);
-
+		perf_mmap_account(vma, user_extra, extra);
 		atomic_inc(&event->mmap_count);
 	} else if (rb) {
 		/* AUX allocation failed */



* [patch 4/6] perf/core: Split out ringbuffer allocation
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
                   ` (2 preceding siblings ...)
  2025-08-06 20:12 ` [patch 3/6] perf/core: Split out VM accounting Thomas Gleixner
@ 2025-08-06 20:12 ` Thomas Gleixner
  2025-08-07 15:38   ` Lorenzo Stoakes
  2025-08-06 20:13 ` [patch 5/6] perf/core: Split the ringbuffer mmap() and allocation code out Thomas Gleixner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:12 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

The code logic in perf_mmap() is incomprehensible and has been a source of
subtle bugs in the past. It makes it impossible to convert the atomic_t
reference counts to refcount_t.

There is not really much which is shared between the ringbuffer and AUX
buffer allocation code since the mlock limit calculation and the
accounting have been split out into helper functions.

Move the AUX buffer allocation code out and integrate the call with a
temporary workaround to allow skipping the remaining ringbuffer related
code completely. That workaround will be removed once the ringbuffer
allocation is moved to its own function as well.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |  134 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 75 insertions(+), 59 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6970,12 +6970,76 @@ static void perf_mmap_account(struct vm_
 	atomic64_add(extra, &vma->vm_mm->pinned_vm);
 }
 
+static int perf_mmap_aux(struct vm_area_struct *vma, struct perf_event *event,
+			 unsigned long nr_pages)
+{
+	long user_extra = nr_pages, extra = 0;
+	struct perf_buffer *rb = event->rb;
+	u64 aux_offset, aux_size;
+	int ret, rb_flags = 0;
+
+	/*
+	 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
+	 * mapped, all subsequent mappings should have the same size
+	 * and offset. Must be above the normal perf buffer.
+	 */
+	aux_offset = READ_ONCE(rb->user_page->aux_offset);
+	aux_size = READ_ONCE(rb->user_page->aux_size);
+
+	if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
+		return -EINVAL;
+
+	if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
+		return -EINVAL;
+
+	/* Already mapped with a different offset */
+	if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
+		return -EINVAL;
+
+	if (aux_size != nr_pages * PAGE_SIZE)
+		return -EINVAL;
+
+	/* Already mapped with a different size */
+	if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
+		return -EINVAL;
+
+	if (!is_power_of_2(nr_pages))
+		return -EINVAL;
+
+	/* If this succeeds, subsequent failures have to undo it */
+	if (!atomic_inc_not_zero(&rb->mmap_count))
+		return -EINVAL;
+
+	/* If mapped, attach to it */
+	if (rb_has_aux(rb)) {
+		atomic_inc(&rb->aux_mmap_count);
+		return 0;
+	}
+
+	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
+		atomic_dec(&rb->mmap_count);
+		return -EPERM;
+	}
+
+	ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
+			   event->attr.aux_watermark, rb_flags);
+	if (ret) {
+		atomic_dec(&rb->mmap_count);
+		return ret;
+	}
+
+	atomic_set(&rb->aux_mmap_count, 1);
+	rb->aux_mmap_locked = extra;
+	perf_mmap_account(vma, user_extra, extra);
+	atomic_inc(&event->mmap_count);
+	return 0;
+}
+
 static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	struct perf_event *event = file->private_data;
 	unsigned long vma_size, nr_pages;
 	long user_extra = 0, extra = 0;
-	struct mutex *aux_mutex = NULL;
 	struct perf_buffer *rb = NULL;
 	int ret, flags = 0;
 	mapped_f mapped;
@@ -7055,51 +7119,15 @@ static int perf_mmap(struct file *file,
 		}
 
 	} else {
-		/*
-		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
-		 * mapped, all subsequent mappings should have the same size
-		 * and offset. Must be above the normal perf buffer.
-		 */
-		u64 aux_offset, aux_size;
-
-		rb = event->rb;
-		if (!rb)
-			goto aux_unlock;
-
-		aux_mutex = &rb->aux_mutex;
-		mutex_lock(aux_mutex);
-
-		aux_offset = READ_ONCE(rb->user_page->aux_offset);
-		aux_size = READ_ONCE(rb->user_page->aux_size);
-
-		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
-			goto aux_unlock;
-
-		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
-			goto aux_unlock;
-
-		/* already mapped with a different offset */
-		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
-			goto aux_unlock;
-
-		if (aux_size != nr_pages * PAGE_SIZE)
-			goto aux_unlock;
-
-		/* already mapped with a different size */
-		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
-			goto aux_unlock;
-
-		if (!is_power_of_2(nr_pages))
-			goto aux_unlock;
-
-		if (!atomic_inc_not_zero(&rb->mmap_count))
-			goto aux_unlock;
-
-		if (rb_has_aux(rb)) {
-			atomic_inc(&rb->aux_mmap_count);
-			ret = 0;
-			goto unlock;
+		if (event->rb) {
+			ret = -EINVAL;
+		} else {
+			scoped_guard(mutex, &event->rb->aux_mutex)
+				ret = perf_mmap_aux(vma, event, nr_pages);
 		}
+		// Temporary workaround to split out AUX handling first
+		mutex_unlock(&event->mmap_mutex);
+		goto out;
 	}
 
 	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
@@ -7132,28 +7160,16 @@ static int perf_mmap(struct file *file,
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 		ret = 0;
-	} else {
-		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
-				   event->attr.aux_watermark, flags);
-		if (!ret) {
-			atomic_set(&rb->aux_mmap_count, 1);
-			rb->aux_mmap_locked = extra;
-		}
 	}
-
 unlock:
 	if (!ret) {
 		perf_mmap_account(vma, user_extra, extra);
 		atomic_inc(&event->mmap_count);
-	} else if (rb) {
-		/* AUX allocation failed */
-		atomic_dec(&rb->mmap_count);
 	}
-aux_unlock:
-	if (aux_mutex)
-		mutex_unlock(aux_mutex);
 	mutex_unlock(&event->mmap_mutex);
 
+// Temporary until RB allocation is split out.
+out:
 	if (ret)
 		return ret;
 



* [patch 5/6] perf/core: Split the ringbuffer mmap() and allocation code out
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
                   ` (3 preceding siblings ...)
  2025-08-06 20:12 ` [patch 4/6] perf/core: Split out ringbuffer allocation Thomas Gleixner
@ 2025-08-06 20:13 ` Thomas Gleixner
  2025-08-06 20:13 ` [patch 6/6] perf/core: Convert mmap() refcounts to refcount_t Thomas Gleixner
  2025-08-07 15:39 ` [patch 0/6] perf: Convert mmap() related reference counts " Lorenzo Stoakes
  6 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:13 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

The code logic in perf_mmap() is incomprehensible and has been a source of
subtle bugs in the past. It makes it impossible to convert the atomic_t
reference counts to refcount_t.

Now that the AUX buffer mapping and allocation code is in its own function,
apply the same treatment to the ringbuffer part and remove the temporary
workarounds created by the AUX split-out.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c |  175 ++++++++++++++++++++++-----------------------------
 1 file changed, 77 insertions(+), 98 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6970,6 +6970,69 @@ static void perf_mmap_account(struct vm_
 	atomic64_add(extra, &vma->vm_mm->pinned_vm);
 }
 
+static int perf_mmap_rb(struct vm_area_struct *vma, struct perf_event *event,
+			unsigned long nr_pages)
+{
+	long user_extra = nr_pages, extra = 0;
+	struct perf_buffer *rb = event->rb;
+	int rb_flags = 0;
+
+	/*
+	 * If we have rb pages ensure they're a power-of-two number, so we
+	 * can do bitmasks instead of modulo.
+	 */
+	if (--nr_pages != 0 && !is_power_of_2(nr_pages))
+		return -EINVAL;
+
+	WARN_ON_ONCE(event->ctx->parent_ctx);
+
+	if (rb) {
+		if (data_page_nr(rb) != nr_pages)
+			return -EINVAL;
+
+		if (atomic_inc_not_zero(&event->rb->mmap_count)) {
+			/*
+			 * Success -- managed to mmap() the same buffer
+			 * multiple times.
+			 */
+			atomic_inc(&event->mmap_count);
+			return 0;
+		}
+		/*
+		 * Raced against perf_mmap_close()'s
+		 * atomic_dec_and_mutex_lock() remove the event and
+		 * continue as if !event->rb
+		 */
+		ring_buffer_attach(event, NULL);
+	}
+
+	if (!perf_mmap_calc_limits(vma, &user_extra, &extra))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_WRITE)
+		rb_flags |= RING_BUFFER_WRITABLE;
+
+	rb = rb_alloc(nr_pages, event->attr.watermark ? event->attr.wakeup_watermark : 0,
+		      event->cpu, rb_flags);
+
+	if (!rb)
+		return -ENOMEM;
+
+	atomic_set(&rb->mmap_count, 1);
+	rb->mmap_user = get_current_user();
+	rb->mmap_locked = extra;
+
+	ring_buffer_attach(event, rb);
+
+	perf_event_update_time(event);
+	perf_event_init_userpage(event);
+	perf_event_update_userpage(event);
+
+	perf_mmap_account(vma, user_extra, extra);
+	atomic_set(&event->mmap_count, 1);
+	return 0;
+}
+
 static int perf_mmap_aux(struct vm_area_struct *vma, struct perf_event *event,
 			 unsigned long nr_pages)
 {
@@ -7039,10 +7102,8 @@ static int perf_mmap(struct file *file,
 {
 	struct perf_event *event = file->private_data;
 	unsigned long vma_size, nr_pages;
-	long user_extra = 0, extra = 0;
-	struct perf_buffer *rb = NULL;
-	int ret, flags = 0;
 	mapped_f mapped;
+	int ret;
 
 	/*
 	 * Don't allow mmap() of inherited per-task counters. This would
@@ -7068,114 +7129,32 @@ static int perf_mmap(struct file *file,
 	if (vma_size != PAGE_SIZE * nr_pages)
 		return -EINVAL;
 
-	user_extra = nr_pages;
-
-	mutex_lock(&event->mmap_mutex);
-	ret = -EINVAL;
-
-	/*
-	 * This relies on __pmu_detach_event() taking mmap_mutex after marking
-	 * the event REVOKED. Either we observe the state, or __pmu_detach_event()
-	 * will detach the rb created here.
-	 */
-	if (event->state <= PERF_EVENT_STATE_REVOKED) {
-		ret = -ENODEV;
-		goto unlock;
-	}
-
-	if (vma->vm_pgoff == 0) {
-		nr_pages -= 1;
-
+	scoped_guard(mutex, &event->mmap_mutex) {
 		/*
-		 * If we have rb pages ensure they're a power-of-two number, so we
-		 * can do bitmasks instead of modulo.
+		 * This relies on __pmu_detach_event() taking mmap_mutex
+		 * after marking the event REVOKED. Either we observe the
+		 * state, or __pmu_detach_event() will detach the rb
+		 * created here.
 		 */
-		if (nr_pages != 0 && !is_power_of_2(nr_pages))
-			goto unlock;
-
-		WARN_ON_ONCE(event->ctx->parent_ctx);
-
-		if (event->rb) {
-			if (data_page_nr(event->rb) != nr_pages)
-				goto unlock;
-
-			if (atomic_inc_not_zero(&event->rb->mmap_count)) {
-				/*
-				 * Success -- managed to mmap() the same buffer
-				 * multiple times.
-				 */
-				ret = 0;
-				/* We need the rb to map pages. */
-				rb = event->rb;
-				goto unlock;
-			}
-
-			/*
-			 * Raced against perf_mmap_close()'s
-			 * atomic_dec_and_mutex_lock() remove the
-			 * event and continue as if !event->rb
-			 */
-			ring_buffer_attach(event, NULL);
-		}
+		if (event->state <= PERF_EVENT_STATE_REVOKED)
+			return -ENODEV;
 
-	} else {
-		if (event->rb) {
-			ret = -EINVAL;
+		if (vma->vm_pgoff == 0) {
+			ret = perf_mmap_rb(vma, event, nr_pages);
 		} else {
+			if (!event->rb)
+				return -EINVAL;
 			scoped_guard(mutex, &event->rb->aux_mutex)
 				ret = perf_mmap_aux(vma, event, nr_pages);
 		}
-		// Temporary workaround to split out AUX handling first
-		mutex_unlock(&event->mmap_mutex);
-		goto out;
-	}
-
-	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
-		ret = -EPERM;
-		goto unlock;
-	}
-
-	WARN_ON(!rb && event->rb);
-
-	if (vma->vm_flags & VM_WRITE)
-		flags |= RING_BUFFER_WRITABLE;
-
-	if (!rb) {
-		rb = rb_alloc(nr_pages,
-			      event->attr.watermark ? event->attr.wakeup_watermark : 0,
-			      event->cpu, flags);
-
-		if (!rb) {
-			ret = -ENOMEM;
-			goto unlock;
-		}
-
-		atomic_set(&rb->mmap_count, 1);
-		rb->mmap_user = get_current_user();
-		rb->mmap_locked = extra;
-
-		ring_buffer_attach(event, rb);
-
-		perf_event_update_time(event);
-		perf_event_init_userpage(event);
-		perf_event_update_userpage(event);
-		ret = 0;
-	}
-unlock:
-	if (!ret) {
-		perf_mmap_account(vma, user_extra, extra);
-		atomic_inc(&event->mmap_count);
 	}
-	mutex_unlock(&event->mmap_mutex);
 
-// Temporary until RB allocation is split out.
-out:
 	if (ret)
 		return ret;
 
 	/*
 	 * Since pinned accounting is per vm we cannot allow fork() to copy our
-	 * vma.
+	 * VMA. The VMA is fixed size and must not be included in dumps.
 	 */
 	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
 	vma->vm_ops = &perf_mmap_vmops;
@@ -7190,7 +7169,7 @@ static int perf_mmap(struct file *file,
 	 * full cleanup in this case and therefore does not invoke
 	 * vmops::close().
 	 */
-	ret = map_range(rb, vma);
+	ret = map_range(event->rb, vma);
 	if (ret)
 		perf_mmap_close(vma);
 



* [patch 6/6] perf/core: Convert mmap() refcounts to refcount_t
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
                   ` (4 preceding siblings ...)
  2025-08-06 20:13 ` [patch 5/6] perf/core: Split the ringbuffer mmap() and allocation code out Thomas Gleixner
@ 2025-08-06 20:13 ` Thomas Gleixner
  2025-08-07 15:39 ` [patch 0/6] perf: Convert mmap() related reference counts " Lorenzo Stoakes
  6 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-06 20:13 UTC
  To: LKML
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Lorenzo Stoakes, Kees Cook

The recently fixed reference count leaks could have been detected by using
refcount_t, and refcount_t would at least have mitigated the potential
overflow.

Now that the code is properly structured, convert the mmap() related
mmap_count variants over to refcount_t.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/perf_event.h  |    2 +-
 kernel/events/core.c        |   40 ++++++++++++++++++++--------------------
 kernel/events/internal.h    |    4 ++--
 kernel/events/ring_buffer.c |    2 +-
 4 files changed, 24 insertions(+), 24 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -859,7 +859,7 @@ struct perf_event {
 
 	/* mmap bits */
 	struct mutex			mmap_mutex;
-	atomic_t			mmap_count;
+	refcount_t			mmap_count;
 
 	struct perf_buffer		*rb;
 	struct list_head		rb_entry;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3968,7 +3968,7 @@ static noinline int visit_groups_merge(s
  */
 static inline bool event_update_userpage(struct perf_event *event)
 {
-	if (likely(!atomic_read(&event->mmap_count)))
+	if (likely(!refcount_read(&event->mmap_count)))
 		return false;
 
 	perf_event_update_time(event);
@@ -6704,11 +6704,11 @@ static void perf_mmap_open(struct vm_are
 	struct perf_event *event = vma->vm_file->private_data;
 	mapped_f mapped = get_mapped(event, event_mapped);
 
-	atomic_inc(&event->mmap_count);
-	atomic_inc(&event->rb->mmap_count);
+	refcount_inc(&event->mmap_count);
+	refcount_inc(&event->rb->mmap_count);
 
 	if (vma->vm_pgoff)
-		atomic_inc(&event->rb->aux_mmap_count);
+		refcount_inc(&event->rb->aux_mmap_count);
 
 	if (mapped)
 		mapped(event, vma->vm_mm);
@@ -6743,7 +6743,7 @@ static void perf_mmap_close(struct vm_ar
 	 * to avoid complications.
 	 */
 	if (rb_has_aux(rb) && vma->vm_pgoff == rb->aux_pgoff &&
-	    atomic_dec_and_mutex_lock(&rb->aux_mmap_count, &rb->aux_mutex)) {
+	    refcount_dec_and_mutex_lock(&rb->aux_mmap_count, &rb->aux_mutex)) {
 		/*
 		 * Stop all AUX events that are writing to this buffer,
 		 * so that we can free its AUX pages and corresponding PMU
@@ -6763,10 +6763,10 @@ static void perf_mmap_close(struct vm_ar
 		mutex_unlock(&rb->aux_mutex);
 	}
 
-	if (atomic_dec_and_test(&rb->mmap_count))
+	if (refcount_dec_and_test(&rb->mmap_count))
 		detach_rest = true;
 
-	if (!atomic_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
+	if (!refcount_dec_and_mutex_lock(&event->mmap_count, &event->mmap_mutex))
 		goto out_put;
 
 	ring_buffer_attach(event, NULL);
@@ -6990,17 +6990,17 @@ static int perf_mmap_rb(struct vm_area_s
 		if (data_page_nr(rb) != nr_pages)
 			return -EINVAL;
 
-		if (atomic_inc_not_zero(&event->rb->mmap_count)) {
+		if (refcount_inc_not_zero(&event->rb->mmap_count)) {
 			/*
 			 * Success -- managed to mmap() the same buffer
 			 * multiple times.
 			 */
-			atomic_inc(&event->mmap_count);
+			refcount_inc(&event->mmap_count);
 			return 0;
 		}
 		/*
 		 * Raced against perf_mmap_close()'s
-		 * atomic_dec_and_mutex_lock() remove the event and
+		 * refcount_dec_and_mutex_lock() remove the event and
 		 * continue as if !event->rb
 		 */
 		ring_buffer_attach(event, NULL);
@@ -7018,7 +7018,7 @@ static int perf_mmap_rb(struct vm_area_s
 	if (!rb)
 		return -ENOMEM;
 
-	atomic_set(&rb->mmap_count, 1);
+	refcount_set(&rb->mmap_count, 1);
 	rb->mmap_user = get_current_user();
 	rb->mmap_locked = extra;
 
@@ -7029,7 +7029,7 @@ static int perf_mmap_rb(struct vm_area_s
 	perf_event_update_userpage(event);
 
 	perf_mmap_account(vma, user_extra, extra);
-	atomic_set(&event->mmap_count, 1);
+	refcount_set(&event->mmap_count, 1);
 	return 0;
 }
 
@@ -7070,31 +7070,31 @@ static int perf_mmap_aux(struct vm_area_
 		return -EINVAL;
 
 	/* If this succeeds, subsequent failures have to undo it */
-	if (!atomic_inc_not_zero(&rb->mmap_count))
+	if (!refcount_inc_not_zero(&rb->mmap_count))
 		return -EINVAL;
 
 	/* If mapped, attach to it */
 	if (rb_has_aux(rb)) {
-		atomic_inc(&rb->aux_mmap_count);
+		refcount_inc(&rb->aux_mmap_count);
 		return 0;
 	}
 
 	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
-		atomic_dec(&rb->mmap_count);
+		refcount_dec(&rb->mmap_count);
 		return -EPERM;
 	}
 
 	ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
 			   event->attr.aux_watermark, rb_flags);
 	if (ret) {
-		atomic_dec(&rb->mmap_count);
+		refcount_dec(&rb->mmap_count);
 		return ret;
 	}
 
-	atomic_set(&rb->aux_mmap_count, 1);
+	refcount_set(&rb->aux_mmap_count, 1);
 	rb->aux_mmap_locked = extra;
 	perf_mmap_account(vma, user_extra, extra);
-	atomic_inc(&event->mmap_count);
+	refcount_inc(&event->mmap_count);
 	return 0;
 }
 
@@ -13243,7 +13243,7 @@ perf_event_set_output(struct perf_event
 	mutex_lock_double(&event->mmap_mutex, &output_event->mmap_mutex);
 set:
 	/* Can't redirect output if we've got an active mmap() */
-	if (atomic_read(&event->mmap_count))
+	if (refcount_read(&event->mmap_count))
 		goto unlock;
 
 	if (output_event) {
@@ -13256,7 +13256,7 @@ perf_event_set_output(struct perf_event
 			goto unlock;
 
 		/* did we race against perf_mmap_close() */
-		if (!atomic_read(&rb->mmap_count)) {
+		if (!refcount_read(&rb->mmap_count)) {
 			ring_buffer_put(rb);
 			goto unlock;
 		}
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -35,7 +35,7 @@ struct perf_buffer {
 	spinlock_t			event_lock;
 	struct list_head		event_list;
 
-	atomic_t			mmap_count;
+	refcount_t			mmap_count;
 	unsigned long			mmap_locked;
 	struct user_struct		*mmap_user;
 
@@ -47,7 +47,7 @@ struct perf_buffer {
 	unsigned long			aux_pgoff;
 	int				aux_nr_pages;
 	int				aux_overwrite;
-	atomic_t			aux_mmap_count;
+	refcount_t			aux_mmap_count;
 	unsigned long			aux_mmap_locked;
 	void				(*free_aux)(void *);
 	refcount_t			aux_refcount;
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -400,7 +400,7 @@ void *perf_aux_output_begin(struct perf_
 	 * the same order, see perf_mmap_close. Otherwise we end up freeing
 	 * aux pages in this path, which is a bug, because in_atomic().
 	 */
-	if (!atomic_read(&rb->aux_mmap_count))
+	if (!refcount_read(&rb->aux_mmap_count))
 		goto err;
 
 	if (!refcount_inc_not_zero(&rb->aux_refcount))



* Re: [patch 1/6] perf/core: Remove redundant condition for AUX buffer size
  2025-08-06 20:12 ` [patch 1/6] perf/core: Remove redundant condition for AUX buffer size Thomas Gleixner
@ 2025-08-07 13:30   ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 13:30 UTC
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Kees Cook

On Wed, Aug 06, 2025 at 10:12:54PM +0200, Thomas Gleixner wrote:
> It is already checked whether the VMA size is the same as
> nr_pages * PAGE_SIZE, so later checking both:

Yeah, I see it near the start of perf_mmap():

	if (vma_size != PAGE_SIZE * nr_pages)
		return -EINVAL;

>
>       aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE
>
> is redundant. Remove the vma_size check as nr_pages is what is actually
> used in the allocation function. That prepares for splitting out the buffer
> allocation into separate functions, so that only nr_pages needs to be
> handed in.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

So this seems obviously correct:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  kernel/events/core.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7043,7 +7043,7 @@ static int perf_mmap(struct file *file,
>  		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
>  			goto aux_unlock;
>
> -		if (aux_size != vma_size || aux_size != nr_pages * PAGE_SIZE)
> +		if (aux_size != nr_pages * PAGE_SIZE)
>  			goto aux_unlock;
>
>  		/* already mapped with a different size */
>


* Re: [patch 2/6] perf/core: Split out mlock limit handling
  2025-08-06 20:12 ` [patch 2/6] perf/core: Split out mlock limit handling Thomas Gleixner
@ 2025-08-07 14:14   ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 14:14 UTC
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Kees Cook

On Wed, Aug 06, 2025 at 10:12:55PM +0200, Thomas Gleixner wrote:
> To prepare for splitting the buffer allocation out into separate functions
> for the ring buffer and the AUX buffer, split out mlock limit handling into
> a helper function, which can be called from both.
>
> No functional change intended.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  kernel/events/core.c |   75 +++++++++++++++++++++++++--------------------------
>  1 file changed, 38 insertions(+), 37 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6927,17 +6927,49 @@ static int map_range(struct perf_buffer
>  	return err;
>  }
>
> +static bool perf_mmap_calc_limits(struct vm_area_struct *vma, long *user_extra, long *extra)
> +{
> +	unsigned long user_locked, user_lock_limit, locked, lock_limit;
> +	struct user_struct *user = current_user();
> +
> +	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
> +	/* Increase the limit linearly with more CPUs */
> +	user_lock_limit *= num_online_cpus();
> +
> +	user_locked = atomic_long_read(&user->locked_vm);
> +
> +	/*
> +	 * sysctl_perf_event_mlock may have changed, so that
> +	 *     user->locked_vm > user_lock_limit
> +	 */
> +	if (user_locked > user_lock_limit)
> +		user_locked = user_lock_limit;
> +	user_locked += *user_extra;
> +
> +	if (user_locked > user_lock_limit) {
> +		/*
> +		 * charge locked_vm until it hits user_lock_limit;
> +		 * charge the rest from pinned_vm
> +		 */
> +		*extra = user_locked - user_lock_limit;
> +		*user_extra -= *extra;
> +	}
> +
> +	lock_limit = rlimit(RLIMIT_MEMLOCK);
> +	lock_limit >>= PAGE_SHIFT;
> +	locked = atomic64_read(&vma->vm_mm->pinned_vm) + *extra;
> +
> +	return locked <= lock_limit || !perf_is_paranoid() || capable(CAP_IPC_LOCK);
> +}
> +
>  static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct perf_event *event = file->private_data;
> -	unsigned long user_locked, user_lock_limit;
>  	struct user_struct *user = current_user();
> +	unsigned long vma_size, nr_pages;
> +	long user_extra = 0, extra = 0;
>  	struct mutex *aux_mutex = NULL;
>  	struct perf_buffer *rb = NULL;
> -	unsigned long locked, lock_limit;
> -	unsigned long vma_size;
> -	unsigned long nr_pages;
> -	long user_extra = 0, extra = 0;
>  	int ret, flags = 0;
>  	mapped_f mapped;
>
> @@ -7063,38 +7095,7 @@ static int perf_mmap(struct file *file,
>  		}
>  	}
>
> -	user_lock_limit = sysctl_perf_event_mlock >> (PAGE_SHIFT - 10);
> -
> -	/*
> -	 * Increase the limit linearly with more CPUs:
> -	 */
> -	user_lock_limit *= num_online_cpus();
> -
> -	user_locked = atomic_long_read(&user->locked_vm);
> -
> -	/*
> -	 * sysctl_perf_event_mlock may have changed, so that
> -	 *     user->locked_vm > user_lock_limit
> -	 */
> -	if (user_locked > user_lock_limit)
> -		user_locked = user_lock_limit;
> -	user_locked += user_extra;
> -
> -	if (user_locked > user_lock_limit) {
> -		/*
> -		 * charge locked_vm until it hits user_lock_limit;
> -		 * charge the rest from pinned_vm
> -		 */
> -		extra = user_locked - user_lock_limit;
> -		user_extra -= extra;
> -	}
> -
> -	lock_limit = rlimit(RLIMIT_MEMLOCK);
> -	lock_limit >>= PAGE_SHIFT;
> -	locked = atomic64_read(&vma->vm_mm->pinned_vm) + extra;
> -
> -	if ((locked > lock_limit) && perf_is_paranoid() &&
> -		!capable(CAP_IPC_LOCK)) {
> +	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
>  		ret = -EPERM;
>  		goto unlock;
>  	}
>


* Re: [patch 3/6] perf/core: Split out VM accounting
  2025-08-06 20:12 ` [patch 3/6] perf/core: Split out VM accounting Thomas Gleixner
@ 2025-08-07 14:25   ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 14:25 UTC
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Kees Cook

On Wed, Aug 06, 2025 at 10:12:57PM +0200, Thomas Gleixner wrote:
> Similarly to the mlock limit calculation, the VM accounting is required for
> both the ringbuffer and the AUX buffer allocations.
>
> To prepare for splitting them out into separate functions, move the
> accounting into a helper function.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  kernel/events/core.c |   13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6962,10 +6962,17 @@ static bool perf_mmap_calc_limits(struct
>  	return locked <= lock_limit || !perf_is_paranoid() || capable(CAP_IPC_LOCK);
>  }
>
> +static void perf_mmap_account(struct vm_area_struct *vma, long user_extra, long extra)
> +{
> +	struct user_struct *user = current_user();
> +
> +	atomic_long_add(user_extra, &user->locked_vm);
> +	atomic64_add(extra, &vma->vm_mm->pinned_vm);
> +}
> +
>  static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct perf_event *event = file->private_data;
> -	struct user_struct *user = current_user();
>  	unsigned long vma_size, nr_pages;
>  	long user_extra = 0, extra = 0;
>  	struct mutex *aux_mutex = NULL;
> @@ -7136,9 +7143,7 @@ static int perf_mmap(struct file *file,
>
>  unlock:
>  	if (!ret) {
> -		atomic_long_add(user_extra, &user->locked_vm);
> -		atomic64_add(extra, &vma->vm_mm->pinned_vm);
> -
> +		perf_mmap_account(vma, user_extra, extra);
>  		atomic_inc(&event->mmap_count);
>  	} else if (rb) {
>  		/* AUX allocation failed */
>


* Re: [patch 4/6] perf/core: Split out ringbuffer allocation
  2025-08-06 20:12 ` [patch 4/6] perf/core: Split out ringbuffer allocation Thomas Gleixner
@ 2025-08-07 15:38   ` Lorenzo Stoakes
  2025-08-11  6:26     ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 15:38 UTC
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Kees Cook

On Wed, Aug 06, 2025 at 10:12:58PM +0200, Thomas Gleixner wrote:
> The code logic in perf_mmap() is incomprehensible and has been a source of
> subtle bugs in the past. It makes it impossible to convert the atomic_t
> reference counts to refcount_t.
>
> There is not really much which is shared between the ringbuffer and AUX
> buffer allocation code since the mlock limit calculation and the
> accounting have been split out into helper functions.
>
> Move the AUX buffer allocation code out and integrate the call with a
> temporary workaround to allow skipping the remaining ringbuffer related
> code completely. That workaround will be removed once the ringbuffer
> allocation is moved to its own function as well.
>
> No functional change.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

It's easier if you separate out the move and the refactor, but I am going
through here just to check the logic:

- Separate out checks into perf_mmap_aux()
- Add some comments
- We use scoped_guard() for the &event->rb->aux_mutex lock (see the note below)
- We actually return -EINVAL instead of stupidly relying on the default ret
  (such a footgun, that; I have hit issues with it before)
- Instead of proceeding with the rest of the code as before, we temporarily:
   - drop the event->mmap_mutex mutex

And then we explore the other differences below.
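
(Note on scoped_guard(), for anyone unfamiliar: it comes from
<linux/cleanup.h> and holds the lock for the attached statement/block,
dropping it automatically on scope exit, i.e. roughly:

	scoped_guard(mutex, &m)
		do_stuff();

is equivalent to:

	mutex_lock(&m);
	do_stuff();
	mutex_unlock(&m);

which is why there is no explicit aux_mutex unlock in the sketches below.)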

There seem to be two bugs; I've noted them inline below. The most serious is
an inverted event->rb check; the logic below assumes that's fixed.

The second is that you don't seem to be doing:

WARN_ON(!rb && event->rb);
if (vma->vm_flags & VM_WRITE)
	flags |= RING_BUFFER_WRITABLE;

in the aux code any more. Maybe the first is irrelevant, but the second
surely is?

I noted below inline anyway.
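
For reference, restoring it in perf_mmap_aux() would presumably just be a
matter of (sketch, using the rb_flags local from the new helper):

	if (vma->vm_flags & VM_WRITE)
		rb_flags |= RING_BUFFER_WRITABLE;

before the rb_alloc_aux() call.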

OK, moving on: perf_mmap_aux() therefore does the stuff in perf_mmap() that
would previously be executed, which would be:

~~~~~~~~~~ 1. ~~~~~~~~~~

-- BEFORE --

	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
		ret = -EPERM;
		goto unlock;
	}
...
unlock:
	if (!ret) {
		...
	} else if (rb) {
		/* AUX allocation failed */
		atomic_dec(&rb->mmap_count);
	}

aux_unlock:
	if (aux_mutex)
		mutex_unlock(aux_mutex);
	mutex_unlock(&event->mmap_mutex);

	if (ret)
		return ret;

-- AFTER --

(We already checked that event->rb is non-NULL)


	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
		atomic_dec(&rb->mmap_count);
		return -EPERM;
	}
	... < scoped_guard takes care of aux_mutex >

		// Temporary workaround to split out AUX handling first
		mutex_unlock(&event->mmap_mutex);
		goto out;
 	}

out:
 	if (ret)
 		return ret;

(seems equivalent)

~~~~~~~~~~ 2. ~~~~~~~~~~

-- BEFORE --

	WARN_ON(!rb && event->rb);

	if (vma->vm_flags & VM_WRITE)
		flags |= RING_BUFFER_WRITABLE;

	...

	} else {
		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
				   event->attr.aux_watermark, flags);
		if (!ret) {
			atomic_set(&rb->aux_mmap_count, 1);
			rb->aux_mmap_locked = extra;
		}
	}

unlock:
	if (!ret) {
		perf_mmap_account(vma, user_extra, extra);
		atomic_inc(&event->mmap_count);
	} else if (rb) {
		/* AUX allocation failed */
		atomic_dec(&rb->mmap_count);
	}

aux_unlock:
	if (aux_mutex)
		mutex_unlock(aux_mutex);
	mutex_unlock(&event->mmap_mutex);

	if (ret)
		return ret;

	... vm_flags_set() blah blah ...

-- AFTER --

	ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
			   event->attr.aux_watermark, rb_flags);
	if (ret) {
		atomic_dec(&rb->mmap_count);
		return ret;
	}

	atomic_set(&rb->aux_mmap_count, 1);
	rb->aux_mmap_locked = extra;
	perf_mmap_account(vma, user_extra, extra);
	atomic_inc(&event->mmap_count);
	return 0;

	... < scoped_guard takes care of aux_mutex >

		// Temporary workaround to split out AUX handling first
		mutex_unlock(&event->mmap_mutex);
		goto out;
 	}

out:
 	if (ret)
 		return ret;

	... vm_flags_set() blah blah ...


-- rb_alloc_aux() failure path --

BEFORE

WARN_ON(!rb && event->rb);
if (vma->vm_flags & VM_WRITE)
	flags |= RING_BUFFER_WRITABLE;
atomic_dec(&rb->mmap_count);
if (aux_mutex)
	mutex_unlock(aux_mutex);
mutex_unlock(&event->mmap_mutex);
if (ret)
	return ret;

AFTER

atomic_dec(&rb->mmap_count);
.. < scoped_guard takes care of aux_mutex >
mutex_unlock(&event->mmap_mutex);
if (ret)
	return ret;

--  rb_alloc_aux() success path --

BEFORE

WARN_ON(!rb && event->rb);
if (vma->vm_flags & VM_WRITE)
	flags |= RING_BUFFER_WRITABLE;
atomic_set(&rb->aux_mmap_count, 1);
rb->aux_mmap_locked = extra;
perf_mmap_account(vma, user_extra, extra);
atomic_inc(&event->mmap_count);
if (aux_mutex)
	mutex_unlock(aux_mutex);
mutex_unlock(&event->mmap_mutex);
...vm_flags_set() blah blah  ...

AFTER

atomic_set(&rb->aux_mmap_count, 1);
rb->aux_mmap_locked = extra;
.. < scoped_guard takes care of aux_mutex >

perf_mmap_account(vma, user_extra, extra);
atomic_inc(&event->mmap_count);
mutex_unlock(&event->mmap_mutex);
...vm_flags_set() blah blah  ...


DIFFERENCES:

If we get to the rb_alloc_aux() bit, we're missing the:

WARN_ON(!rb && event->rb);
if (vma->vm_flags & VM_WRITE)
	flags |= RING_BUFFER_WRITABLE;

bit for the aux case.

Otherwise, it seems to be equivalent.

> ---
>  kernel/events/core.c |  134 ++++++++++++++++++++++++++++-----------------------
>  1 file changed, 75 insertions(+), 59 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6970,12 +6970,76 @@ static void perf_mmap_account(struct vm_
>  	atomic64_add(extra, &vma->vm_mm->pinned_vm);
>  }
>
> +static int perf_mmap_aux(struct vm_area_struct *vma, struct perf_event *event,
> +			 unsigned long nr_pages)
> +{
> +	long user_extra = nr_pages, extra = 0;
> +	struct perf_buffer *rb = event->rb;
> +	u64 aux_offset, aux_size;
> +	int ret, rb_flags = 0;
> +
> +	/*
> +	 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
> +	 * mapped, all subsequent mappings should have the same size
> +	 * and offset. Must be above the normal perf buffer.
> +	 */
> +	aux_offset = READ_ONCE(rb->user_page->aux_offset);
> +	aux_size = READ_ONCE(rb->user_page->aux_size);
> +
> +	if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
> +		return -EINVAL;
> +
> +	if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
> +		return -EINVAL;
> +
> +	/* Already mapped with a different offset */
> +	if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
> +		return -EINVAL;
> +
> +	if (aux_size != nr_pages * PAGE_SIZE)
> +		return -EINVAL;
> +
> +	/* Already mapped with a different size */
> +	if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
> +		return -EINVAL;
> +
> +	if (!is_power_of_2(nr_pages))
> +		return -EINVAL;
> +
> +	/* If this succeeds, subsequent failures have to undo it */
> +	if (!atomic_inc_not_zero(&rb->mmap_count))
> +		return -EINVAL;
> +
> +	/* If mapped, attach to it */
> +	if (rb_has_aux(rb)) {
> +		atomic_inc(&rb->aux_mmap_count);
> +		return 0;
> +	}
> +
> +	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
> +		atomic_dec(&rb->mmap_count);
> +		return -EPERM;
> +	}
> +
> +	ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
> +			   event->attr.aux_watermark, rb_flags);
> +	if (ret) {
> +		atomic_dec(&rb->mmap_count);
> +		return ret;
> +	}
> +
> +	atomic_set(&rb->aux_mmap_count, 1);
> +	rb->aux_mmap_locked = extra;
> +	perf_mmap_account(vma, user_extra, extra);
> +	atomic_inc(&event->mmap_count);
> +	return 0;
> +}
> +
>  static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	struct perf_event *event = file->private_data;
>  	unsigned long vma_size, nr_pages;
>  	long user_extra = 0, extra = 0;
> -	struct mutex *aux_mutex = NULL;
>  	struct perf_buffer *rb = NULL;
>  	int ret, flags = 0;
>  	mapped_f mapped;
> @@ -7055,51 +7119,15 @@ static int perf_mmap(struct file *file,
>  		}
>
>  	} else {
> -		/*
> -		 * AUX area mapping: if rb->aux_nr_pages != 0, it's already
> -		 * mapped, all subsequent mappings should have the same size
> -		 * and offset. Must be above the normal perf buffer.
> -		 */
> -		u64 aux_offset, aux_size;
> -
> -		rb = event->rb;
> -		if (!rb)
> -			goto aux_unlock;
> -
> -		aux_mutex = &rb->aux_mutex;
> -		mutex_lock(aux_mutex);
> -
> -		aux_offset = READ_ONCE(rb->user_page->aux_offset);
> -		aux_size = READ_ONCE(rb->user_page->aux_size);
> -
> -		if (aux_offset < perf_data_size(rb) + PAGE_SIZE)
> -			goto aux_unlock;
> -
> -		if (aux_offset != vma->vm_pgoff << PAGE_SHIFT)
> -			goto aux_unlock;
> -
> -		/* already mapped with a different offset */
> -		if (rb_has_aux(rb) && rb->aux_pgoff != vma->vm_pgoff)
> -			goto aux_unlock;
> -
> -		if (aux_size != nr_pages * PAGE_SIZE)
> -			goto aux_unlock;
> -
> -		/* already mapped with a different size */
> -		if (rb_has_aux(rb) && rb->aux_nr_pages != nr_pages)
> -			goto aux_unlock;
> -
> -		if (!is_power_of_2(nr_pages))
> -			goto aux_unlock;
> -
> -		if (!atomic_inc_not_zero(&rb->mmap_count))
> -			goto aux_unlock;
> -
> -		if (rb_has_aux(rb)) {
> -			atomic_inc(&rb->aux_mmap_count);
> -			ret = 0;
> -			goto unlock;
> +		if (event->rb) {
> +			ret = -EINVAL;

Shouldn't this be if (!event->rb) ?

> +		} else {

Because here you're dereferencing event->rb in the branch where !event->rb?
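
Presumably the intent was something like:

	if (!event->rb) {
		ret = -EINVAL;
	} else {
		scoped_guard(mutex, &event->rb->aux_mutex)
			ret = perf_mmap_aux(vma, event, nr_pages);
	}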

> +			scoped_guard(mutex, &event->rb->aux_mutex)
> +				ret = perf_mmap_aux(vma, event, nr_pages);
>  		}
> +		// Temporary workaround to split out AUX handling first
> +		mutex_unlock(&event->mmap_mutex);
> +		goto out;

Noted above but you're now skipping the:

	WARN_ON(!rb && event->rb);

	if (vma->vm_flags & VM_WRITE)
		flags |= RING_BUFFER_WRITABLE;

bit in the aux case, no? Unless I'm missing something?

>  	}
>
>  	if (!perf_mmap_calc_limits(vma, &user_extra, &extra)) {
> @@ -7132,28 +7160,16 @@ static int perf_mmap(struct file *file,
>  		perf_event_init_userpage(event);
>  		perf_event_update_userpage(event);
>  		ret = 0;
> -	} else {
> -		ret = rb_alloc_aux(rb, event, vma->vm_pgoff, nr_pages,
> -				   event->attr.aux_watermark, flags);
> -		if (!ret) {
> -			atomic_set(&rb->aux_mmap_count, 1);
> -			rb->aux_mmap_locked = extra;
> -		}
>  	}
> -
>  unlock:
>  	if (!ret) {
>  		perf_mmap_account(vma, user_extra, extra);
>  		atomic_inc(&event->mmap_count);
> -	} else if (rb) {
> -		/* AUX allocation failed */
> -		atomic_dec(&rb->mmap_count);
>  	}
> -aux_unlock:
> -	if (aux_mutex)
> -		mutex_unlock(aux_mutex);
>  	mutex_unlock(&event->mmap_mutex);
>
> +// Temporary until RB allocation is split out.
> +out:
>  	if (ret)
>  		return ret;
>
>

Cheers, Lorenzo


* Re: [patch 0/6] perf: Convert mmap() related reference counts to refcount_t
  2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
                   ` (5 preceding siblings ...)
  2025-08-06 20:13 ` [patch 6/6] perf/core: Convert mmap() refcounts to refcount_t Thomas Gleixner
@ 2025-08-07 15:39 ` Lorenzo Stoakes
  6 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2025-08-07 15:39 UTC
  To: Thomas Gleixner
  Cc: LKML, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Kees Cook

On Wed, Aug 06, 2025 at 10:12:52PM +0200, Thomas Gleixner wrote:
> The recently fixed reference count leaks could have been detected by using
> refcount_t, and refcount_t would at least have mitigated the potential
> overflow.
>
> It turned out that converting the code as-is does not work: the allocation
> code ends up doing a refcount_inc() for the first allocation, which causes
> the refcount_t sanity checks to emit a use-after-free warning.
>
> The reason is that the code shares functionality at the wrong level and
> ends up being overly complicated for no reason. That's what inevitably led
> to the refcount leak problems.
>
> Address this by splitting the ringbuffer and the AUX buffer mapping and
> allocation parts out into separate functions, which handle the reference
> counts in a sane way.
>
> That not only simplifies the code and makes it halfway comprehensible, but
> also allows converting the mmap() related reference counts to refcount_t.
>
> It survives lightweight testing with perf and passes the perf/mmap
> selftest.
>
> The series applies on top of Linus' tree and is also available from git:
>
>     git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git perf/refcounts
>
> Thanks,
>
> 	tglx
> ---
>  include/linux/perf_event.h  |    2
>  kernel/events/core.c        |  361 ++++++++++++++++++++++----------------------
>  kernel/events/internal.h    |    4
>  kernel/events/ring_buffer.c |    2
>  4 files changed, 185 insertions(+), 184 deletions(-)

Found what appear to be a couple of bugs in 4/6; will pause review until they
are addressed, as it seems that each patch fundamentally relies on the former,
etc. etc., and the fixes will likely shuffle things around.

Will resume checks on respin, or once you indicate that my review has a mistake in it :)

Cheers, Lorenzo


* Re: [patch 4/6] perf/core: Split out ringbuffer allocation
  2025-08-07 15:38   ` Lorenzo Stoakes
@ 2025-08-11  6:26     ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2025-08-11  6:26 UTC
  To: Lorenzo Stoakes
  Cc: LKML, Linus Torvalds, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
	Arnaldo Carvalho de Melo, Kees Cook

On Thu, Aug 07 2025 at 16:38, Lorenzo Stoakes wrote:
> On Wed, Aug 06, 2025 at 10:12:58PM +0200, Thomas Gleixner wrote:
> The second is that you don't seem to be doing:
>
> WARN_ON(!rb && event->rb);
> if (vma->vm_flags & VM_WRITE)
> 	flags |= RING_BUFFER_WRITABLE;
>
> in the aux code any more. Maybe the first is irrelevant, but the second
> surely is?

Yeah. The first one is kinda silly. The second one I dropped unintentionally.

> DIFFERENCES:
>
> If we get to the rb_alloc_aux() bit, we're missing the:
>
> WARN_ON(!rb && event->rb);
> if (vma->vm_flags & VM_WRITE)
> 	flags |= RING_BUFFER_WRITABLE;
>
> bit for the aux case.
>
> Otherwise, it seems to be equivalent.

Thanks for taking the time to go through this.

>> -		if (rb_has_aux(rb)) {
>> -			atomic_inc(&rb->aux_mmap_count);
>> -			ret = 0;
>> -			goto unlock;
>> +		if (event->rb) {
>> +			ret = -EINVAL;
>
> Shouldn't this be if (!event->rb) ?
>
>> +		} else {
>
> Because here you're dereferencing event->rb in the branch where !event->rb?

Yes. I obviously failed to test this particular patch alone, and that's
fixed up in the next one, which moves the RB allocation out, so it did not
blow up in my face when I tested the whole pile.

Thanks for spotting!

       tglx



Thread overview: 13+ messages
2025-08-06 20:12 [patch 0/6] perf: Convert mmap() related reference counts to refcount_t Thomas Gleixner
2025-08-06 20:12 ` [patch 1/6] perf/core: Remove redundant condition for AUX buffer size Thomas Gleixner
2025-08-07 13:30   ` Lorenzo Stoakes
2025-08-06 20:12 ` [patch 2/6] perf/core: Split out mlock limit handling Thomas Gleixner
2025-08-07 14:14   ` Lorenzo Stoakes
2025-08-06 20:12 ` [patch 3/6] perf/core: Split out VM accounting Thomas Gleixner
2025-08-07 14:25   ` Lorenzo Stoakes
2025-08-06 20:12 ` [patch 4/6] perf/core: Split out ringbuffer allocation Thomas Gleixner
2025-08-07 15:38   ` Lorenzo Stoakes
2025-08-11  6:26     ` Thomas Gleixner
2025-08-06 20:13 ` [patch 5/6] perf/core: Split the ringbuffer mmap() and allocation code out Thomas Gleixner
2025-08-06 20:13 ` [patch 6/6] perf/core: Convert mmap() refcounts to refcount_t Thomas Gleixner
2025-08-07 15:39 ` [patch 0/6] perf: Convert mmap() related reference counts " Lorenzo Stoakes
