public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind
@ 2024-10-08 18:34 Lucas De Marchi
  2024-10-08 18:34 ` [PATCH 1/5] perf: Add dummy pmu module Lucas De Marchi
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-08 18:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin,
	Lucas De Marchi

v2 of my attempt at fixing how i915 interacts with perf events.

v1 - https://lore.kernel.org/all/20240722210648.80892-1-lucas.demarchi@intel.com/

From other people:
1) https://lore.kernel.org/lkml/20240115170120.662220-1-tvrtko.ursulin@linux.intel.com/
2) https://lore.kernel.org/all/20240213180302.47266-1-umesh.nerlige.ramappa@intel.com/

WARNING: patches 1, 4 and 5 are NOT intended to be applied as is. More
on this below.

This series basically builds on the idea of the first patch of my
previous series, but extends it in a way that

	a) the other patches are not needed  (at least, not as is) and
	b) driver can rebind just fine - no issues with the new call to
	   perf_pmu_register()

Another difference is that rather than mixing i915 cleanups this just
adds a dummy pmu with no backing HW. Intention for dummy_pmu is for
experimental purpose and to demonstrate changes tha need to be applied
to i915 (and probably amdgpu, and also in the pending xe patch).
If desired to have an example like that in tree, then we should hide it
behind a config option and I'd need to re-check the error handling.

With this set I could run the following test script multiple times with
no issues observed:

	#!/bin/bash

	set -e 

	rand_sleep() {
		sleep $(bc -l  <<< "$(shuf -i 0-3000 -n 1) / 1000")
	}

	test_noperf() {
		echo 0 > /sys/kernel/debug/dummy_pmu/bind

		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
	}

	test_perf_before() {
		echo 0 > /sys/kernel/debug/dummy_pmu/bind

		perf stat --interval-count 2 -e dummy_pmu_0/test-event-1/ -I500
		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
	}

	test_kill_perf_later() {
		echo 0 > /sys/kernel/debug/dummy_pmu/bind

		perf stat -e dummy_pmu_0/test-event-1/ -I500 &
		pid=$!
		rand_sleep
		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
		rand_sleep
		kill $pid
	}

	test_kill_perf_laaaaaaater() {
		echo 0 > /sys/kernel/debug/dummy_pmu/bind

		perf stat -e dummy_pmu_0/test-event-1/ -I500 &
		pid=$!
		rand_sleep
		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
		rand_sleep
		rand_sleep
		rand_sleep
		kill $pid
	}

	test_kill_perf_with_leader() {
		echo 0 > /sys/kernel/debug/dummy_pmu/bind

		perf stat -e '{dummy_pmu_0/test-event-1/,dummy_pmu_0/test-event-2/}:S' -I500 &
		pid=$!
		rand_sleep
		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
		rand_sleep
		rand_sleep
		kill $pid
	}

	N=${1:-1}

	for ((i=0; i<N; i++)); do
		printf "%04u/%04u\n" "$((i+1))" "$N" >&2
		test_noperf
		test_perf_before
		test_kill_perf_later
		test_kill_perf_laaaaaaater
		test_kill_perf_with_leader
		echo >&2
	done

Last patch is optional for a driver and not needed for the fix.

Open topics:

	- If something like the last patch is desirable, should it be
	  done from inside perf_pmu_unregister()?

	- Should we have a dummy_pmu (or whatever name) behind a config,
	  or maybe in Documentation/ ?

thanks,
Lucas De Marchi

Lucas De Marchi (5):
  perf: Add dummy pmu module
  perf: Move free outside of the mutex
  perf: Add pmu get/put
  perf/dummy_pmu: Tie pmu to device lifecycle
  perf/dummy_pmu: Track and disable active events

 include/linux/perf_event.h |  12 +
 kernel/events/Makefile     |   1 +
 kernel/events/core.c       |  39 ++-
 kernel/events/dummy_pmu.c  | 492 +++++++++++++++++++++++++++++++++++++
 4 files changed, 539 insertions(+), 5 deletions(-)
 create mode 100644 kernel/events/dummy_pmu.c

-- 
2.46.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] perf: Add dummy pmu module
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
@ 2024-10-08 18:34 ` Lucas De Marchi
  2024-10-15  0:26   ` Jeff Johnson
  2024-10-16  8:51   ` Peter Zijlstra
  2024-10-08 18:34 ` [PATCH 2/5] perf: Move free outside of the mutex Lucas De Marchi
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-08 18:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin,
	Lucas De Marchi

Simple dummy module that mimics what can be done with drivers to
bind/unbind them from the bus, which should trigger resource release.

This is mostly based on how i915 and (pending changes for) xe drivers
are interacting with perf pmu. A few differences due to not having
backing hardware or for simplicity:

	- Instead of using BDF for bind/unbind, use a single number.
	- Unbind is triggered either via debugfs or when removing the
	  module.
	- event::destroy() is always assigned as there should only be a
	  few additional calls

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 kernel/events/Makefile    |   1 +
 kernel/events/dummy_pmu.c | 426 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 427 insertions(+)
 create mode 100644 kernel/events/dummy_pmu.c

diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 91a62f566743..2993fed2d091 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -4,3 +4,4 @@ obj-y := core.o ring_buffer.o callchain.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_HW_BREAKPOINT_KUNIT_TEST) += hw_breakpoint_test.o
 obj-$(CONFIG_UPROBES) += uprobes.o
+obj-m += dummy_pmu.o
diff --git a/kernel/events/dummy_pmu.c b/kernel/events/dummy_pmu.c
new file mode 100644
index 000000000000..cdba3a831e4a
--- /dev/null
+++ b/kernel/events/dummy_pmu.c
@@ -0,0 +1,426 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright © 2024 Intel Corporation
+ */
+
+#define pr_fmt(fmt) "%s: " fmt, KBUILD_MODNAME
+
+#include <linux/debugfs.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/perf_event.h>
+#include <linux/random.h>
+#include <linux/seq_file.h>
+#include <linux/types.h>
+
+struct dummy_mod {
+	struct dentry *debugfs_root;
+
+	struct list_head device_list;
+	struct mutex mutex;
+};
+
+struct dummy_pmu {
+	struct pmu base;
+	char *name;
+	bool registered;
+};
+
+struct dummy_device {
+	unsigned int instance;
+	struct kref refcount;
+	struct list_head mod_entry;
+	struct dummy_pmu pmu;
+};
+
+static struct dummy_mod dm;
+
+static void device_release(struct kref *ref);
+
+static struct dummy_pmu *event_to_pmu(struct perf_event *event)
+{
+	return container_of(event->pmu, struct dummy_pmu, base);
+}
+
+static struct dummy_device *pmu_to_device(struct dummy_pmu *pmu)
+{
+	return container_of(pmu, struct dummy_device, pmu);
+}
+
+static ssize_t dummy_pmu_events_sysfs_show(struct device *dev,
+					   struct device_attribute *attr,
+					   char *page)
+{
+	struct perf_pmu_events_attr *pmu_attr;
+
+	pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+
+	return sprintf(page, "event=0x%04llx\n", pmu_attr->id);
+}
+
+#define DUMMY_PMU_EVENT_ATTR(name, config)				\
+	PMU_EVENT_ATTR_ID(name, dummy_pmu_events_sysfs_show, config)
+
+PMU_FORMAT_ATTR(event, "config:0-63");
+
+#define DUMMY1_CONFIG 0x01
+#define DUMMY2_CONFIG 0x02
+
+static struct attribute *dummy_pmu_event_attrs[] = {
+	DUMMY_PMU_EVENT_ATTR(test-event-1, DUMMY1_CONFIG),
+	DUMMY_PMU_EVENT_ATTR(test-event-2, DUMMY2_CONFIG),
+	NULL,
+};
+
+static struct attribute *dummy_pmu_format_attrs[] = {
+	&format_attr_event.attr,
+	NULL,
+};
+static const struct attribute_group dummy_pmu_events_attr_group = {
+	.name = "events",
+	.attrs = dummy_pmu_event_attrs,
+};
+static const struct attribute_group dummy_pmu_format_attr_group = {
+	.name = "format",
+	.attrs = dummy_pmu_format_attrs,
+};
+static const struct attribute_group *attr_groups[] = {
+	&dummy_pmu_format_attr_group,
+	&dummy_pmu_events_attr_group,
+	NULL,
+};
+
+static void dummy_pmu_event_destroy(struct perf_event *event)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+	struct dummy_device *d = pmu_to_device(pmu);
+
+	kref_put(&d->refcount, device_release);
+}
+
+static int dummy_pmu_event_init(struct perf_event *event)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+	struct dummy_device *d = pmu_to_device(pmu);
+
+	if (!pmu->registered)
+		return -ENODEV;
+
+	if (event->attr.type != event->pmu->type)
+		return -ENOENT;
+
+	/* unsupported modes and filters */
+	if (event->attr.sample_period) /* no sampling */
+		return -EINVAL;
+
+	if (has_branch_stack(event))
+		return -EOPNOTSUPP;
+
+	if (event->cpu < 0)
+		return -EINVAL;
+
+	/* Event keeps a ref to maintain PMU allocated, even if it's unregistered */
+	kref_get(&d->refcount);
+	event->destroy = dummy_pmu_event_destroy;
+
+	return 0;
+}
+
+static void dummy_pmu_event_start(struct perf_event *event, int flags)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+
+	if (!pmu->registered)
+		return;
+
+	event->hw.state = 0;
+}
+
+static void dummy_pmu_event_read(struct perf_event *event)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+	u8 buf;
+
+	if (!pmu->registered) {
+		event->hw.state = PERF_HES_STOPPED;
+		return;
+	}
+
+	get_random_bytes(&buf, 1);
+	buf %= 10;
+
+	switch (event->attr.config & 0xf) {
+	case DUMMY1_CONFIG:
+		break;
+	case DUMMY2_CONFIG:
+		buf *= 2;
+		break;
+	}
+
+	local64_add(buf, &event->count);
+}
+
+static void dummy_pmu_event_stop(struct perf_event *event, int flags)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+
+	if (!pmu->registered)
+		goto out;
+
+	if (flags & PERF_EF_UPDATE)
+		dummy_pmu_event_read(event);
+
+out:
+	event->hw.state = PERF_HES_STOPPED;
+}
+
+static int dummy_pmu_event_add(struct perf_event *event, int flags)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+
+	if (!pmu->registered)
+		return -ENODEV;
+
+	if (flags & PERF_EF_START)
+		dummy_pmu_event_start(event, flags);
+
+	return 0;
+
+}
+
+static void dummy_pmu_event_del(struct perf_event *event, int flags)
+{
+	dummy_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int device_init(struct dummy_device *d)
+{
+	int ret;
+
+	d->pmu.base = (struct pmu){
+		.attr_groups	= attr_groups,
+		.module		= THIS_MODULE,
+		.task_ctx_nr	= perf_invalid_context,
+		.event_init	= dummy_pmu_event_init,
+		.add		= dummy_pmu_event_add,
+		.del		= dummy_pmu_event_del,
+		.start		= dummy_pmu_event_start,
+		.stop		= dummy_pmu_event_stop,
+		.read		= dummy_pmu_event_read,
+	};
+
+	d->pmu.name = kasprintf(GFP_KERNEL, "dummy_pmu_%u", d->instance);
+	if (!d->pmu.name)
+		return -ENOMEM;
+
+	ret = perf_pmu_register(&d->pmu.base, d->pmu.name, -1);
+	if (ret)
+		return ret;
+
+	d->pmu.registered = true;
+	pr_info("Device registered: %s\n", d->pmu.name);
+
+	return 0;
+}
+
+static void device_exit(struct dummy_device *d)
+{
+	d->pmu.registered = false;
+	perf_pmu_unregister(&d->pmu.base);
+
+	pr_info("Device released: %s\n", d->pmu.name);
+}
+
+static void device_release(struct kref *ref)
+{
+	struct dummy_device *d = container_of(ref, struct dummy_device, refcount);
+
+	kfree(d->pmu.name);
+	kfree(d);
+}
+
+static struct dummy_device *find_device_locked(struct dummy_mod *m, unsigned int instance)
+{
+	struct dummy_device *d;
+
+	list_for_each_entry(d, &m->device_list, mod_entry)
+		if (d->instance == instance)
+			return d;
+
+	return NULL;
+}
+
+static int dummy_add_device(struct dummy_mod *m, unsigned int instance)
+{
+	struct dummy_device *d, *d2;
+	int ret = 0;
+
+	mutex_lock(&m->mutex);
+	d = find_device_locked(m, instance);
+	mutex_unlock(&m->mutex);
+	if (d)
+		return -EINVAL;
+
+	d = kcalloc(1, sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return -ENOMEM;
+
+	kref_init(&d->refcount);
+	d->instance = instance;
+
+	ret = device_init(d);
+	if (ret < 0)
+		goto fail_put;
+
+	mutex_lock(&m->mutex);
+	d2 = find_device_locked(m, instance);
+	if (d2) {
+		mutex_unlock(&m->mutex);
+		ret = -EINVAL;
+		goto fail_exit;
+	}
+	list_add(&d->mod_entry, &m->device_list);
+	mutex_unlock(&m->mutex);
+
+	return 0;
+
+fail_exit:
+	device_exit(d);
+fail_put:
+	kref_put(&d->refcount, device_release);
+	return ret;
+}
+
+static int dummy_del_device(struct dummy_mod *m, unsigned int instance)
+{
+	struct dummy_device *d, *found = NULL;
+
+	mutex_lock(&m->mutex);
+	list_for_each_entry(d, &m->device_list, mod_entry) {
+		if (d->instance == instance) {
+			list_del(&d->mod_entry);
+			found = d;
+			break;
+		}
+	}
+	mutex_unlock(&m->mutex);
+
+	if (!found)
+		return -EINVAL;
+
+	device_exit(found);
+	kref_put(&found->refcount, device_release);
+
+	return 0;
+}
+
+static int parse_device(const char __user *ubuf, size_t size, u32 *instance)
+{
+	char buf[16];
+	ssize_t len;
+
+	if (size > sizeof(buf) - 1)
+		return -E2BIG;
+
+	len = strncpy_from_user(buf, ubuf, sizeof(buf));
+	if (len < 0 || len >= sizeof(buf) - 1)
+		return -E2BIG;
+
+	if (kstrtou32(buf, 0, instance))
+		return -EINVAL;
+
+	return size;
+}
+
+static int bind_show(struct seq_file *s, void *unused)
+{
+	struct dummy_mod *m = s->private;
+	struct dummy_device *d;
+
+	mutex_lock(&m->mutex);
+	list_for_each_entry(d, &m->device_list, mod_entry)
+		seq_printf(s, "%u\n", d->instance);
+	mutex_unlock(&m->mutex);
+
+	return 0;
+}
+
+static ssize_t bind_write(struct file *f, const char __user *ubuf,
+			  size_t size, loff_t *pos)
+{
+	struct dummy_mod *m = file_inode(f)->i_private;
+	u32 instance;
+	ssize_t ret;
+
+	ret = parse_device(ubuf, size, &instance);
+	if (ret < 0)
+		return ret;
+
+	ret = dummy_add_device(m, instance);
+	if (ret < 0)
+		return ret;
+
+	return size;
+}
+DEFINE_SHOW_STORE_ATTRIBUTE(bind);
+
+static int unbind_show(struct seq_file *s, void *unused)
+{
+	return -EPERM;
+}
+
+static ssize_t unbind_write(struct file *f, const char __user *ubuf,
+			    size_t size, loff_t *pos)
+{
+	struct dummy_mod *m = file_inode(f)->i_private;
+	unsigned int instance;
+	ssize_t ret;
+
+	ret = parse_device(ubuf, size, &instance);
+	if (ret < 0)
+		return ret;
+
+	ret = dummy_del_device(m, instance);
+	if (ret < 0)
+		return ret;
+
+	return size;
+}
+DEFINE_SHOW_STORE_ATTRIBUTE(unbind);
+
+static int __init dummy_init(void)
+{
+	struct dentry *dir;
+
+	dir = debugfs_create_dir(KBUILD_MODNAME, NULL);
+	debugfs_create_file("bind", 0600, dir, &dm, &bind_fops);
+	debugfs_create_file("unbind", 0200, dir, &dm, &unbind_fops);
+
+	dm.debugfs_root = dir;
+	INIT_LIST_HEAD(&dm.device_list);
+	mutex_init(&dm.mutex);
+
+	return 0;
+}
+
+static void dummy_exit(void)
+{
+	struct dummy_device *d, *tmp;
+
+	debugfs_remove_recursive(dm.debugfs_root);
+
+	mutex_lock(&dm.mutex);
+	list_for_each_entry_safe(d, tmp, &dm.device_list, mod_entry) {
+		device_exit(d);
+		kref_put(&d->refcount, device_release);
+	}
+	mutex_unlock(&dm.mutex);
+}
+
+module_init(dummy_init);
+module_exit(dummy_exit);
+
+MODULE_AUTHOR("Lucas De Marchi <lucas.demarchi@intel.com>");
+MODULE_LICENSE("GPL");
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/5] perf: Move free outside of the mutex
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
  2024-10-08 18:34 ` [PATCH 1/5] perf: Add dummy pmu module Lucas De Marchi
@ 2024-10-08 18:34 ` Lucas De Marchi
  2024-10-08 18:34 ` [PATCH 3/5] perf: Add pmu get/put Lucas De Marchi
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-08 18:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin,
	Lucas De Marchi

It's not needed to hold the mutex to free the percpu variables stored in
pmu. Move them outside of the mutex protection in preparation for
possibly allowing them to live longer, according to the lifecycle of the
object owning/containing the pmu.

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 kernel/events/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3b8b85adb10a..6395dbf67671 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11845,7 +11845,6 @@ void perf_pmu_unregister(struct pmu *pmu)
 	synchronize_srcu(&pmus_srcu);
 	synchronize_rcu();
 
-	free_percpu(pmu->pmu_disable_count);
 	idr_remove(&pmu_idr, pmu->type);
 	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		if (pmu->nr_addr_filters)
@@ -11853,8 +11852,11 @@ void perf_pmu_unregister(struct pmu *pmu)
 		device_del(pmu->dev);
 		put_device(pmu->dev);
 	}
-	free_pmu_context(pmu);
+
 	mutex_unlock(&pmus_lock);
+
+	free_percpu(pmu->pmu_disable_count);
+	free_pmu_context(pmu);
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] perf: Add pmu get/put
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
  2024-10-08 18:34 ` [PATCH 1/5] perf: Add dummy pmu module Lucas De Marchi
  2024-10-08 18:34 ` [PATCH 2/5] perf: Move free outside of the mutex Lucas De Marchi
@ 2024-10-08 18:34 ` Lucas De Marchi
  2024-10-09 10:44   ` kernel test robot
  2024-10-14 17:32   ` Peter Zijlstra
  2024-10-08 18:35 ` [PATCH 4/5] perf/dummy_pmu: Tie pmu to device lifecycle Lucas De Marchi
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-08 18:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin,
	Lucas De Marchi

If a pmu is unregistered while there's an active event, perf will still
access the pmu via event->pmu, even after the event is destroyed. This
makes it difficult for drivers like i915 that can be unbound from the
HW.

	BUG: KASAN: use-after-free in exclusive_event_destroy+0xd8/0xf0
	Read of size 4 at addr ffff88816e2bb63c by task perf/7748

i915 tries to cope with it by installing a event->destroy, but that is
not sufficient: if pmu is released by the driver, it will still crash
since event->pmu is still used.

Moreover, even with that use-after-free fixed by adjusting the order in
_free_event() or delaying the free by the driver, kernel still oops when
closing the event fd related to a unregistered pmu: the percpu variables
allocated on perf_pmu_register() would already be released. One such
crash is:

	BUG: KASAN: user-memory-access in _raw_spin_lock_irqsave+0x88/0x100
	Write of size 4 at addr 00000000ffffffff by task perf/727

	CPU: 0 UID: 0 PID: 727 Comm: perf Not tainted 6.12.0-rc1-DEMARCHI-dxnf+ #9
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
	Call Trace:
	 <TASK>
	 dump_stack_lvl+0x5f/0x90
	 print_report+0x4d3/0x50a
	 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
	 ? kasan_addr_to_slab+0xd/0xb0
	 kasan_report+0xe2/0x170
	 ? _raw_spin_lock_irqsave+0x88/0x100
	 ? _raw_spin_lock_irqsave+0x88/0x100
	 kasan_check_range+0x125/0x230
	 __kasan_check_write+0x14/0x30
	 _raw_spin_lock_irqsave+0x88/0x100
	 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
	 _atomic_dec_and_raw_lock_irqsave+0x89/0x110
	 ? __kasan_check_write+0x14/0x30
	 put_pmu_ctx+0x98/0x330

The fix here is to provide a set of get/put hooks that drivers can
implement to piggy back the perf's pmu lifecycle to the driver's
instance lifecycle.  With this, perf_pmu_unregister() can be called by
the driver, which is then responsible for freeing the resources.

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 include/linux/perf_event.h | 12 ++++++++++++
 kernel/events/core.c       | 37 ++++++++++++++++++++++++++++++++-----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..d6983dbf5a45 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -561,6 +561,17 @@ struct pmu {
 	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
 	 */
 	int (*check_period)		(struct perf_event *event, u64 value); /* optional */
+
+	/*
+	 * Optional: get a reference. Typically needed by PMUs that are bound to a device
+	 * that can be hotplugged, either physically or through sysfs' bind/unbind. When provided,
+	 * pmu::put() is mandatory and it's driver responsibility to call perf_pmu_free() when
+	 * resources can be released.
+	 */
+	struct pmu *(*get)		(struct pmu *pmu);
+
+	/* Optional: put a reference. See pmu::get() */
+	 void (*put)			(struct pmu *pmu);
 };
 
 enum perf_addr_filter_action_t {
@@ -1104,6 +1115,7 @@ extern void perf_event_itrace_started(struct perf_event *event);
 
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
 extern void perf_pmu_unregister(struct pmu *pmu);
+extern void perf_pmu_free(struct pmu *pmu);
 
 extern void __perf_event_task_sched_in(struct task_struct *prev,
 				       struct task_struct *task);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6395dbf67671..bf5b8fc8979e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5319,8 +5319,24 @@ static void perf_pending_task_sync(struct perf_event *event)
 	rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
 }
 
+static void pmu_module_put(struct pmu **ppmu)
+{
+	struct pmu *pmu = *ppmu;
+	struct module *module = pmu->module;
+
+	if (pmu->put)
+		pmu->put(pmu);
+
+	module_put(module);
+
+	/* Can't touch pmu anymore/ */
+	*ppmu = NULL;
+}
+
 static void _free_event(struct perf_event *event)
 {
+	struct module *module;
+
 	irq_work_sync(&event->pending_irq);
 	irq_work_sync(&event->pending_disable_irq);
 	perf_pending_task_sync(event);
@@ -5374,7 +5390,8 @@ static void _free_event(struct perf_event *event)
 		put_ctx(event->ctx);
 
 	exclusive_event_destroy(event);
-	module_put(event->pmu->module);
+
+	pmu_module_put(&event->pmu);
 
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
@@ -11512,10 +11529,12 @@ static int perf_event_idx_default(struct perf_event *event)
 	return 0;
 }
 
-static void free_pmu_context(struct pmu *pmu)
+void perf_pmu_free(struct pmu *pmu)
 {
 	free_percpu(pmu->cpu_pmu_context);
+	free_percpu(pmu->pmu_disable_count);
 }
+EXPORT_SYMBOL_GPL(perf_pmu_free);
 
 /*
  * Let userspace know that this PMU supports address range filtering:
@@ -11749,6 +11768,11 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		goto free_pdc;
 	}
 
+	if (WARN_ONCE((!!pmu->get) ^ (!!pmu->put), "Can not register a pmu with only get or put defined.\n")) {
+		ret = -EINVAL;
+		goto free_pdc;
+	}
+
 	pmu->name = name;
 
 	if (type >= 0)
@@ -11855,8 +11879,8 @@ void perf_pmu_unregister(struct pmu *pmu)
 
 	mutex_unlock(&pmus_lock);
 
-	free_percpu(pmu->pmu_disable_count);
-	free_pmu_context(pmu);
+	if (!pmu->put)
+		perf_pmu_free(pmu);
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
@@ -11890,6 +11914,9 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 		BUG_ON(!ctx);
 	}
 
+	if (pmu->get)
+		pmu->get(pmu);
+
 	event->pmu = pmu;
 	ret = pmu->event_init(event);
 
@@ -11926,7 +11953,7 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
 	}
 
 	if (ret)
-		module_put(pmu->module);
+		pmu_module_put(&pmu);
 
 	return ret;
 }
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] perf/dummy_pmu: Tie pmu to device lifecycle
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
                   ` (2 preceding siblings ...)
  2024-10-08 18:34 ` [PATCH 3/5] perf: Add pmu get/put Lucas De Marchi
@ 2024-10-08 18:35 ` Lucas De Marchi
  2024-10-08 18:35 ` [PATCH 5/5] perf/dummy_pmu: Track and disable active events Lucas De Marchi
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-08 18:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin,
	Lucas De Marchi

Allow to unregister the PMU from perf with active events. When driver is
being accessed perf keeps a reference that when released triggers the
device cleanup.

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 kernel/events/dummy_pmu.c | 56 ++++++++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/kernel/events/dummy_pmu.c b/kernel/events/dummy_pmu.c
index cdba3a831e4a..c07e111bff01 100644
--- a/kernel/events/dummy_pmu.c
+++ b/kernel/events/dummy_pmu.c
@@ -49,6 +49,11 @@ static struct dummy_device *pmu_to_device(struct dummy_pmu *pmu)
 	return container_of(pmu, struct dummy_device, pmu);
 }
 
+static struct dummy_pmu *pmu_to_dummy(struct pmu *pmu)
+{
+	return container_of(pmu, struct dummy_pmu, base);
+}
+
 static ssize_t dummy_pmu_events_sysfs_show(struct device *dev,
 					   struct device_attribute *attr,
 					   char *page)
@@ -92,18 +97,9 @@ static const struct attribute_group *attr_groups[] = {
 	NULL,
 };
 
-static void dummy_pmu_event_destroy(struct perf_event *event)
-{
-	struct dummy_pmu *pmu = event_to_pmu(event);
-	struct dummy_device *d = pmu_to_device(pmu);
-
-	kref_put(&d->refcount, device_release);
-}
-
 static int dummy_pmu_event_init(struct perf_event *event)
 {
 	struct dummy_pmu *pmu = event_to_pmu(event);
-	struct dummy_device *d = pmu_to_device(pmu);
 
 	if (!pmu->registered)
 		return -ENODEV;
@@ -121,10 +117,6 @@ static int dummy_pmu_event_init(struct perf_event *event)
 	if (event->cpu < 0)
 		return -EINVAL;
 
-	/* Event keeps a ref to maintain PMU allocated, even if it's unregistered */
-	kref_get(&d->refcount);
-	event->destroy = dummy_pmu_event_destroy;
-
 	return 0;
 }
 
@@ -195,10 +187,29 @@ static void dummy_pmu_event_del(struct perf_event *event, int flags)
 	dummy_pmu_event_stop(event, PERF_EF_UPDATE);
 }
 
+static struct pmu *dummy_pmu_get(struct pmu *pmu)
+{
+	struct dummy_device *d = pmu_to_device(pmu_to_dummy(pmu));
+
+	kref_get(&d->refcount);
+
+	return pmu;
+}
+
+static void dummy_pmu_put(struct pmu *pmu)
+{
+	struct dummy_device *d = pmu_to_device(pmu_to_dummy(pmu));
+
+	kref_put(&d->refcount, device_release);
+}
+
 static int device_init(struct dummy_device *d)
 {
 	int ret;
 
+	if (WARN_ONCE(d->pmu.name, "Cannot re-register pmu.\n"))
+		return -EINVAL;
+
 	d->pmu.base = (struct pmu){
 		.attr_groups	= attr_groups,
 		.module		= THIS_MODULE,
@@ -209,6 +220,8 @@ static int device_init(struct dummy_device *d)
 		.start		= dummy_pmu_event_start,
 		.stop		= dummy_pmu_event_stop,
 		.read		= dummy_pmu_event_read,
+		.get		= dummy_pmu_get,
+		.put		= dummy_pmu_put,
 	};
 
 	d->pmu.name = kasprintf(GFP_KERNEL, "dummy_pmu_%u", d->instance);
@@ -217,12 +230,22 @@ static int device_init(struct dummy_device *d)
 
 	ret = perf_pmu_register(&d->pmu.base, d->pmu.name, -1);
 	if (ret)
-		return ret;
+		goto fail;
 
 	d->pmu.registered = true;
 	pr_info("Device registered: %s\n", d->pmu.name);
 
 	return 0;
+
+fail:
+	/*
+	 * See device_release: if name is non-NULL, dummy_pmu was registered
+	 * with perf and needs cleanup
+	 */
+	kfree(d->pmu.name);
+	d->pmu.name = NULL;
+
+	return ret;
 }
 
 static void device_exit(struct dummy_device *d)
@@ -237,7 +260,10 @@ static void device_release(struct kref *ref)
 {
 	struct dummy_device *d = container_of(ref, struct dummy_device, refcount);
 
-	kfree(d->pmu.name);
+	if (d->pmu.name) {
+		perf_pmu_free(&d->pmu.base);
+		kfree(d->pmu.name);
+	}
 	kfree(d);
 }
 
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] perf/dummy_pmu: Track and disable active events
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
                   ` (3 preceding siblings ...)
  2024-10-08 18:35 ` [PATCH 4/5] perf/dummy_pmu: Tie pmu to device lifecycle Lucas De Marchi
@ 2024-10-08 18:35 ` Lucas De Marchi
  2024-10-11 22:21 ` [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Umesh Nerlige Ramappa
  2024-10-11 22:59 ` Lucas De Marchi
  6 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-08 18:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin,
	Lucas De Marchi

When unregistering the PMU, disable the currently active events. This
allows userspace to see the change and possibly reacting on it, like
reopening the fd. With perf-stat, "<not counted>" starts to be printed:

	$ stat -e dummy_pmu_0/test-event-1/ -I1000
	1.001227905                 12      dummy_pmu_0/test-event-1/
	2.004009349                 11      dummy_pmu_0/test-event-1/
	3.005785067                  0      dummy_pmu_0/test-event-1/
	4.008565935      <not counted>      dummy_pmu_0/test-event-1/
	5.010446891      <not counted>      dummy_pmu_0/test-event-1/

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 kernel/events/dummy_pmu.c | 40 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/kernel/events/dummy_pmu.c b/kernel/events/dummy_pmu.c
index c07e111bff01..496cb8469a05 100644
--- a/kernel/events/dummy_pmu.c
+++ b/kernel/events/dummy_pmu.c
@@ -14,6 +14,7 @@
 #include <linux/random.h>
 #include <linux/seq_file.h>
 #include <linux/types.h>
+#include <linux/xarray.h>
 
 struct dummy_mod {
 	struct dentry *debugfs_root;
@@ -25,6 +26,7 @@ struct dummy_mod {
 struct dummy_pmu {
 	struct pmu base;
 	char *name;
+	struct xarray active_events;
 	bool registered;
 };
 
@@ -97,9 +99,25 @@ static const struct attribute_group *attr_groups[] = {
 	NULL,
 };
 
+static void dummy_pmu_event_destroy(struct perf_event *event)
+{
+	struct dummy_pmu *pmu = event_to_pmu(event);
+	unsigned long idx;
+	struct perf_event *e;
+
+	/* Event not active anymore */
+	xa_for_each(&pmu->active_events, idx, e)
+		if (e == event) {
+			xa_erase(&pmu->active_events, idx);
+			break;
+		}
+}
+
 static int dummy_pmu_event_init(struct perf_event *event)
 {
 	struct dummy_pmu *pmu = event_to_pmu(event);
+	u32 event_id;
+	int ret;
 
 	if (!pmu->registered)
 		return -ENODEV;
@@ -117,6 +135,13 @@ static int dummy_pmu_event_init(struct perf_event *event)
 	if (event->cpu < 0)
 		return -EINVAL;
 
+	ret = xa_alloc(&pmu->active_events, &event_id, event,
+			xa_limit_32b, GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	event->destroy = dummy_pmu_event_destroy;
+
 	return 0;
 }
 
@@ -232,6 +257,8 @@ static int device_init(struct dummy_device *d)
 	if (ret)
 		goto fail;
 
+	xa_init_flags(&d->pmu.active_events, XA_FLAGS_ALLOC);
+
 	d->pmu.registered = true;
 	pr_info("Device registered: %s\n", d->pmu.name);
 
@@ -248,9 +275,22 @@ static int device_init(struct dummy_device *d)
 	return ret;
 }
 
+static void disable_active_events(struct dummy_pmu *pmu)
+{
+	struct perf_event *event;
+	unsigned long idx;
+
+	xa_for_each(&pmu->active_events, idx, event) {
+		xa_erase(&pmu->active_events, idx);
+		perf_event_disable(event);
+	}
+}
+
 static void device_exit(struct dummy_device *d)
 {
 	d->pmu.registered = false;
+
+	disable_active_events(&d->pmu);
 	perf_pmu_unregister(&d->pmu.base);
 
 	pr_info("Device released: %s\n", d->pmu.name);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-08 18:34 ` [PATCH 3/5] perf: Add pmu get/put Lucas De Marchi
@ 2024-10-09 10:44   ` kernel test robot
  2024-10-14 17:32   ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2024-10-09 10:44 UTC (permalink / raw)
  To: Lucas De Marchi, linux-kernel
  Cc: llvm, oe-kbuild-all, dri-devel, Ingo Molnar,
	Arnaldo Carvalho de Melo, Umesh Nerlige Ramappa, Ian Rogers,
	Tvrtko Ursulin, Lucas De Marchi

Hi Lucas,

kernel test robot noticed the following build warnings:

[auto build test WARNING on perf-tools-next/perf-tools-next]
[cannot apply to tip/perf/core perf-tools/perf-tools acme/perf/core linus/master v6.12-rc2 next-20241008]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lucas-De-Marchi/perf-Add-dummy-pmu-module/20241009-023728
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20241008183501.1354695-4-lucas.demarchi%40intel.com
patch subject: [PATCH 3/5] perf: Add pmu get/put
config: x86_64-kexec (https://download.01.org/0day-ci/archive/20241009/202410091848.aRUoRWWD-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241009/202410091848.aRUoRWWD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410091848.aRUoRWWD-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> kernel/events/core.c:5235:17: warning: unused variable 'module' [-Wunused-variable]
    5235 |         struct module *module;
         |                        ^~~~~~
   1 warning generated.


vim +/module +5235 kernel/events/core.c

  5232	
  5233	static void _free_event(struct perf_event *event)
  5234	{
> 5235		struct module *module;
  5236	
  5237		irq_work_sync(&event->pending_irq);
  5238		irq_work_sync(&event->pending_disable_irq);
  5239		perf_pending_task_sync(event);
  5240	
  5241		unaccount_event(event);
  5242	
  5243		security_perf_event_free(event);
  5244	
  5245		if (event->rb) {
  5246			/*
  5247			 * Can happen when we close an event with re-directed output.
  5248			 *
  5249			 * Since we have a 0 refcount, perf_mmap_close() will skip
  5250			 * over us; possibly making our ring_buffer_put() the last.
  5251			 */
  5252			mutex_lock(&event->mmap_mutex);
  5253			ring_buffer_attach(event, NULL);
  5254			mutex_unlock(&event->mmap_mutex);
  5255		}
  5256	
  5257		if (is_cgroup_event(event))
  5258			perf_detach_cgroup(event);
  5259	
  5260		if (!event->parent) {
  5261			if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
  5262				put_callchain_buffers();
  5263		}
  5264	
  5265		perf_event_free_bpf_prog(event);
  5266		perf_addr_filters_splice(event, NULL);
  5267		kfree(event->addr_filter_ranges);
  5268	
  5269		if (event->destroy)
  5270			event->destroy(event);
  5271	
  5272		/*
  5273		 * Must be after ->destroy(), due to uprobe_perf_close() using
  5274		 * hw.target.
  5275		 */
  5276		if (event->hw.target)
  5277			put_task_struct(event->hw.target);
  5278	
  5279		if (event->pmu_ctx)
  5280			put_pmu_ctx(event->pmu_ctx);
  5281	
  5282		/*
  5283		 * perf_event_free_task() relies on put_ctx() being 'last', in particular
  5284		 * all task references must be cleaned up.
  5285		 */
  5286		if (event->ctx)
  5287			put_ctx(event->ctx);
  5288	
  5289		exclusive_event_destroy(event);
  5290	
  5291		pmu_module_put(&event->pmu);
  5292	
  5293		call_rcu(&event->rcu_head, free_event_rcu);
  5294	}
  5295	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
                   ` (4 preceding siblings ...)
  2024-10-08 18:35 ` [PATCH 5/5] perf/dummy_pmu: Track and disable active events Lucas De Marchi
@ 2024-10-11 22:21 ` Umesh Nerlige Ramappa
  2024-10-11 23:03   ` Lucas De Marchi
  2024-10-11 22:59 ` Lucas De Marchi
  6 siblings, 1 reply; 24+ messages in thread
From: Umesh Nerlige Ramappa @ 2024-10-11 22:21 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Ian Rogers, Tvrtko Ursulin

On Tue, Oct 08, 2024 at 01:34:56PM -0500, Lucas De Marchi wrote:
>v2 of my attempt at fixing how i915 interacts with perf events.
>
>v1 - https://lore.kernel.org/all/20240722210648.80892-1-lucas.demarchi@intel.com/
>
>From other people:
>1) https://lore.kernel.org/lkml/20240115170120.662220-1-tvrtko.ursulin@linux.intel.com/
>2) https://lore.kernel.org/all/20240213180302.47266-1-umesh.nerlige.ramappa@intel.com/
>
>WARNING: patches 1, 4 and 5 are NOT intended to be applied as is. More
>on this below.
>
>This series basically builds on the idea of the first patch of my
>previous series, but extends it in a way that
>
>	a) the other patches are not needed  (at least, not as is) and
>	b) driver can rebind just fine - no issues with the new call to
>	   perf_pmu_register()

I have 2 broad questions:

(1) I am curious how (b) works. You seem to have a notion of instances of 
devices and then are you using the instance number to create the name 
used for the sysfs entry? Did I get that right?

If so, should the application discover what the name is each time it is 
run? In the failure case that I am seeing, I am running an application 
that does not work when I rename the sysfs entry to something else.

(2) Similar to Patch 1 of your v1 series where you modified _free_event:

static void _free_event(struct perf_event *event)
{
	struct module *module;
...
	module = event->pmu->module;
...
	if (event->destroy)
		event->destroy(event);
...
	module_put(module);
...
}

With the above code, the kref to i915->pmu can be taken from the below 
points in i915 code (just to point out the sequence):

i915_pmu_register()
{
	perf_pmu_register()
	drm_dev_get()
	kref_init()
}

i915_pmu_unregister()
{
	kref_put(&ref, pmu_cleanup)
} 

i915_pmu_event_init()
{
	kref_get()
}

i915_pmu_event_destroy()
{
	kref_put(&ref, pmu_cleanup)
}

void pmu_cleanup(struct kref *ref)
{
	i915_pmu_unregister_cpuhp_state(pmu);
	perf_pmu_unregister(&pmu->base);
	pmu->base.event_init = NULL;
	kfree(pmu->base.attr_groups);
	if (!is_igp(i915))
		kfree(pmu->name);
	free_event_attributes(pmu);
	drm_dev_put(&i915->drm);
}

Would this work? Do you see any gaps that may need the ref counting code 
you added in perf?

Thanks,
Umesh

>
>Another difference is that rather than mixing i915 cleanups this just
>adds a dummy pmu with no backing HW. Intention for dummy_pmu is for
>experimental purpose and to demonstrate changes tha need to be applied
>to i915 (and probably amdgpu, and also in the pending xe patch).
>If desired to have an example like that in tree, then we should hide it
>behind a config option and I'd need to re-check the error handling.
>
>With this set I could run the following test script multiple times with
>no issues observed:
>
>	#!/bin/bash
>
>	set -e
>
>	rand_sleep() {
>		sleep $(bc -l  <<< "$(shuf -i 0-3000 -n 1) / 1000")
>	}
>
>	test_noperf() {
>		echo 0 > /sys/kernel/debug/dummy_pmu/bind
>
>		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
>	}
>
>	test_perf_before() {
>		echo 0 > /sys/kernel/debug/dummy_pmu/bind
>
>		perf stat --interval-count 2 -e dummy_pmu_0/test-event-1/ -I500
>		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
>	}
>
>	test_kill_perf_later() {
>		echo 0 > /sys/kernel/debug/dummy_pmu/bind
>
>		perf stat -e dummy_pmu_0/test-event-1/ -I500 &
>		pid=$!
>		rand_sleep
>		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
>		rand_sleep
>		kill $pid
>	}
>
>	test_kill_perf_laaaaaaater() {
>		echo 0 > /sys/kernel/debug/dummy_pmu/bind
>
>		perf stat -e dummy_pmu_0/test-event-1/ -I500 &
>		pid=$!
>		rand_sleep
>		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
>		rand_sleep
>		rand_sleep
>		rand_sleep
>		kill $pid
>	}
>
>	test_kill_perf_with_leader() {
>		echo 0 > /sys/kernel/debug/dummy_pmu/bind
>
>		perf stat -e '{dummy_pmu_0/test-event-1/,dummy_pmu_0/test-event-2/}:S' -I500 &
>		pid=$!
>		rand_sleep
>		echo 0 > /sys/kernel/debug/dummy_pmu/unbind
>		rand_sleep
>		rand_sleep
>		kill $pid
>	}
>
>	N=${1:-1}
>
>	for ((i=0; i<N; i++)); do
>		printf "%04u/%04u\n" "$((i+1))" "$N" >&2
>		test_noperf
>		test_perf_before
>		test_kill_perf_later
>		test_kill_perf_laaaaaaater
>		test_kill_perf_with_leader
>		echo >&2
>	done
>
>Last patch is optional for a driver and not needed for the fix.
>
>Open topics:
>
>	- If something like the last patch is desirable, should it be
>	  done from inside perf_pmu_unregister()?
>
>	- Should we have a dummy_pmu (or whatever name) behind a config,
>	  or maybe in Documentation/ ?
>
>thanks,
>Lucas De Marchi
>
>Lucas De Marchi (5):
>  perf: Add dummy pmu module
>  perf: Move free outside of the mutex
>  perf: Add pmu get/put
>  perf/dummy_pmu: Tie pmu to device lifecycle
>  perf/dummy_pmu: Track and disable active events
>
> include/linux/perf_event.h |  12 +
> kernel/events/Makefile     |   1 +
> kernel/events/core.c       |  39 ++-
> kernel/events/dummy_pmu.c  | 492 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 539 insertions(+), 5 deletions(-)
> create mode 100644 kernel/events/dummy_pmu.c
>
>-- 
>2.46.2
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind
  2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
                   ` (5 preceding siblings ...)
  2024-10-11 22:21 ` [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Umesh Nerlige Ramappa
@ 2024-10-11 22:59 ` Lucas De Marchi
  6 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-11 22:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Tue, Oct 08, 2024 at 01:34:56PM -0500, Lucas De Marchi wrote:
>v2 of my attempt at fixing how i915 interacts with perf events.
>
>v1 - https://lore.kernel.org/all/20240722210648.80892-1-lucas.demarchi@intel.com/
>
>From other people:
>1) https://lore.kernel.org/lkml/20240115170120.662220-1-tvrtko.ursulin@linux.intel.com/
>2) https://lore.kernel.org/all/20240213180302.47266-1-umesh.nerlige.ramappa@intel.com/
>
>WARNING: patches 1, 4 and 5 are NOT intended to be applied as is. More
>on this below.


I also took the patches 2 and 3, that are the ones needed, and applied
the i915 changes on top. I sent only to i915 mailing list since I didn't
want to pollute the mailing list with resubmissions of the same patches
over and over.

These fixes also worked for i915. See
https://lore.kernel.org/intel-gfx/20241011225430.1219345-1-lucas.demarchi@intel.com/
if interested.

Thanks
Lucas De Marchi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind
  2024-10-11 22:21 ` [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Umesh Nerlige Ramappa
@ 2024-10-11 23:03   ` Lucas De Marchi
  0 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-11 23:03 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa
  Cc: linux-kernel, dri-devel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Ian Rogers, Tvrtko Ursulin

On Fri, Oct 11, 2024 at 03:21:18PM -0700, Umesh Nerlige Ramappa wrote:
>On Tue, Oct 08, 2024 at 01:34:56PM -0500, Lucas De Marchi wrote:
>>v2 of my attempt at fixing how i915 interacts with perf events.
>>
>>v1 - https://lore.kernel.org/all/20240722210648.80892-1-lucas.demarchi@intel.com/
>>
>>From other people:
>>1) https://lore.kernel.org/lkml/20240115170120.662220-1-tvrtko.ursulin@linux.intel.com/
>>2) https://lore.kernel.org/all/20240213180302.47266-1-umesh.nerlige.ramappa@intel.com/
>>
>>WARNING: patches 1, 4 and 5 are NOT intended to be applied as is. More
>>on this below.
>>
>>This series basically builds on the idea of the first patch of my
>>previous series, but extends it in a way that
>>
>>	a) the other patches are not needed  (at least, not as is) and
>>	b) driver can rebind just fine - no issues with the new call to
>>	   perf_pmu_register()
>
>I have 2 broad questions:
>
>(1) I am curious how (b) works. You seem to have a notion of instances 
>of devices and then are you using the instance number to create the 
>name used for the sysfs entry? Did I get that right?


humn... no. We just unregister the driver from pmu, so the name becomes
free for when the driver rebinds with the same event name.

>
>If so, should the application discover what the name is each time it 
>is run? In the failure case that I am seeing, I am running an 
>application that does not work when I rename the sysfs entry to 
>something else.
>
>(2) Similar to Patch 1 of your v1 series where you modified _free_event:
>
>static void _free_event(struct perf_event *event)
>{
>	struct module *module;
>...
>	module = event->pmu->module;
>...
>	if (event->destroy)
>		event->destroy(event);
>...
>	module_put(module);
>...
>}
>
>With the above code, the kref to i915->pmu can be taken from the below 
>points in i915 code (just to point out the sequence):
>
>i915_pmu_register()
>{
>	perf_pmu_register()
>	drm_dev_get()
>	kref_init()
>}
>
>i915_pmu_unregister()
>{
>	kref_put(&ref, pmu_cleanup)
>}
>
>i915_pmu_event_init()
>{
>	kref_get()
>}
>
>i915_pmu_event_destroy()
>{
>	kref_put(&ref, pmu_cleanup)
>}
>
>void pmu_cleanup(struct kref *ref)
>{
>	i915_pmu_unregister_cpuhp_state(pmu);
>	perf_pmu_unregister(&pmu->base);
>	pmu->base.event_init = NULL;
>	kfree(pmu->base.attr_groups);
>	if (!is_igp(i915))
>		kfree(pmu->name);
>	free_event_attributes(pmu);
>	drm_dev_put(&i915->drm);
>}
>
>Would this work? Do you see any gaps that may need the ref counting 
>code you added in perf?


well... I just posted the fixes for i915 on top of these patches :)
You may want to look at https://lore.kernel.org/intel-gfx/20241011225430.1219345-1-lucas.demarchi@intel.com/
It works for me on my machine with a DG2.

Lucas De Marchi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-08 18:34 ` [PATCH 3/5] perf: Add pmu get/put Lucas De Marchi
  2024-10-09 10:44   ` kernel test robot
@ 2024-10-14 17:32   ` Peter Zijlstra
  2024-10-14 18:20     ` Lucas De Marchi
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2024-10-14 17:32 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Tue, Oct 08, 2024 at 01:34:59PM -0500, Lucas De Marchi wrote:
> If a pmu is unregistered while there's an active event, perf will still
> access the pmu via event->pmu, even after the event is destroyed. This
> makes it difficult for drivers like i915 that can be unbound from the
> HW.
> 
> 	BUG: KASAN: use-after-free in exclusive_event_destroy+0xd8/0xf0
> 	Read of size 4 at addr ffff88816e2bb63c by task perf/7748
> 
> i915 tries to cope with it by installing a event->destroy, but that is
> not sufficient: if pmu is released by the driver, it will still crash
> since event->pmu is still used.
> 
> Moreover, even with that use-after-free fixed by adjusting the order in
> _free_event() or delaying the free by the driver, kernel still oops when
> closing the event fd related to a unregistered pmu: the percpu variables
> allocated on perf_pmu_register() would already be released. One such
> crash is:
> 
> 	BUG: KASAN: user-memory-access in _raw_spin_lock_irqsave+0x88/0x100
> 	Write of size 4 at addr 00000000ffffffff by task perf/727
> 
> 	CPU: 0 UID: 0 PID: 727 Comm: perf Not tainted 6.12.0-rc1-DEMARCHI-dxnf+ #9
> 	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
> 	Call Trace:
> 	 <TASK>
> 	 dump_stack_lvl+0x5f/0x90
> 	 print_report+0x4d3/0x50a
> 	 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> 	 ? kasan_addr_to_slab+0xd/0xb0
> 	 kasan_report+0xe2/0x170
> 	 ? _raw_spin_lock_irqsave+0x88/0x100
> 	 ? _raw_spin_lock_irqsave+0x88/0x100
> 	 kasan_check_range+0x125/0x230
> 	 __kasan_check_write+0x14/0x30
> 	 _raw_spin_lock_irqsave+0x88/0x100
> 	 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
> 	 _atomic_dec_and_raw_lock_irqsave+0x89/0x110
> 	 ? __kasan_check_write+0x14/0x30
> 	 put_pmu_ctx+0x98/0x330
> 
> The fix here is to provide a set of get/put hooks that drivers can
> implement to piggy back the perf's pmu lifecycle to the driver's
> instance lifecycle.  With this, perf_pmu_unregister() can be called by
> the driver, which is then responsible for freeing the resources.

I'm confused.. probably because I still don't have any clue about
drivers and the above isn't really telling me much either.

I don't see how you get rid of the try_module_get() we do per event;
without that you can't unload the module.

And I don't see how you think it is safe to free a pmu while there are
still events around.

Nor do I really see what these new get/put methods do. I see you call
->put() where we do module_put(), and ->get() near try_module_get(), but
how is that helping?



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-14 17:32   ` Peter Zijlstra
@ 2024-10-14 18:20     ` Lucas De Marchi
  2024-10-14 19:25       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-14 18:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Mon, Oct 14, 2024 at 07:32:46PM +0200, Peter Zijlstra wrote:
>On Tue, Oct 08, 2024 at 01:34:59PM -0500, Lucas De Marchi wrote:
>> If a pmu is unregistered while there's an active event, perf will still
>> access the pmu via event->pmu, even after the event is destroyed. This
>> makes it difficult for drivers like i915 that can be unbound from the
>> HW.
>>
>> 	BUG: KASAN: use-after-free in exclusive_event_destroy+0xd8/0xf0
>> 	Read of size 4 at addr ffff88816e2bb63c by task perf/7748
>>
>> i915 tries to cope with it by installing a event->destroy, but that is
>> not sufficient: if pmu is released by the driver, it will still crash
>> since event->pmu is still used.
>>
>> Moreover, even with that use-after-free fixed by adjusting the order in
>> _free_event() or delaying the free by the driver, kernel still oops when
>> closing the event fd related to a unregistered pmu: the percpu variables
>> allocated on perf_pmu_register() would already be released. One such
>> crash is:
>>
>> 	BUG: KASAN: user-memory-access in _raw_spin_lock_irqsave+0x88/0x100
>> 	Write of size 4 at addr 00000000ffffffff by task perf/727
>>
>> 	CPU: 0 UID: 0 PID: 727 Comm: perf Not tainted 6.12.0-rc1-DEMARCHI-dxnf+ #9
>> 	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
>> 	Call Trace:
>> 	 <TASK>
>> 	 dump_stack_lvl+0x5f/0x90
>> 	 print_report+0x4d3/0x50a
>> 	 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
>> 	 ? kasan_addr_to_slab+0xd/0xb0
>> 	 kasan_report+0xe2/0x170
>> 	 ? _raw_spin_lock_irqsave+0x88/0x100
>> 	 ? _raw_spin_lock_irqsave+0x88/0x100
>> 	 kasan_check_range+0x125/0x230
>> 	 __kasan_check_write+0x14/0x30
>> 	 _raw_spin_lock_irqsave+0x88/0x100
>> 	 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
>> 	 _atomic_dec_and_raw_lock_irqsave+0x89/0x110
>> 	 ? __kasan_check_write+0x14/0x30
>> 	 put_pmu_ctx+0x98/0x330
>>
>> The fix here is to provide a set of get/put hooks that drivers can
>> implement to piggy back the perf's pmu lifecycle to the driver's
>> instance lifecycle.  With this, perf_pmu_unregister() can be called by
>> the driver, which is then responsible for freeing the resources.
>
>I'm confused.. probably because I still don't have any clue about
>drivers and the above isn't really telling me much either.
>
>I don't see how you get rid of the try_module_get() we do per event;
>without that you can't unload the module.

I don't get rid of the try_module_get(). They serve diffeerent purposes.
Having a reference to the module prevents the _module_ going away (and
hence the function pointers we call into from perf). It doesn't prevent
the module unbinding from the HW.  A module may have N instances if it's
bound to N devices.

This can be done today to unbind the HW (integrated graphics) from the
i915 module:

	# echo 0000:00:02.0 > /sys/bus/pci/drivers/i915/unbind

The ref taken by these new get()/put() are related to preventing the
data going away - the driver can use that to take a ref on something
that will survive the unbind.

>
>And I don't see how you think it is safe to free a pmu while there are
>still events around.

so, we don't actually free it - the pmu is unregistered but the
`struct pmu` and (possibly) its container are still around after unregister.
When the get/put are used, the driver can keep the data around, which is
then free'd when the last reference is put.

>
>Nor do I really see what these new get/put methods do. I see you call
>->put() where we do module_put(), and ->get() near try_module_get(), but
>how is that helping?

Maybe the specific patches for i915 can help? Patch series:
https://lore.kernel.org/intel-gfx/20241011225430.1219345-1-lucas.demarchi@intel.com/

Important patches here are patches 2 and 3:

- Subject: [PATCH 2/8] drm/i915/pmu: Let resource survive unbind

   Allow the final kfree() to happen at a different time, not
   necessarily together with the call to perf_pmu_unregister().
   Here it uses drmm_add_action() to easily tie on the last drm ref going
   away.

- Subject: [PATCH 3/8] drm/i915/pmu: Fix crash due to use-after-free

   This implements the get()/put() so we get/put a reference to the drm
   dev.

These 2 patches for i915 are the equivalent of patch 4 in this series
for the dummy_pmu.

Lucas De Marchi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-14 18:20     ` Lucas De Marchi
@ 2024-10-14 19:25       ` Peter Zijlstra
  2024-10-14 20:12         ` Lucas De Marchi
  2024-10-16 12:03         ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2024-10-14 19:25 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Mon, Oct 14, 2024 at 01:20:34PM -0500, Lucas De Marchi wrote:
> On Mon, Oct 14, 2024 at 07:32:46PM +0200, Peter Zijlstra wrote:

> > I'm confused.. probably because I still don't have any clue about
> > drivers and the above isn't really telling me much either.
> > 
> > I don't see how you get rid of the try_module_get() we do per event;
> > without that you can't unload the module.
> 
> I don't get rid of the try_module_get(). They serve diffeerent purposes.
> Having a reference to the module prevents the _module_ going away (and
> hence the function pointers we call into from perf). It doesn't prevent
> the module unbinding from the HW.  A module may have N instances if it's
> bound to N devices.
> 
> This can be done today to unbind the HW (integrated graphics) from the
> i915 module:
> 
> 	# echo 0000:00:02.0 > /sys/bus/pci/drivers/i915/unbind
> 
> The ref taken by these new get()/put() are related to preventing the
> data going away - the driver can use that to take a ref on something
> that will survive the unbind.

OK, for some reason I thought to remember that you wanted to be able to
unload the module too.

> > And I don't see how you think it is safe to free a pmu while there are
> > still events around.
> 
> so, we don't actually free it - the pmu is unregistered but the
> `struct pmu` and (possibly) its container are still around after unregister.
> When the get/put are used, the driver can keep the data around, which is
> then free'd when the last reference is put.

Aaaaah, okay. So the implementation knows to nop out all device
interaction when it gets unbound, but the events and pmu data stick
around until they're naturally killed off?

Ah, reading the below patches that is indeed what i915 does. pmu->closed
makes this so.

The dummy thing you posted in this thread, does perf_event_disable() on
all previously created events, and this is not sound. Userspace can do
PERF_EVENT_IOC_ENABLE on them and then things will go side-ways fast.
And I was afraid i915 was doing this same.

> - Subject: [PATCH 3/8] drm/i915/pmu: Fix crash due to use-after-free

So reading that Changelog, you would like a replacement for pmu->closed
as well.

I suppose, one way to go about doing this is to have
perf_pmu_unregister() replace a bunch of methods. Notably you have
pmu->closed in:

  - event_init()
  - read()
  - start()
  - stop()
  - add()

Having perf_pmu_unregister() overwrite those function pointers with
something akin to your pmu->closed would go a long way, right? It would
require using READ_ONCE() for calling the methods, which would make
things a little ugly :/

But I also think we want to force all the events into STATE_ERROR, and
I'm not immediately sure how best to go about doing that. Adding better
return value handling to ->add() is trivial enough, and perhaps calling
perf_pmu_resched() is sufficient to cycle everything.

Let me ponder that a little bit.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-14 19:25       ` Peter Zijlstra
@ 2024-10-14 20:12         ` Lucas De Marchi
  2024-10-16 12:03         ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-14 20:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Mon, Oct 14, 2024 at 09:25:19PM +0200, Peter Zijlstra wrote:
>On Mon, Oct 14, 2024 at 01:20:34PM -0500, Lucas De Marchi wrote:
>> On Mon, Oct 14, 2024 at 07:32:46PM +0200, Peter Zijlstra wrote:
>
>> > I'm confused.. probably because I still don't have any clue about
>> > drivers and the above isn't really telling me much either.
>> >
>> > I don't see how you get rid of the try_module_get() we do per event;
>> > without that you can't unload the module.
>>
>> I don't get rid of the try_module_get(). They serve diffeerent purposes.
>> Having a reference to the module prevents the _module_ going away (and
>> hence the function pointers we call into from perf). It doesn't prevent
>> the module unbinding from the HW.  A module may have N instances if it's
>> bound to N devices.
>>
>> This can be done today to unbind the HW (integrated graphics) from the
>> i915 module:
>>
>> 	# echo 0000:00:02.0 > /sys/bus/pci/drivers/i915/unbind
>>
>> The ref taken by these new get()/put() are related to preventing the
>> data going away - the driver can use that to take a ref on something
>> that will survive the unbind.
>
>OK, for some reason I thought to remember that you wanted to be able to
>unload the module too.

humn... maybe because in the first version we were talking about
shortcutting all the function calls. If made by the driver, it'd allow
to remove the ugly `if (pmu->closed)`, and if made by perf itself, it
would  allow to drop the module ref since we would guarantee we are not
calling into the module anymore.

But I think that's orthogonal to what we really care about: once the HW
vanishes underneath us, stop accessing it and unregister the PMU in a
way that if the HW shows up again we can still register it and nothing
explodes.

>
>> > And I don't see how you think it is safe to free a pmu while there are
>> > still events around.
>>
>> so, we don't actually free it - the pmu is unregistered but the
>> `struct pmu` and (possibly) its container are still around after unregister.
>> When the get/put are used, the driver can keep the data around, which is
>> then free'd when the last reference is put.
>
>Aaaaah, okay. So the implementation knows to nop out all device
>interaction when it gets unbound, but the events and pmu data stick
>around until they're naturally killed off?
>
>Ah, reading the below patches that is indeed what i915 does. pmu->closed
>makes this so.

yes. Without these patches it doesn't work well though, particuarly
because

a) even if we protected the methods with pmu->closed(), the data is
freed when we call perf_pmu_unregister(), making the whole pmu pointer
invalid
b) kernel/core/events.c still accesss the pmu after calling
event->destroy() - we can't really hook on that to destroy the pmu like
is done today.

>
>The dummy thing you posted in this thread, does perf_event_disable() on

that's optional and I think we could live without. The main issue is
completely crashing the kernel if we do:

	# perf stat -e i915/rc6-residency/ -I1000 &
	# echo 0000:00:02.0 > /sys/bus/pci/drivers/i915/unbind

... which these patches are fixing regardless of calling
perf_event_disable().

>all previously created events, and this is not sound. Userspace can do
>PERF_EVENT_IOC_ENABLE on them and then things will go side-ways fast.
>And I was afraid i915 was doing this same.

For the i915 series, that would be "[PATCH 8/8] drm/i915/pmu: Release
open events when unregistering". See release_active_events()

I was just wanting a smoke signal/hint for userspace that "something
went wrong" with this counter.

>
>> - Subject: [PATCH 3/8] drm/i915/pmu: Fix crash due to use-after-free
>
>So reading that Changelog, you would like a replacement for pmu->closed
>as well.
>
>I suppose, one way to go about doing this is to have
>perf_pmu_unregister() replace a bunch of methods. Notably you have
>pmu->closed in:
>
>  - event_init()
>  - read()
>  - start()
>  - stop()
>  - add()
>
>Having perf_pmu_unregister() overwrite those function pointers with
>something akin to your pmu->closed would go a long way, right? It would

it would help if we want to unload the module, to make it work for
other drivers without having to add similar flag and to make those
drivers less ugly with those checks. However grepping the
kernel for calls to perf_pmu_register(), it seems there are only 3
candidates, all in drm: i915, amdgpu and xe (the xe is a pending patch
series waiting on the resolution of this issue).  There is
drivers/powercap/intel_rapl_common.c with its `bool registered` flag,
but that is basically doing multiple register/unregister to replace the pmu
rather than working with HW that can be removed.

>require using READ_ONCE() for calling the methods, which would make
>things a little ugly :/
>
>But I also think we want to force all the events into STATE_ERROR, and

yeah...  When I was looking for the smoke signal I mentioned, I wanted
something to move the event into that state, but couldn't find one. The
best I found was to disable it.

>I'm not immediately sure how best to go about doing that. Adding better
>return value handling to ->add() is trivial enough, and perhaps calling
>perf_pmu_resched() is sufficient to cycle everything.
>
>Let me ponder that a little bit.

thanks for the help

Lucas De Marchi

>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] perf: Add dummy pmu module
  2024-10-08 18:34 ` [PATCH 1/5] perf: Add dummy pmu module Lucas De Marchi
@ 2024-10-15  0:26   ` Jeff Johnson
  2024-10-15  4:23     ` Lucas De Marchi
  2024-10-16  8:51   ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Jeff Johnson @ 2024-10-15  0:26 UTC (permalink / raw)
  To: Lucas De Marchi, linux-kernel
  Cc: dri-devel, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On 10/8/24 11:34, Lucas De Marchi wrote:
...
> +module_init(dummy_init);
> +module_exit(dummy_exit);
> +
> +MODULE_AUTHOR("Lucas De Marchi <lucas.demarchi@intel.com>");
> +MODULE_LICENSE("GPL");

Since commit 1fffe7a34c89 ("script: modpost: emit a warning when the
description is missing"), a module without a MODULE_DESCRIPTION() will
result in a warning when built with make W=1. Recently, multiple
developers have been eradicating these warnings treewide, and very few
(if any) are left, so please don't introduce a new one :)

Please add the missing MODULE_DESCRIPTION()


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] perf: Add dummy pmu module
  2024-10-15  0:26   ` Jeff Johnson
@ 2024-10-15  4:23     ` Lucas De Marchi
  0 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-15  4:23 UTC (permalink / raw)
  To: Jeff Johnson
  Cc: linux-kernel, dri-devel, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Umesh Nerlige Ramappa, Ian Rogers,
	Tvrtko Ursulin

On Mon, Oct 14, 2024 at 05:26:19PM -0700, Jeff Johnson wrote:
>On 10/8/24 11:34, Lucas De Marchi wrote:
>...
>> +module_init(dummy_init);
>> +module_exit(dummy_exit);
>> +
>> +MODULE_AUTHOR("Lucas De Marchi <lucas.demarchi@intel.com>");
>> +MODULE_LICENSE("GPL");
>
>Since commit 1fffe7a34c89 ("script: modpost: emit a warning when the
>description is missing"), a module without a MODULE_DESCRIPTION() will
>result in a warning when built with make W=1. Recently, multiple
>developers have been eradicating these warnings treewide, and very few
>(if any) are left, so please don't introduce a new one :)
>
>Please add the missing MODULE_DESCRIPTION()

Thanks. This patch is not intended to be applied - it serves only to
demonstrate the issue solved by the other patches. But I will keep that
in mind if I re-submit it.

Lucas De Marchi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] perf: Add dummy pmu module
  2024-10-08 18:34 ` [PATCH 1/5] perf: Add dummy pmu module Lucas De Marchi
  2024-10-15  0:26   ` Jeff Johnson
@ 2024-10-16  8:51   ` Peter Zijlstra
  2024-10-18 19:30     ` Lucas De Marchi
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2024-10-16  8:51 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Tue, Oct 08, 2024 at 01:34:57PM -0500, Lucas De Marchi wrote:
> +static int parse_device(const char __user *ubuf, size_t size, u32 *instance)
> +{
> +	char buf[16];
> +	ssize_t len;
> +
> +	if (size > sizeof(buf) - 1)
> +		return -E2BIG;
> +
> +	len = strncpy_from_user(buf, ubuf, sizeof(buf));
> +	if (len < 0 || len >= sizeof(buf) - 1)
> +		return -E2BIG;
> +
> +	if (kstrtou32(buf, 0, instance))
> +		return -EINVAL;
> +
> +	return size;
> +}

I had to change this to:

+static int parse_device(const char __user *ubuf, size_t size, u32 *instance)
+{
+       int ret = kstrtouint_from_user(ubuf, size, 0, instance);
+       if (ret) {
+               printk("suckage: %d\n", ret);
+               return ret;
+       }
+       return size;
+}

because otherwise it didn't want to work for me; I kept getting garbage
at the end and things failing. Specifically, it looked like the string
presented by userspace was not '\0' terminated, ubuf was pointing to
"1...garbage..." and size was 1.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-14 19:25       ` Peter Zijlstra
  2024-10-14 20:12         ` Lucas De Marchi
@ 2024-10-16 12:03         ` Peter Zijlstra
  2024-10-18 19:46           ` Lucas De Marchi
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2024-10-16 12:03 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Mon, Oct 14, 2024 at 09:25:19PM +0200, Peter Zijlstra wrote:

> Let me ponder that a little bit.

So I did the thing on top of the get/put thing that would allow you to
get rid of the ->closed thing, and before I was finished I already hated
all of it :-(

The thing is, if you're going to the effort of adding get/put and
putting the responsibility on the implementation, then the ->closed
thing is only a little extra ask.

So... I wondered, how hard it would be for perf_pmu_unregister() to do
what you want.

Time passed, hacks were done...

and while what I have is still riddled with holes (more work is
definitely required), it does pass your dummy_pmu test scipt.

I'll poke at this a little longer. Afaict it should be possible to make
this good enough for what you want. Just needs more holes filled.

---
 include/linux/perf_event.h |  13 ++-
 kernel/events/core.c       | 260 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 228 insertions(+), 45 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..20995d33d27e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -318,6 +318,9 @@ struct perf_output_handle;
 struct pmu {
 	struct list_head		entry;
 
+	spinlock_t			events_lock;
+	struct list_head		events;
+
 	struct module			*module;
 	struct device			*dev;
 	struct device			*parent;
@@ -612,9 +615,10 @@ struct perf_addr_filter_range {
  * enum perf_event_state - the states of an event:
  */
 enum perf_event_state {
-	PERF_EVENT_STATE_DEAD		= -4,
-	PERF_EVENT_STATE_EXIT		= -3,
-	PERF_EVENT_STATE_ERROR		= -2,
+	PERF_EVENT_STATE_DEAD		= -5,
+	PERF_EVENT_STATE_REVOKED	= -4, /* pmu gone, must not touch */
+	PERF_EVENT_STATE_EXIT		= -3, /* task died, still inherit */
+	PERF_EVENT_STATE_ERROR		= -2, /* scheduling error, can enable */
 	PERF_EVENT_STATE_OFF		= -1,
 	PERF_EVENT_STATE_INACTIVE	=  0,
 	PERF_EVENT_STATE_ACTIVE		=  1,
@@ -853,6 +857,7 @@ struct perf_event {
 	void *security;
 #endif
 	struct list_head		sb_list;
+	struct list_head		pmu_list;
 
 	/*
 	 * Certain events gets forwarded to another pmu internally by over-
@@ -1103,7 +1108,7 @@ extern void perf_aux_output_flag(struct perf_output_handle *handle, u64 flags);
 extern void perf_event_itrace_started(struct perf_event *event);
 
 extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
-extern void perf_pmu_unregister(struct pmu *pmu);
+extern int perf_pmu_unregister(struct pmu *pmu);
 
 extern void __perf_event_task_sched_in(struct task_struct *prev,
 				       struct task_struct *task);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cdd09769e6c5..e66827367a97 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2406,7 +2406,9 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
 
 #define DETACH_GROUP	0x01UL
 #define DETACH_CHILD	0x02UL
-#define DETACH_DEAD	0x04UL
+#define DETACH_EXIT	0x04UL
+#define DETACH_REVOKE	0x08UL
+#define DETACH_DEAD	0x10UL
 
 /*
  * Cross CPU call to remove a performance event
@@ -2421,6 +2423,7 @@ __perf_remove_from_context(struct perf_event *event,
 			   void *info)
 {
 	struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
+	enum perf_event_state state = PERF_EVENT_STATE_OFF;
 	unsigned long flags = (unsigned long)info;
 
 	ctx_time_update(cpuctx, ctx);
@@ -2429,16 +2432,22 @@ __perf_remove_from_context(struct perf_event *event,
 	 * Ensure event_sched_out() switches to OFF, at the very least
 	 * this avoids raising perf_pending_task() at this time.
 	 */
-	if (flags & DETACH_DEAD)
+	if (flags & DETACH_EXIT)
+		state = PERF_EVENT_STATE_EXIT;
+	if (flags & DETACH_REVOKE)
+		state = PERF_EVENT_STATE_REVOKED;
+	if (flags & DETACH_DEAD) {
 		event->pending_disable = 1;
+		state = PERF_EVENT_STATE_DEAD;
+	}
 	event_sched_out(event, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	if (flags & DETACH_CHILD)
 		perf_child_detach(event);
 	list_del_event(event, ctx);
-	if (flags & DETACH_DEAD)
-		event->state = PERF_EVENT_STATE_DEAD;
+
+	event->state = state;
 
 	if (!pmu_ctx->nr_events) {
 		pmu_ctx->rotate_necessary = 0;
@@ -4508,7 +4517,8 @@ static void perf_event_enable_on_exec(struct perf_event_context *ctx)
 
 static void perf_remove_from_owner(struct perf_event *event);
 static void perf_event_exit_event(struct perf_event *event,
-				  struct perf_event_context *ctx);
+				  struct perf_event_context *ctx,
+				  bool revoke);
 
 /*
  * Removes all events from the current task that have been marked
@@ -4535,7 +4545,7 @@ static void perf_event_remove_on_exec(struct perf_event_context *ctx)
 
 		modified = true;
 
-		perf_event_exit_event(event, ctx);
+		perf_event_exit_event(event, ctx, false);
 	}
 
 	raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -5121,6 +5131,7 @@ static bool is_sb_event(struct perf_event *event)
 	    attr->context_switch || attr->text_poke ||
 	    attr->bpf_event)
 		return true;
+
 	return false;
 }
 
@@ -5321,6 +5332,8 @@ static void perf_pending_task_sync(struct perf_event *event)
 
 static void _free_event(struct perf_event *event)
 {
+	struct pmu *pmu = event->pmu;
+
 	irq_work_sync(&event->pending_irq);
 	irq_work_sync(&event->pending_disable_irq);
 	perf_pending_task_sync(event);
@@ -5330,6 +5343,7 @@ static void _free_event(struct perf_event *event)
 	security_perf_event_free(event);
 
 	if (event->rb) {
+		WARN_ON_ONCE(!pmu);
 		/*
 		 * Can happen when we close an event with re-directed output.
 		 *
@@ -5349,12 +5363,16 @@ static void _free_event(struct perf_event *event)
 			put_callchain_buffers();
 	}
 
-	perf_event_free_bpf_prog(event);
-	perf_addr_filters_splice(event, NULL);
-	kfree(event->addr_filter_ranges);
+	if (pmu) {
+		perf_event_free_bpf_prog(event);
+		perf_addr_filters_splice(event, NULL);
+		kfree(event->addr_filter_ranges);
+	}
 
-	if (event->destroy)
+	if (event->destroy) {
+		WARN_ON_ONCE(!pmu);
 		event->destroy(event);
+	}
 
 	/*
 	 * Must be after ->destroy(), due to uprobe_perf_close() using
@@ -5363,8 +5381,10 @@ static void _free_event(struct perf_event *event)
 	if (event->hw.target)
 		put_task_struct(event->hw.target);
 
-	if (event->pmu_ctx)
+	if (event->pmu_ctx) {
+		WARN_ON_ONCE(!pmu);
 		put_pmu_ctx(event->pmu_ctx);
+	}
 
 	/*
 	 * perf_event_free_task() relies on put_ctx() being 'last', in particular
@@ -5373,8 +5393,14 @@ static void _free_event(struct perf_event *event)
 	if (event->ctx)
 		put_ctx(event->ctx);
 
-	exclusive_event_destroy(event);
-	module_put(event->pmu->module);
+	if (pmu) {
+		exclusive_event_destroy(event);
+		module_put(pmu->module);
+		scoped_guard(spinlock, &pmu->events_lock) {
+			list_del(&event->pmu_list);
+			wake_up_var(pmu);
+		}
+	}
 
 	call_rcu(&event->rcu_head, free_event_rcu);
 }
@@ -5492,7 +5518,11 @@ int perf_event_release_kernel(struct perf_event *event)
 	 * Thus this guarantees that we will in fact observe and kill _ALL_
 	 * child events.
 	 */
-	perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+	if (event->state > PERF_EVENT_STATE_REVOKED) {
+		perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
+	} else {
+		event->state = PERF_EVENT_STATE_DEAD;
+	}
 
 	perf_event_ctx_unlock(event, ctx);
 
@@ -5803,7 +5833,7 @@ __perf_read(struct perf_event *event, char __user *buf, size_t count)
 	 * error state (i.e. because it was pinned but it couldn't be
 	 * scheduled on to the CPU at some point).
 	 */
-	if (event->state == PERF_EVENT_STATE_ERROR)
+	if (event->state <= PERF_EVENT_STATE_ERROR)
 		return 0;
 
 	if (count < event->read_size)
@@ -5842,8 +5872,14 @@ static __poll_t perf_poll(struct file *file, poll_table *wait)
 	struct perf_buffer *rb;
 	__poll_t events = EPOLLHUP;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return EPOLLERR;
+
 	poll_wait(file, &event->waitq, wait);
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return EPOLLERR;
+
 	if (is_event_hup(event))
 		return events;
 
@@ -6013,7 +6049,7 @@ static inline int perf_fget_light(int fd, struct fd *p)
 }
 
 static int perf_event_set_output(struct perf_event *event,
-				 struct perf_event *output_event);
+				 struct perf_event *output_event, bool force);
 static int perf_event_set_filter(struct perf_event *event, void __user *arg);
 static int perf_copy_attr(struct perf_event_attr __user *uattr,
 			  struct perf_event_attr *attr);
@@ -6023,6 +6059,9 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 	void (*func)(struct perf_event *);
 	u32 flags = arg;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return -ENODEV;
+
 	switch (cmd) {
 	case PERF_EVENT_IOC_ENABLE:
 		func = _perf_event_enable;
@@ -6065,10 +6104,10 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
 			if (ret)
 				return ret;
 			output_event = fd_file(output)->private_data;
-			ret = perf_event_set_output(event, output_event);
+			ret = perf_event_set_output(event, output_event, false);
 			fdput(output);
 		} else {
-			ret = perf_event_set_output(event, NULL);
+			ret = perf_event_set_output(event, NULL, false);
 		}
 		return ret;
 	}
@@ -6472,6 +6511,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
 	unsigned long size = perf_data_size(rb);
 	bool detach_rest = false;
 
+	/* FIXIES vs perf_pmu_unregister() */
 	if (event->pmu->event_unmapped)
 		event->pmu->event_unmapped(event, vma->vm_mm);
 
@@ -6591,7 +6631,15 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	unsigned long vma_size;
 	unsigned long nr_pages;
 	long user_extra = 0, extra = 0;
-	int ret = 0, flags = 0;
+	int ret, flags = 0;
+
+	ret = security_perf_event_read(event);
+	if (ret)
+		return ret;
+
+	/* XXX needs event->mmap_mutex? */
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return -ENODEV;
 
 	/*
 	 * Don't allow mmap() of inherited per-task counters. This would
@@ -6604,10 +6652,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	ret = security_perf_event_read(event);
-	if (ret)
-		return ret;
-
 	vma_size = vma->vm_end - vma->vm_start;
 
 	if (vma->vm_pgoff == 0) {
@@ -6810,6 +6854,9 @@ static int perf_fasync(int fd, struct file *filp, int on)
 	struct perf_event *event = filp->private_data;
 	int retval;
 
+	if (event->state <= PERF_EVENT_STATE_REVOKED)
+		return -ENODEV;
+
 	inode_lock(inode);
 	retval = fasync_helper(fd, filp, on, &event->fasync);
 	inode_unlock(inode);
@@ -11513,6 +11560,7 @@ static int perf_event_idx_default(struct perf_event *event)
 
 static void free_pmu_context(struct pmu *pmu)
 {
+	free_percpu(pmu->pmu_disable_count);
 	free_percpu(pmu->cpu_pmu_context);
 }
 
@@ -11753,6 +11801,7 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	if (type >= 0)
 		max = type;
 
+	// XXX broken vs perf_init_event(), this publishes before @pmu is finalized
 	ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
 	if (ret < 0)
 		goto free_pdc;
@@ -11809,8 +11858,14 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	if (!pmu->event_idx)
 		pmu->event_idx = perf_event_idx_default;
 
-	list_add_rcu(&pmu->entry, &pmus);
 	atomic_set(&pmu->exclusive_cnt, 0);
+	INIT_LIST_HEAD(&pmu->events);
+	spin_lock_init(&pmu->events_lock);
+
+	/*
+	 * Publish the pmu after it is fully initialized.
+	 */
+	list_add_rcu(&pmu->entry, &pmus);
 	ret = 0;
 unlock:
 	mutex_unlock(&pmus_lock);
@@ -11832,10 +11887,110 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_register);
 
-void perf_pmu_unregister(struct pmu *pmu)
+static void __pmu_detach_event(struct pmu *pmu, struct perf_event *event,
+			       struct perf_event_context *ctx)
 {
+	/*
+	 * De-schedule the event and mark it EXIT_PMU.
+	 */
+	perf_event_exit_event(event, ctx, true);
+
+	/*
+	 * All _free_event() bits that rely on event->pmu:
+	 */
+	perf_event_set_output(event, NULL, true);
+
+	perf_event_free_bpf_prog(event);
+	perf_addr_filters_splice(event, NULL);
+	kfree(event->addr_filter_ranges);
+
+	if (event->destroy) {
+		event->destroy(event);
+		event->destroy = NULL;
+	}
+
+	if (event->pmu_ctx) {
+		/*
+		 * Make sure pmu->cpu_pmu_context is unused. An alternative is
+		 * to have it be pointers and make put_pmu_ctx()'s
+		 * epc->embedded case be another RCU free case.
+		 */
+		put_pmu_ctx(event->pmu_ctx);
+		event->pmu_ctx = NULL;
+	}
+
+	exclusive_event_destroy(event);
+	module_put(pmu->module);
+
+	event->pmu = NULL; /* force fault instead of UAF */
+}
+
+static void pmu_detach_event(struct pmu *pmu, struct perf_event *event)
+{
+	struct perf_event_context *ctx;
+
+	ctx = perf_event_ctx_lock(event);
+	__pmu_detach_event(pmu, event, ctx);
+	perf_event_ctx_unlock(event, ctx);
+
+	scoped_guard(spinlock, &pmu->events_lock)
+		list_del(&event->pmu_list);
+}
+
+static struct perf_event *pmu_get_event(struct pmu *pmu)
+{
+	struct perf_event *event;
+
+	guard(spinlock)(&pmu->events_lock);
+	list_for_each_entry(event, &pmu->events, pmu_list) {
+		if (atomic_long_inc_not_zero(&event->refcount))
+			return event;
+	}
+
+	return NULL;
+}
+
+static bool pmu_empty(struct pmu *pmu)
+{
+	guard(spinlock)(&pmu->events_lock);
+	return list_empty(&pmu->events);
+}
+
+static void pmu_detach_events(struct pmu *pmu)
+{
+	struct perf_event *event;
+
+	for (;;) {
+		event = pmu_get_event(pmu);
+		if (!event)
+			break;
+
+		pmu_detach_event(pmu, event);
+		put_event(event);
+	}
+
+	/*
+	 * wait for pending _free_event()s
+	 */
+	wait_var_event(pmu, pmu_empty(pmu));
+}
+
+int perf_pmu_unregister(struct pmu *pmu)
+{
+	/*
+	 * FIXME!
+	 *
+	 *   perf_mmap_close(); event->pmu->event_unmapped()
+	 *
+	 * XXX this check is racy vs perf_event_alloc()
+	 */
+	if (pmu->event_unmapped && !pmu_empty(pmu))
+		return -EBUSY;
+
 	mutex_lock(&pmus_lock);
 	list_del_rcu(&pmu->entry);
+	idr_remove(&pmu_idr, pmu->type);
+	mutex_unlock(&pmus_lock);
 
 	/*
 	 * We dereference the pmu list under both SRCU and regular RCU, so
@@ -11844,16 +11999,29 @@ void perf_pmu_unregister(struct pmu *pmu)
 	synchronize_srcu(&pmus_srcu);
 	synchronize_rcu();
 
-	free_percpu(pmu->pmu_disable_count);
-	idr_remove(&pmu_idr, pmu->type);
+	/*
+	 * PMU is removed from the pmus list, so no new events will
+	 * be created, now take care of the existing ones.
+	 */
+	pmu_detach_events(pmu);
+
+	/*
+	 * PMU is unused, make it go away.
+	 */
 	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
 		if (pmu->nr_addr_filters)
 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
 		device_del(pmu->dev);
 		put_device(pmu->dev);
 	}
+
+	/* 
+	 * XXX -- broken? can still contain SW events at this point?
+	 * audit more, make sure DETACH_GROUP DTRT
+	 */
 	free_pmu_context(pmu);
-	mutex_unlock(&pmus_lock);
+
+	return 0;
 }
 EXPORT_SYMBOL_GPL(perf_pmu_unregister);
 
@@ -12330,6 +12498,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
+	scoped_guard(spinlock, &pmu->events_lock)
+		list_add(&event->pmu_list, &pmu->events);
+
 	return event;
 
 err_callchain_buffer:
@@ -12493,7 +12664,7 @@ static void mutex_lock_double(struct mutex *a, struct mutex *b)
 }
 
 static int
-perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
+perf_event_set_output(struct perf_event *event, struct perf_event *output_event, bool force)
 {
 	struct perf_buffer *rb = NULL;
 	int ret = -EINVAL;
@@ -12549,8 +12720,12 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 	mutex_lock_double(&event->mmap_mutex, &output_event->mmap_mutex);
 set:
 	/* Can't redirect output if we've got an active mmap() */
-	if (atomic_read(&event->mmap_count))
-		goto unlock;
+	if (atomic_read(&event->mmap_count)) {
+		if (!force)
+			goto unlock;
+
+		WARN_ON_ONCE(event->pmu->event_unmapped);
+	}
 
 	if (output_event) {
 		/* get the rb we want to redirect to */
@@ -12740,6 +12915,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		if (err)
 			goto err_fd;
 		group_leader = fd_file(group)->private_data;
+		if (group_leader->state <= PERF_EVENT_STATE_REVOKED) {
+			err = -ENODEV;
+			goto err_group_fd;
+		}
 		if (flags & PERF_FLAG_FD_OUTPUT)
 			output_event = group_leader;
 		if (flags & PERF_FLAG_FD_NO_GROUP)
@@ -12916,7 +13095,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	event->pmu_ctx = pmu_ctx;
 
 	if (output_event) {
-		err = perf_event_set_output(event, output_event);
+		err = perf_event_set_output(event, output_event, false);
 		if (err)
 			goto err_context;
 	}
@@ -13287,10 +13466,11 @@ static void sync_child_event(struct perf_event *child_event)
 }
 
 static void
-perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
+perf_event_exit_event(struct perf_event *event,
+		      struct perf_event_context *ctx, bool revoke)
 {
 	struct perf_event *parent_event = event->parent;
-	unsigned long detach_flags = 0;
+	unsigned long detach_flags = DETACH_EXIT;
 
 	if (parent_event) {
 		/*
@@ -13305,16 +13485,14 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 		 * Do destroy all inherited groups, we don't care about those
 		 * and being thorough is better.
 		 */
-		detach_flags = DETACH_GROUP | DETACH_CHILD;
+		detach_flags |= DETACH_GROUP | DETACH_CHILD;
 		mutex_lock(&parent_event->child_mutex);
 	}
 
-	perf_remove_from_context(event, detach_flags);
+	if (revoke)
+		detach_flags |= DETACH_GROUP | DETACH_REVOKE;
 
-	raw_spin_lock_irq(&ctx->lock);
-	if (event->state > PERF_EVENT_STATE_EXIT)
-		perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
-	raw_spin_unlock_irq(&ctx->lock);
+	perf_remove_from_context(event, detach_flags);
 
 	/*
 	 * Child events can be freed.
@@ -13390,7 +13568,7 @@ static void perf_event_exit_task_context(struct task_struct *child)
 	perf_event_task(child, child_ctx, 0);
 
 	list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
-		perf_event_exit_event(child_event, child_ctx);
+		perf_event_exit_event(child_event, child_ctx, false);
 
 	mutex_unlock(&child_ctx->mutex);
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] perf: Add dummy pmu module
  2024-10-16  8:51   ` Peter Zijlstra
@ 2024-10-18 19:30     ` Lucas De Marchi
  0 siblings, 0 replies; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-18 19:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Wed, Oct 16, 2024 at 10:51:02AM +0200, Peter Zijlstra wrote:
>On Tue, Oct 08, 2024 at 01:34:57PM -0500, Lucas De Marchi wrote:
>> +static int parse_device(const char __user *ubuf, size_t size, u32 *instance)
>> +{
>> +	char buf[16];
>> +	ssize_t len;
>> +
>> +	if (size > sizeof(buf) - 1)
>> +		return -E2BIG;
>> +
>> +	len = strncpy_from_user(buf, ubuf, sizeof(buf));
>> +	if (len < 0 || len >= sizeof(buf) - 1)
>> +		return -E2BIG;
>> +
>> +	if (kstrtou32(buf, 0, instance))
>> +		return -EINVAL;
>> +
>> +	return size;
>> +}
>
>I had to change this to:
>
>+static int parse_device(const char __user *ubuf, size_t size, u32 *instance)
>+{
>+       int ret = kstrtouint_from_user(ubuf, size, 0, instance);
>+       if (ret) {
>+               printk("suckage: %d\n", ret);
>+               return ret;
>+       }
>+       return size;
>+}
>
>because otherwise it didn't want to work for me; I kept getting garbage
>at the end and things failing. Specifically, it looked like the string
>presented by userspace was not '\0' terminated, ubuf was pointing to
>"1...garbage..." and size was 1.

oh... it's the sysfs (that PCI drivers use) that is guaranteed to be
nul-terminated. debugfs is not. And it probably worked for me because of
CONFIG_INIT_STACK_ALL_ZERO=y that comes from my distro.

and this version is also shorter and simpler.

thanks
Lucas De Marchi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-16 12:03         ` Peter Zijlstra
@ 2024-10-18 19:46           ` Lucas De Marchi
  2024-10-22 21:52             ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-18 19:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Wed, Oct 16, 2024 at 02:03:02PM +0200, Peter Zijlstra wrote:
>On Mon, Oct 14, 2024 at 09:25:19PM +0200, Peter Zijlstra wrote:
>
>> Let me ponder that a little bit.
>
>So I did the thing on top of the get/put thing that would allow you to
>get rid of the ->closed thing, and before I was finished I already hated
>all of it :-(
>
>The thing is, if you're going to the effort of adding get/put and
>putting the responsibility on the implementation, then the ->closed
>thing is only a little extra ask.
>
>So... I wondered, how hard it would be for perf_pmu_unregister() to do
>what you want.
>
>Time passed, hacks were done...
>
>and while what I have is still riddled with holes (more work is
>definitely required), it does pass your dummy_pmu test scipt.
>
>I'll poke at this a little longer. Afaict it should be possible to make
>this good enough for what you want. Just needs more holes filled.
>
>---
> include/linux/perf_event.h |  13 ++-
> kernel/events/core.c       | 260 ++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 228 insertions(+), 45 deletions(-)
>
>diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>index fb908843f209..20995d33d27e 100644
>--- a/include/linux/perf_event.h
>+++ b/include/linux/perf_event.h
>@@ -318,6 +318,9 @@ struct perf_output_handle;
> struct pmu {
> 	struct list_head		entry;
>
>+	spinlock_t			events_lock;
>+	struct list_head		events;

oh... I looked at several lists and was wondering which one would
contain the events of our pmu. Now I see we didn't have that :)

>+
> 	struct module			*module;
> 	struct device			*dev;
> 	struct device			*parent;
>@@ -612,9 +615,10 @@ struct perf_addr_filter_range {
>  * enum perf_event_state - the states of an event:
>  */
> enum perf_event_state {
>-	PERF_EVENT_STATE_DEAD		= -4,
>-	PERF_EVENT_STATE_EXIT		= -3,
>-	PERF_EVENT_STATE_ERROR		= -2,
>+	PERF_EVENT_STATE_DEAD		= -5,
>+	PERF_EVENT_STATE_REVOKED	= -4, /* pmu gone, must not touch */
>+	PERF_EVENT_STATE_EXIT		= -3, /* task died, still inherit */
>+	PERF_EVENT_STATE_ERROR		= -2, /* scheduling error, can enable */
> 	PERF_EVENT_STATE_OFF		= -1,
> 	PERF_EVENT_STATE_INACTIVE	=  0,
> 	PERF_EVENT_STATE_ACTIVE		=  1,
>@@ -853,6 +857,7 @@ struct perf_event {
> 	void *security;
> #endif
> 	struct list_head		sb_list;
>+	struct list_head		pmu_list;
>
> 	/*
> 	 * Certain events gets forwarded to another pmu internally by over-
>@@ -1103,7 +1108,7 @@ extern void perf_aux_output_flag(struct perf_output_handle *handle, u64 flags);
> extern void perf_event_itrace_started(struct perf_event *event);
>
> extern int perf_pmu_register(struct pmu *pmu, const char *name, int type);
>-extern void perf_pmu_unregister(struct pmu *pmu);
>+extern int perf_pmu_unregister(struct pmu *pmu);
>
> extern void __perf_event_task_sched_in(struct task_struct *prev,
> 				       struct task_struct *task);
>diff --git a/kernel/events/core.c b/kernel/events/core.c
>index cdd09769e6c5..e66827367a97 100644
>--- a/kernel/events/core.c
>+++ b/kernel/events/core.c
>@@ -2406,7 +2406,9 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
>
> #define DETACH_GROUP	0x01UL
> #define DETACH_CHILD	0x02UL
>-#define DETACH_DEAD	0x04UL
>+#define DETACH_EXIT	0x04UL
>+#define DETACH_REVOKE	0x08UL
>+#define DETACH_DEAD	0x10UL
>
> /*
>  * Cross CPU call to remove a performance event
>@@ -2421,6 +2423,7 @@ __perf_remove_from_context(struct perf_event *event,
> 			   void *info)
> {
> 	struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
>+	enum perf_event_state state = PERF_EVENT_STATE_OFF;
> 	unsigned long flags = (unsigned long)info;
>
> 	ctx_time_update(cpuctx, ctx);
>@@ -2429,16 +2432,22 @@ __perf_remove_from_context(struct perf_event *event,
> 	 * Ensure event_sched_out() switches to OFF, at the very least
> 	 * this avoids raising perf_pending_task() at this time.
> 	 */
>-	if (flags & DETACH_DEAD)
>+	if (flags & DETACH_EXIT)
>+		state = PERF_EVENT_STATE_EXIT;
>+	if (flags & DETACH_REVOKE)
>+		state = PERF_EVENT_STATE_REVOKED;
>+	if (flags & DETACH_DEAD) {
> 		event->pending_disable = 1;
>+		state = PERF_EVENT_STATE_DEAD;
>+	}
> 	event_sched_out(event, ctx);
> 	if (flags & DETACH_GROUP)
> 		perf_group_detach(event);
> 	if (flags & DETACH_CHILD)
> 		perf_child_detach(event);
> 	list_del_event(event, ctx);
>-	if (flags & DETACH_DEAD)
>-		event->state = PERF_EVENT_STATE_DEAD;
>+
>+	event->state = state;
>
> 	if (!pmu_ctx->nr_events) {
> 		pmu_ctx->rotate_necessary = 0;
>@@ -4508,7 +4517,8 @@ static void perf_event_enable_on_exec(struct perf_event_context *ctx)
>
> static void perf_remove_from_owner(struct perf_event *event);
> static void perf_event_exit_event(struct perf_event *event,
>-				  struct perf_event_context *ctx);
>+				  struct perf_event_context *ctx,
>+				  bool revoke);
>
> /*
>  * Removes all events from the current task that have been marked
>@@ -4535,7 +4545,7 @@ static void perf_event_remove_on_exec(struct perf_event_context *ctx)
>
> 		modified = true;
>
>-		perf_event_exit_event(event, ctx);
>+		perf_event_exit_event(event, ctx, false);
> 	}
>
> 	raw_spin_lock_irqsave(&ctx->lock, flags);
>@@ -5121,6 +5131,7 @@ static bool is_sb_event(struct perf_event *event)
> 	    attr->context_switch || attr->text_poke ||
> 	    attr->bpf_event)
> 		return true;
>+
> 	return false;
> }
>
>@@ -5321,6 +5332,8 @@ static void perf_pending_task_sync(struct perf_event *event)
>
> static void _free_event(struct perf_event *event)
> {
>+	struct pmu *pmu = event->pmu;
>+
> 	irq_work_sync(&event->pending_irq);
> 	irq_work_sync(&event->pending_disable_irq);
> 	perf_pending_task_sync(event);
>@@ -5330,6 +5343,7 @@ static void _free_event(struct perf_event *event)
> 	security_perf_event_free(event);
>
> 	if (event->rb) {
>+		WARN_ON_ONCE(!pmu);
> 		/*
> 		 * Can happen when we close an event with re-directed output.
> 		 *
>@@ -5349,12 +5363,16 @@ static void _free_event(struct perf_event *event)
> 			put_callchain_buffers();
> 	}
>
>-	perf_event_free_bpf_prog(event);
>-	perf_addr_filters_splice(event, NULL);
>-	kfree(event->addr_filter_ranges);
>+	if (pmu) {
>+		perf_event_free_bpf_prog(event);
>+		perf_addr_filters_splice(event, NULL);
>+		kfree(event->addr_filter_ranges);
>+	}
>
>-	if (event->destroy)
>+	if (event->destroy) {
>+		WARN_ON_ONCE(!pmu);
> 		event->destroy(event);
>+	}
>
> 	/*
> 	 * Must be after ->destroy(), due to uprobe_perf_close() using
>@@ -5363,8 +5381,10 @@ static void _free_event(struct perf_event *event)
> 	if (event->hw.target)
> 		put_task_struct(event->hw.target);
>
>-	if (event->pmu_ctx)
>+	if (event->pmu_ctx) {
>+		WARN_ON_ONCE(!pmu);
> 		put_pmu_ctx(event->pmu_ctx);
>+	}
>
> 	/*
> 	 * perf_event_free_task() relies on put_ctx() being 'last', in particular
>@@ -5373,8 +5393,14 @@ static void _free_event(struct perf_event *event)
> 	if (event->ctx)
> 		put_ctx(event->ctx);
>
>-	exclusive_event_destroy(event);
>-	module_put(event->pmu->module);
>+	if (pmu) {
>+		exclusive_event_destroy(event);
>+		module_put(pmu->module);
>+		scoped_guard(spinlock, &pmu->events_lock) {
>+			list_del(&event->pmu_list);
>+			wake_up_var(pmu);
>+		}
>+	}
>
> 	call_rcu(&event->rcu_head, free_event_rcu);
> }
>@@ -5492,7 +5518,11 @@ int perf_event_release_kernel(struct perf_event *event)
> 	 * Thus this guarantees that we will in fact observe and kill _ALL_
> 	 * child events.
> 	 */
>-	perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
>+	if (event->state > PERF_EVENT_STATE_REVOKED) {
>+		perf_remove_from_context(event, DETACH_GROUP|DETACH_DEAD);
>+	} else {
>+		event->state = PERF_EVENT_STATE_DEAD;
>+	}
>
> 	perf_event_ctx_unlock(event, ctx);
>
>@@ -5803,7 +5833,7 @@ __perf_read(struct perf_event *event, char __user *buf, size_t count)
> 	 * error state (i.e. because it was pinned but it couldn't be
> 	 * scheduled on to the CPU at some point).
> 	 */
>-	if (event->state == PERF_EVENT_STATE_ERROR)
>+	if (event->state <= PERF_EVENT_STATE_ERROR)
> 		return 0;
>
> 	if (count < event->read_size)
>@@ -5842,8 +5872,14 @@ static __poll_t perf_poll(struct file *file, poll_table *wait)
> 	struct perf_buffer *rb;
> 	__poll_t events = EPOLLHUP;
>
>+	if (event->state <= PERF_EVENT_STATE_REVOKED)
>+		return EPOLLERR;
>+
> 	poll_wait(file, &event->waitq, wait);
>
>+	if (event->state <= PERF_EVENT_STATE_REVOKED)
>+		return EPOLLERR;
>+
> 	if (is_event_hup(event))
> 		return events;
>
>@@ -6013,7 +6049,7 @@ static inline int perf_fget_light(int fd, struct fd *p)
> }
>
> static int perf_event_set_output(struct perf_event *event,
>-				 struct perf_event *output_event);
>+				 struct perf_event *output_event, bool force);
> static int perf_event_set_filter(struct perf_event *event, void __user *arg);
> static int perf_copy_attr(struct perf_event_attr __user *uattr,
> 			  struct perf_event_attr *attr);
>@@ -6023,6 +6059,9 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
> 	void (*func)(struct perf_event *);
> 	u32 flags = arg;
>
>+	if (event->state <= PERF_EVENT_STATE_REVOKED)
>+		return -ENODEV;
>+
> 	switch (cmd) {
> 	case PERF_EVENT_IOC_ENABLE:
> 		func = _perf_event_enable;
>@@ -6065,10 +6104,10 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
> 			if (ret)
> 				return ret;
> 			output_event = fd_file(output)->private_data;
>-			ret = perf_event_set_output(event, output_event);
>+			ret = perf_event_set_output(event, output_event, false);
> 			fdput(output);
> 		} else {
>-			ret = perf_event_set_output(event, NULL);
>+			ret = perf_event_set_output(event, NULL, false);
> 		}
> 		return ret;
> 	}
>@@ -6472,6 +6511,7 @@ static void perf_mmap_close(struct vm_area_struct *vma)
> 	unsigned long size = perf_data_size(rb);
> 	bool detach_rest = false;
>
>+	/* FIXIES vs perf_pmu_unregister() */
> 	if (event->pmu->event_unmapped)
> 		event->pmu->event_unmapped(event, vma->vm_mm);
>
>@@ -6591,7 +6631,15 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> 	unsigned long vma_size;
> 	unsigned long nr_pages;
> 	long user_extra = 0, extra = 0;
>-	int ret = 0, flags = 0;
>+	int ret, flags = 0;
>+
>+	ret = security_perf_event_read(event);
>+	if (ret)
>+		return ret;
>+
>+	/* XXX needs event->mmap_mutex? */
>+	if (event->state <= PERF_EVENT_STATE_REVOKED)
>+		return -ENODEV;
>
> 	/*
> 	 * Don't allow mmap() of inherited per-task counters. This would
>@@ -6604,10 +6652,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
> 	if (!(vma->vm_flags & VM_SHARED))
> 		return -EINVAL;
>
>-	ret = security_perf_event_read(event);
>-	if (ret)
>-		return ret;
>-
> 	vma_size = vma->vm_end - vma->vm_start;
>
> 	if (vma->vm_pgoff == 0) {
>@@ -6810,6 +6854,9 @@ static int perf_fasync(int fd, struct file *filp, int on)
> 	struct perf_event *event = filp->private_data;
> 	int retval;
>
>+	if (event->state <= PERF_EVENT_STATE_REVOKED)
>+		return -ENODEV;
>+
> 	inode_lock(inode);
> 	retval = fasync_helper(fd, filp, on, &event->fasync);
> 	inode_unlock(inode);
>@@ -11513,6 +11560,7 @@ static int perf_event_idx_default(struct perf_event *event)
>
> static void free_pmu_context(struct pmu *pmu)
> {
>+	free_percpu(pmu->pmu_disable_count);
> 	free_percpu(pmu->cpu_pmu_context);
> }
>
>@@ -11753,6 +11801,7 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> 	if (type >= 0)
> 		max = type;
>
>+	// XXX broken vs perf_init_event(), this publishes before @pmu is finalized
> 	ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
> 	if (ret < 0)
> 		goto free_pdc;
>@@ -11809,8 +11858,14 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> 	if (!pmu->event_idx)
> 		pmu->event_idx = perf_event_idx_default;
>
>-	list_add_rcu(&pmu->entry, &pmus);
> 	atomic_set(&pmu->exclusive_cnt, 0);
>+	INIT_LIST_HEAD(&pmu->events);
>+	spin_lock_init(&pmu->events_lock);
>+
>+	/*
>+	 * Publish the pmu after it is fully initialized.
>+	 */
>+	list_add_rcu(&pmu->entry, &pmus);
> 	ret = 0;
> unlock:
> 	mutex_unlock(&pmus_lock);
>@@ -11832,10 +11887,110 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
> }
> EXPORT_SYMBOL_GPL(perf_pmu_register);
>
>-void perf_pmu_unregister(struct pmu *pmu)
>+static void __pmu_detach_event(struct pmu *pmu, struct perf_event *event,
>+			       struct perf_event_context *ctx)
> {
>+	/*
>+	 * De-schedule the event and mark it EXIT_PMU.
>+	 */
>+	perf_event_exit_event(event, ctx, true);
>+
>+	/*
>+	 * All _free_event() bits that rely on event->pmu:
>+	 */
>+	perf_event_set_output(event, NULL, true);
>+
>+	perf_event_free_bpf_prog(event);
>+	perf_addr_filters_splice(event, NULL);
>+	kfree(event->addr_filter_ranges);
>+
>+	if (event->destroy) {
>+		event->destroy(event);
>+		event->destroy = NULL;
>+	}
>+
>+	if (event->pmu_ctx) {
>+		/*
>+		 * Make sure pmu->cpu_pmu_context is unused. An alternative is
>+		 * to have it be pointers and make put_pmu_ctx()'s
>+		 * epc->embedded case be another RCU free case.
>+		 */
>+		put_pmu_ctx(event->pmu_ctx);
>+		event->pmu_ctx = NULL;
>+	}
>+
>+	exclusive_event_destroy(event);
>+	module_put(pmu->module);
>+
>+	event->pmu = NULL; /* force fault instead of UAF */
>+}
>+
>+static void pmu_detach_event(struct pmu *pmu, struct perf_event *event)
>+{
>+	struct perf_event_context *ctx;
>+
>+	ctx = perf_event_ctx_lock(event);
>+	__pmu_detach_event(pmu, event, ctx);
>+	perf_event_ctx_unlock(event, ctx);
>+
>+	scoped_guard(spinlock, &pmu->events_lock)
>+		list_del(&event->pmu_list);
>+}
>+
>+static struct perf_event *pmu_get_event(struct pmu *pmu)
>+{
>+	struct perf_event *event;
>+
>+	guard(spinlock)(&pmu->events_lock);
>+	list_for_each_entry(event, &pmu->events, pmu_list) {
>+		if (atomic_long_inc_not_zero(&event->refcount))
>+			return event;
>+	}
>+
>+	return NULL;
>+}
>+
>+static bool pmu_empty(struct pmu *pmu)
>+{
>+	guard(spinlock)(&pmu->events_lock);
>+	return list_empty(&pmu->events);
>+}
>+
>+static void pmu_detach_events(struct pmu *pmu)
>+{
>+	struct perf_event *event;
>+
>+	for (;;) {
>+		event = pmu_get_event(pmu);
>+		if (!event)
>+			break;
>+
>+		pmu_detach_event(pmu, event);
>+		put_event(event);
>+	}
>+
>+	/*
>+	 * wait for pending _free_event()s
>+	 */
>+	wait_var_event(pmu, pmu_empty(pmu));


ok... so now we are synchronosly moving the events to a revoked state
during unregister, so we wouldn't need the refcount on the driver side
anymore and can just free the objects right after return.

I will give this a try with i915 and/or xe.

thanks
Lucas De Marchi


>+}
>+
>+int perf_pmu_unregister(struct pmu *pmu)
>+{
>+	/*
>+	 * FIXME!
>+	 *
>+	 *   perf_mmap_close(); event->pmu->event_unmapped()
>+	 *
>+	 * XXX this check is racy vs perf_event_alloc()
>+	 */
>+	if (pmu->event_unmapped && !pmu_empty(pmu))
>+		return -EBUSY;
>+
> 	mutex_lock(&pmus_lock);
> 	list_del_rcu(&pmu->entry);
>+	idr_remove(&pmu_idr, pmu->type);
>+	mutex_unlock(&pmus_lock);
>
> 	/*
> 	 * We dereference the pmu list under both SRCU and regular RCU, so
>@@ -11844,16 +11999,29 @@ void perf_pmu_unregister(struct pmu *pmu)
> 	synchronize_srcu(&pmus_srcu);
> 	synchronize_rcu();
>
>-	free_percpu(pmu->pmu_disable_count);
>-	idr_remove(&pmu_idr, pmu->type);
>+	/*
>+	 * PMU is removed from the pmus list, so no new events will
>+	 * be created, now take care of the existing ones.
>+	 */
>+	pmu_detach_events(pmu);
>+
>+	/*
>+	 * PMU is unused, make it go away.
>+	 */
> 	if (pmu_bus_running && pmu->dev && pmu->dev != PMU_NULL_DEV) {
> 		if (pmu->nr_addr_filters)
> 			device_remove_file(pmu->dev, &dev_attr_nr_addr_filters);
> 		device_del(pmu->dev);
> 		put_device(pmu->dev);
> 	}
>+
>+	/*
>+	 * XXX -- broken? can still contain SW events at this point?
>+	 * audit more, make sure DETACH_GROUP DTRT
>+	 */
> 	free_pmu_context(pmu);
>-	mutex_unlock(&pmus_lock);
>+
>+	return 0;
> }
> EXPORT_SYMBOL_GPL(perf_pmu_unregister);
>
>@@ -12330,6 +12498,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> 	/* symmetric to unaccount_event() in _free_event() */
> 	account_event(event);
>
>+	scoped_guard(spinlock, &pmu->events_lock)
>+		list_add(&event->pmu_list, &pmu->events);
>+
> 	return event;
>
> err_callchain_buffer:
>@@ -12493,7 +12664,7 @@ static void mutex_lock_double(struct mutex *a, struct mutex *b)
> }
>
> static int
>-perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
>+perf_event_set_output(struct perf_event *event, struct perf_event *output_event, bool force)
> {
> 	struct perf_buffer *rb = NULL;
> 	int ret = -EINVAL;
>@@ -12549,8 +12720,12 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> 	mutex_lock_double(&event->mmap_mutex, &output_event->mmap_mutex);
> set:
> 	/* Can't redirect output if we've got an active mmap() */
>-	if (atomic_read(&event->mmap_count))
>-		goto unlock;
>+	if (atomic_read(&event->mmap_count)) {
>+		if (!force)
>+			goto unlock;
>+
>+		WARN_ON_ONCE(event->pmu->event_unmapped);
>+	}
>
> 	if (output_event) {
> 		/* get the rb we want to redirect to */
>@@ -12740,6 +12915,10 @@ SYSCALL_DEFINE5(perf_event_open,
> 		if (err)
> 			goto err_fd;
> 		group_leader = fd_file(group)->private_data;
>+		if (group_leader->state <= PERF_EVENT_STATE_REVOKED) {
>+			err = -ENODEV;
>+			goto err_group_fd;
>+		}
> 		if (flags & PERF_FLAG_FD_OUTPUT)
> 			output_event = group_leader;
> 		if (flags & PERF_FLAG_FD_NO_GROUP)
>@@ -12916,7 +13095,7 @@ SYSCALL_DEFINE5(perf_event_open,
> 	event->pmu_ctx = pmu_ctx;
>
> 	if (output_event) {
>-		err = perf_event_set_output(event, output_event);
>+		err = perf_event_set_output(event, output_event, false);
> 		if (err)
> 			goto err_context;
> 	}
>@@ -13287,10 +13466,11 @@ static void sync_child_event(struct perf_event *child_event)
> }
>
> static void
>-perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
>+perf_event_exit_event(struct perf_event *event,
>+		      struct perf_event_context *ctx, bool revoke)
> {
> 	struct perf_event *parent_event = event->parent;
>-	unsigned long detach_flags = 0;
>+	unsigned long detach_flags = DETACH_EXIT;
>
> 	if (parent_event) {
> 		/*
>@@ -13305,16 +13485,14 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> 		 * Do destroy all inherited groups, we don't care about those
> 		 * and being thorough is better.
> 		 */
>-		detach_flags = DETACH_GROUP | DETACH_CHILD;
>+		detach_flags |= DETACH_GROUP | DETACH_CHILD;
> 		mutex_lock(&parent_event->child_mutex);
> 	}
>
>-	perf_remove_from_context(event, detach_flags);
>+	if (revoke)
>+		detach_flags |= DETACH_GROUP | DETACH_REVOKE;
>
>-	raw_spin_lock_irq(&ctx->lock);
>-	if (event->state > PERF_EVENT_STATE_EXIT)
>-		perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
>-	raw_spin_unlock_irq(&ctx->lock);
>+	perf_remove_from_context(event, detach_flags);
>
> 	/*
> 	 * Child events can be freed.
>@@ -13390,7 +13568,7 @@ static void perf_event_exit_task_context(struct task_struct *child)
> 	perf_event_task(child, child_ctx, 0);
>
> 	list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry)
>-		perf_event_exit_event(child_event, child_ctx);
>+		perf_event_exit_event(child_event, child_ctx, false);
>
> 	mutex_unlock(&child_ctx->mutex);
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-18 19:46           ` Lucas De Marchi
@ 2024-10-22 21:52             ` Peter Zijlstra
  2024-10-23  5:07               ` Lucas De Marchi
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2024-10-22 21:52 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Fri, Oct 18, 2024 at 02:46:31PM -0500, Lucas De Marchi wrote:

> I will give this a try with i915 and/or xe.

Less horrible version here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/pmu-unregister

I've just pushed it out to the robots, but it builds, passes perf test
and your dummy_pmu testcase (for me, on my one testbox and .config
etc..)

I'll let it sit with the robots for a few days and if they don't
complain I'll post it.

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-22 21:52             ` Peter Zijlstra
@ 2024-10-23  5:07               ` Lucas De Marchi
  2024-10-31  5:07                 ` Lucas De Marchi
  0 siblings, 1 reply; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-23  5:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Tue, Oct 22, 2024 at 11:52:10PM +0200, Peter Zijlstra wrote:
>On Fri, Oct 18, 2024 at 02:46:31PM -0500, Lucas De Marchi wrote:
>
>> I will give this a try with i915 and/or xe.
>
>Less horrible version here:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/pmu-unregister
>
>I've just pushed it out to the robots, but it builds, passes perf test
>and your dummy_pmu testcase (for me, on my one testbox and .config
>etc..)

It passed for me as well with both dummy_pmu and with i915. I have some
changes to igt (i915/xe testsuite) that should bring some more coverage.
I minimized the pending test changes I had and posted to trigger CI:

https://patchwork.freedesktop.org/series/140355/

thanks
Lucas De Marchi


>
>I'll let it sit with the robots for a few days and if they don't
>complain I'll post it.
>
>Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-23  5:07               ` Lucas De Marchi
@ 2024-10-31  5:07                 ` Lucas De Marchi
  2024-10-31  9:45                   ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Lucas De Marchi @ 2024-10-31  5:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Wed, Oct 23, 2024 at 12:07:58AM -0500, Lucas De Marchi wrote:
>On Tue, Oct 22, 2024 at 11:52:10PM +0200, Peter Zijlstra wrote:
>>On Fri, Oct 18, 2024 at 02:46:31PM -0500, Lucas De Marchi wrote:
>>
>>>I will give this a try with i915 and/or xe.
>>
>>Less horrible version here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/pmu-unregister
>>
>>I've just pushed it out to the robots, but it builds, passes perf test
>>and your dummy_pmu testcase (for me, on my one testbox and .config
>>etc..)
>
>It passed for me as well with both dummy_pmu and with i915. I have some
>changes to igt (i915/xe testsuite) that should bring some more coverage.
>I minimized the pending test changes I had and posted to trigger CI:
>
>https://patchwork.freedesktop.org/series/140355/

Our CI also didn't trigger that pmu issue and the test could run
succesfully.

I did get a report from kernel test robot though:

https://lore.kernel.org/all/202410281436.8246527d-lkp@intel.com/

thanks
Lucas De Marchi

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/5] perf: Add pmu get/put
  2024-10-31  5:07                 ` Lucas De Marchi
@ 2024-10-31  9:45                   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2024-10-31  9:45 UTC (permalink / raw)
  To: Lucas De Marchi
  Cc: linux-kernel, dri-devel, Ingo Molnar, Arnaldo Carvalho de Melo,
	Umesh Nerlige Ramappa, Ian Rogers, Tvrtko Ursulin

On Thu, Oct 31, 2024 at 12:07:42AM -0500, Lucas De Marchi wrote:
> On Wed, Oct 23, 2024 at 12:07:58AM -0500, Lucas De Marchi wrote:
> > On Tue, Oct 22, 2024 at 11:52:10PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 18, 2024 at 02:46:31PM -0500, Lucas De Marchi wrote:
> > > 
> > > > I will give this a try with i915 and/or xe.
> > > 
> > > Less horrible version here:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/pmu-unregister
> > > 
> > > I've just pushed it out to the robots, but it builds, passes perf test
> > > and your dummy_pmu testcase (for me, on my one testbox and .config
> > > etc..)
> > 
> > It passed for me as well with both dummy_pmu and with i915. I have some
> > changes to igt (i915/xe testsuite) that should bring some more coverage.
> > I minimized the pending test changes I had and posted to trigger CI:
> > 
> > https://patchwork.freedesktop.org/series/140355/
> 
> Our CI also didn't trigger that pmu issue and the test could run
> succesfully.

Excellent.

> I did get a report from kernel test robot though:
> 
> https://lore.kernel.org/all/202410281436.8246527d-lkp@intel.com/

Yeah, I think I fixed that one, but the robot gifted me another one that
I still need to find time to address. I'm hoping this weel.

I'll repost properly once I fix it.

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-10-31  9:46 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-08 18:34 [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Lucas De Marchi
2024-10-08 18:34 ` [PATCH 1/5] perf: Add dummy pmu module Lucas De Marchi
2024-10-15  0:26   ` Jeff Johnson
2024-10-15  4:23     ` Lucas De Marchi
2024-10-16  8:51   ` Peter Zijlstra
2024-10-18 19:30     ` Lucas De Marchi
2024-10-08 18:34 ` [PATCH 2/5] perf: Move free outside of the mutex Lucas De Marchi
2024-10-08 18:34 ` [PATCH 3/5] perf: Add pmu get/put Lucas De Marchi
2024-10-09 10:44   ` kernel test robot
2024-10-14 17:32   ` Peter Zijlstra
2024-10-14 18:20     ` Lucas De Marchi
2024-10-14 19:25       ` Peter Zijlstra
2024-10-14 20:12         ` Lucas De Marchi
2024-10-16 12:03         ` Peter Zijlstra
2024-10-18 19:46           ` Lucas De Marchi
2024-10-22 21:52             ` Peter Zijlstra
2024-10-23  5:07               ` Lucas De Marchi
2024-10-31  5:07                 ` Lucas De Marchi
2024-10-31  9:45                   ` Peter Zijlstra
2024-10-08 18:35 ` [PATCH 4/5] perf/dummy_pmu: Tie pmu to device lifecycle Lucas De Marchi
2024-10-08 18:35 ` [PATCH 5/5] perf/dummy_pmu: Track and disable active events Lucas De Marchi
2024-10-11 22:21 ` [PATCH 0/5] perf: Fix pmu for drivers with bind/unbind Umesh Nerlige Ramappa
2024-10-11 23:03   ` Lucas De Marchi
2024-10-11 22:59 ` Lucas De Marchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox