* [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
@ 2024-06-24 15:15 Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 1/6] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner

Hi,

Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
the signal gets delayed until event_sched_out() which then uses
task_work_add() for its delivery. This breaks on PREEMPT_RT because the
signal is delivered with disabled preemption.

While looking at this, I also stumbled upon __perf_pending_irq() which
requires disabled interrupts but this is not the case on PREEMPT_RT.

This series aims to address both issues while not introducing a new issue
at the same time ;)
Any testing is appreciated.

v3…v4: https://lore.kernel.org/all/20240322065208.60456-1-bigeasy@linutronix.de/
   - Rebased on top of Frederic's series
      (https://lore.kernel.org/all/20240621091601.18227-1-frederic@kernel.org)
   - Frederic pointed out that perf_pending_task() needs to invoke
     perf_swevent_get_recursion_context() in order not to recurse if
     something within perf_swevent_.*_recursion_context() triggers a
     software event. To address this, the counters have been moved to
     the task_struct (#3 + #4) and preempt_disable() has been replaced
     with an RCU read lock (#5).
   - The remaining logic stays the same, which means the event is pushed
     to task-work instead of being delivered from IRQ-work. The series
     was tested with remove_on_exec as suggested by Marco Elver: On
     PREEMPT_RT a single invocation passes, 100 parallel invocations
     report (for some) unexpected SIGTRAPs and timeouts. This is also
     observed on !RT (without the series) with a higher task count.

v2…v3: https://lore.kernel.org/all/20240312180814.3373778-1-bigeasy@linutronix.de/
    - Marco suggested adding a few comments
      - Added a comment to __perf_event_overflow() to explain why irq_work
        is raised in the in_nmi() case.
      - Added a comment to perf_event_exit_event() to explain why the
        pending event is deleted.

v1…v2: https://lore.kernel.org/all/20240308175810.2894694-1-bigeasy@linutronix.de/
    - Marco pointed me to the testsuite that showed two problems:
      - Delayed task_work from NMI / missing events.
        Fixed by triggering a dummy irq_work to enforce an interrupt for
        the exit-to-userland path which checks task_work.
      - Increased ref-count on cleanup / during exec.
        Mostly addressed by the former change. There is still a window
        if the NMI occurs during execve(). This is addressed by removing
        the task_work before free_event().
      The testsuite (remove_on_exec) fails sometimes if the event/
      SIGTRAP is sent before the sighandler is installed.

Sebastian



* [PATCH v4 1/6] perf: Move irq_work_queue() where the event is prepared.
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
@ 2024-06-24 15:15 ` Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

The irq_work is accounted for, by incrementing the event's
ctx->nr_pending counter, only if perf_event::pending_sigtrap is zero.
The member perf_event::pending_addr might be overwritten by a subsequent
event if the signal has not yet been delivered while it is still
expected. The irq_work will not be enqueued again because it has a check
to be enqueued only once.

Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set, to make it more obvious that the
irq_work is scheduled only once.

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/events/core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 586d4f3676240..647abeeaeeb02 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9738,6 +9738,11 @@ static int __perf_event_overflow(struct perf_event *event,
 		if (!event->pending_sigtrap) {
 			event->pending_sigtrap = pending_id;
 			local_inc(&event->ctx->nr_pending);
+
+			event->pending_addr = 0;
+			if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
+				event->pending_addr = data->addr;
+			irq_work_queue(&event->pending_irq);
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
 			 * Should not be able to return to user space without
@@ -9753,11 +9758,6 @@ static int __perf_event_overflow(struct perf_event *event,
 			 */
 			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
 		}
-
-		event->pending_addr = 0;
-		if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
-			event->pending_addr = data->addr;
-		irq_work_queue(&event->pending_irq);
 	}
 
 	READ_ONCE(event->overflow_handler)(event, data, regs);
-- 
2.45.2



* [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 1/6] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
@ 2024-06-24 15:15 ` Sebastian Andrzej Siewior
  2024-07-01 12:28   ` Peter Zijlstra
  2024-10-28  8:30   ` Lai, Yi
  2024-06-24 15:15 ` [PATCH v4 3/6] perf: Shrink the size of the recursion counter Sebastian Andrzej Siewior
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

A signal is delivered by raising an irq_work, which works from any
context including NMI. The irq_work can be delayed if the architecture
does not provide an interrupt vector. In order not to lose a signal, the
signal is injected via task_work during event_sched_out().

Instead of going via irq_work, the signal could be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland instead of triggering irq_work. A dummy IRQ is
required in the NMI case to ensure the task_work is handled before
returning to userland. For this, irq_work is used. An alternative would
be to just raise an interrupt, as arch_send_call_function_single_ipi()
does.

During testing with `remove_on_exec' it became visible that the event
can be enqueued via NMI during execve(). The task_work must not be kept
because free_event() will complain later. Also, the new task will not
have a sighandler installed.

Queue the signal via task_work. Remove perf_event::pending_sigtrap and
use perf_event::pending_work instead. Raise irq_work in the NMI case for
a dummy interrupt. Remove the task_work if the event is freed.

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/perf_event.h |  3 +--
 kernel/events/core.c       | 36 +++++++++++++++---------------------
 2 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 393fb13733b02..ea0d82418d854 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
 	unsigned int			pending_wakeup;
 	unsigned int			pending_kill;
 	unsigned int			pending_disable;
-	unsigned int			pending_sigtrap;
 	unsigned long			pending_addr;	/* SIGTRAP */
 	struct irq_work			pending_irq;
 	struct callback_head		pending_task;
@@ -963,7 +962,7 @@ struct perf_event_context {
 	struct rcu_head			rcu_head;
 
 	/*
-	 * Sum (event->pending_sigtrap + event->pending_work)
+	 * Sum (event->pending_work + event->pending_work)
 	 *
 	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
 	 * that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 647abeeaeeb02..6256a9593c3da 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,17 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 		state = PERF_EVENT_STATE_OFF;
 	}
 
-	if (event->pending_sigtrap) {
-		event->pending_sigtrap = 0;
-		if (state != PERF_EVENT_STATE_OFF &&
-		    !event->pending_work &&
-		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
-			event->pending_work = 1;
-		} else {
-			local_dec(&event->ctx->nr_pending);
-		}
-	}
-
 	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
@@ -6787,11 +6776,6 @@ static void __perf_pending_irq(struct perf_event *event)
 	 * Yay, we hit home and are in the context of the event.
 	 */
 	if (cpu == smp_processor_id()) {
-		if (event->pending_sigtrap) {
-			event->pending_sigtrap = 0;
-			perf_sigtrap(event);
-			local_dec(&event->ctx->nr_pending);
-		}
 		if (event->pending_disable) {
 			event->pending_disable = 0;
 			perf_event_disable_local(event);
@@ -9735,18 +9719,28 @@ static int __perf_event_overflow(struct perf_event *event,
 
 		if (regs)
 			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
-		if (!event->pending_sigtrap) {
-			event->pending_sigtrap = pending_id;
+
+		if (!event->pending_work &&
+		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
+			event->pending_work = pending_id;
 			local_inc(&event->ctx->nr_pending);
 
 			event->pending_addr = 0;
 			if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
 				event->pending_addr = data->addr;
-			irq_work_queue(&event->pending_irq);
+			/*
+			 * The NMI path returns directly to userland. The
+			 * irq_work is raised as a dummy interrupt to ensure
+			 * regular return path to user is taken and task_work
+			 * is processed.
+			 */
+			if (in_nmi())
+				irq_work_queue(&event->pending_irq);
+
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
 			 * Should not be able to return to user space without
-			 * consuming pending_sigtrap; with exceptions:
+			 * consuming pending_work; with exceptions:
 			 *
 			 *  1. Where !exclude_kernel, events can overflow again
 			 *     in the kernel without returning to user space.
@@ -9756,7 +9750,7 @@ static int __perf_event_overflow(struct perf_event *event,
 			 *     To approximate progress (with false negatives),
 			 *     check 32-bit hash of the current IP.
 			 */
-			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+			WARN_ON_ONCE(event->pending_work != pending_id);
 		}
 	}
 
-- 
2.45.2



* [PATCH v4 3/6] perf: Shrink the size of the recursion counter.
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 1/6] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
@ 2024-06-24 15:15 ` Sebastian Andrzej Siewior
  2024-07-01 12:31   ` Peter Zijlstra
  2024-06-24 15:15 ` [PATCH v4 4/6] perf: Move swevent_htable::recursion into task_struct Sebastian Andrzej Siewior
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior

There are four recursion counters, one for each context. The type of the
counter is `int' but the counter is used as `bool' since it is only
incremented if zero.

Reduce the type of the recursion counter to an unsigned char and keep
the increment/decrement operations.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/events/callchain.c | 2 +-
 kernel/events/core.c      | 2 +-
 kernel/events/internal.h  | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392cf..ad57944b6c40e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -29,7 +29,7 @@ static inline size_t perf_callchain_entry__sizeof(void)
 				 sysctl_perf_event_max_contexts_per_stack));
 }
 
-static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
+static DEFINE_PER_CPU(u8, callchain_recursion[PERF_NR_CONTEXTS]);
 static atomic_t nr_callchain_events;
 static DEFINE_MUTEX(callchain_mutex);
 static struct callchain_cpus_entries *callchain_cpus_entries;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6256a9593c3da..f48ce05907042 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9781,7 +9781,7 @@ struct swevent_htable {
 	int				hlist_refcount;
 
 	/* Recursion avoidance in each contexts */
-	int				recursion[PERF_NR_CONTEXTS];
+	u8				recursion[PERF_NR_CONTEXTS];
 };
 
 static DEFINE_PER_CPU(struct swevent_htable, swevent_htable);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5150d5f84c033..f9a3244206b20 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,7 +208,7 @@ arch_perf_out_copy_user(void *dst, const void *src, unsigned long n)
 
 DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
 
-static inline int get_recursion_context(int *recursion)
+static inline int get_recursion_context(u8 *recursion)
 {
 	unsigned char rctx = interrupt_context_level();
 
@@ -221,7 +221,7 @@ static inline int get_recursion_context(int *recursion)
 	return rctx;
 }
 
-static inline void put_recursion_context(int *recursion, int rctx)
+static inline void put_recursion_context(u8 *recursion, int rctx)
 {
 	barrier();
 	recursion[rctx]--;
-- 
2.45.2



* [PATCH v4 4/6] perf: Move swevent_htable::recursion into task_struct.
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
                   ` (2 preceding siblings ...)
  2024-06-24 15:15 ` [PATCH v4 3/6] perf: Shrink the size of the recursion counter Sebastian Andrzej Siewior
@ 2024-06-24 15:15 ` Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 5/6] perf: Don't disable preemption in perf_pending_task() Sebastian Andrzej Siewior
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior

The swevent_htable::recursion counter is used to avoid recursion, i.e.
to avoid creating a swevent while another event is already being
processed. The counter is per-CPU and preemption must be disabled to
have a stable counter. perf_pending_task() disables preemption to access
the counter and then sends the signal. This is problematic on PREEMPT_RT
because sending a signal uses a spinlock_t which must not be acquired in
atomic context on PREEMPT_RT because it becomes a sleeping lock there.
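
As a minimal illustration of the constraint (a sketch only, not code
from this series; the lock name is made up):

        /*
         * Illustrative only: on PREEMPT_RT a spinlock_t is backed by a
         * sleeping rtmutex, so acquiring one inside a preempt-disabled
         * (atomic) section triggers a "sleeping function called from
         * invalid context" splat.
         */
        preempt_disable();
        spin_lock(&some_lock);  /* sleeps on PREEMPT_RT -> invalid */
        /* signal delivery itself takes sighand->siglock, a spinlock_t */
        spin_unlock(&some_lock);
        preempt_enable();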

The atomic context can be avoided by moving the counter into the
task_struct. There is a 4-byte hole between futex_state (usually
enabled) and the following perf pointer (perf_event_ctxp). After the
recursion counter lost some weight in the previous patch, it fits
perfectly.

Move swevent_htable::recursion into task_struct.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/perf_event.h |  6 ------
 include/linux/sched.h      |  7 +++++++
 kernel/events/core.c       | 13 +++----------
 kernel/events/internal.h   |  2 +-
 4 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ea0d82418d854..99a7ea1d29ed5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -970,12 +970,6 @@ struct perf_event_context {
 	local_t				nr_pending;
 };
 
-/*
- * Number of contexts where an event can trigger:
- *	task, softirq, hardirq, nmi.
- */
-#define PERF_NR_CONTEXTS	4
-
 struct perf_cpu_pmu_context {
 	struct perf_event_pmu_context	epc;
 	struct perf_event_pmu_context	*task_epc;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6d..afb1087f5831b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -734,6 +734,12 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/*
+ * Number of contexts where an event can trigger:
+ *      task, softirq, hardirq, nmi.
+ */
+#define PERF_NR_CONTEXTS	4
+
 struct wake_q_node {
 	struct wake_q_node *next;
 };
@@ -1256,6 +1262,7 @@ struct task_struct {
 	unsigned int			futex_state;
 #endif
 #ifdef CONFIG_PERF_EVENTS
+	u8				perf_recursion[PERF_NR_CONTEXTS];
 	struct perf_event_context	*perf_event_ctxp;
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f48ce05907042..fc9a78e1fb4aa 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9779,11 +9779,7 @@ struct swevent_htable {
 	struct swevent_hlist		*swevent_hlist;
 	struct mutex			hlist_mutex;
 	int				hlist_refcount;
-
-	/* Recursion avoidance in each contexts */
-	u8				recursion[PERF_NR_CONTEXTS];
 };
-
 static DEFINE_PER_CPU(struct swevent_htable, swevent_htable);
 
 /*
@@ -9981,17 +9977,13 @@ DEFINE_PER_CPU(struct pt_regs, __perf_regs[4]);
 
 int perf_swevent_get_recursion_context(void)
 {
-	struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
-
-	return get_recursion_context(swhash->recursion);
+	return get_recursion_context(current->perf_recursion);
 }
 EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);
 
 void perf_swevent_put_recursion_context(int rctx)
 {
-	struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
-
-	put_recursion_context(swhash->recursion, rctx);
+	put_recursion_context(current->perf_recursion, rctx);
 }
 
 void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
@@ -13658,6 +13650,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 {
 	int ret;
 
+	memset(child->perf_recursion, 0, sizeof(child->perf_recursion));
 	child->perf_event_ctxp = NULL;
 	mutex_init(&child->perf_event_mutex);
 	INIT_LIST_HEAD(&child->perf_event_list);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index f9a3244206b20..f0daaa6f2a33b 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -221,7 +221,7 @@ static inline int get_recursion_context(u8 *recursion)
 	return rctx;
 }
 
-static inline void put_recursion_context(u8 *recursion, int rctx)
+static inline void put_recursion_context(u8 *recursion, unsigned char rctx)
 {
 	barrier();
 	recursion[rctx]--;
-- 
2.45.2



* [PATCH v4 5/6] perf: Don't disable preemption in perf_pending_task().
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
                   ` (3 preceding siblings ...)
  2024-06-24 15:15 ` [PATCH v4 4/6] perf: Move swevent_htable::recursion into task_struct Sebastian Andrzej Siewior
@ 2024-06-24 15:15 ` Sebastian Andrzej Siewior
  2024-06-24 15:15 ` [PATCH v4 6/6] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
  2024-06-25 13:42 ` [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Marco Elver
  6 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior

perf_pending_task() is invoked in task context and disables preemption
because perf_swevent_get_recursion_context() used to access per-CPU
variables. The other reason is to create an RCU read section while
accessing the perf_event.

The recursion counter is no longer a per-CPU counter so disabling
preemption is no longer required. The RCU section is still needed and
must be created explicitly.

Replace the preemption-disable section with an explicit RCU read
section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/events/core.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fc9a78e1fb4aa..f75aa9f14c979 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5208,10 +5208,9 @@ static void perf_pending_task_sync(struct perf_event *event)
 	}
 
 	/*
-	 * All accesses related to the event are within the same
-	 * non-preemptible section in perf_pending_task(). The RCU
-	 * grace period before the event is freed will make sure all
-	 * those accesses are complete by then.
+	 * All accesses related to the event are within the same RCU section in
+	 * perf_pending_task(). The RCU grace period before the event is freed
+	 * will make sure all those accesses are complete by then.
 	 */
 	rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
 }
@@ -6842,7 +6841,7 @@ static void perf_pending_task(struct callback_head *head)
 	 * critical section as the ->pending_work reset. See comment in
 	 * perf_pending_task_sync().
 	 */
-	preempt_disable_notrace();
+	rcu_read_lock();
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -6855,10 +6854,10 @@ static void perf_pending_task(struct callback_head *head)
 		local_dec(&event->ctx->nr_pending);
 		rcuwait_wake_up(&event->pending_work_wait);
 	}
+	rcu_read_unlock();
 
 	if (rctx >= 0)
 		perf_swevent_put_recursion_context(rctx);
-	preempt_enable_notrace();
 }
 
 #ifdef CONFIG_GUEST_PERF_EVENTS
-- 
2.45.2



* [PATCH v4 6/6] perf: Split __perf_pending_irq() out of perf_pending_irq()
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
                   ` (4 preceding siblings ...)
  2024-06-24 15:15 ` [PATCH v4 5/6] perf: Don't disable preemption in perf_pending_task() Sebastian Andrzej Siewior
@ 2024-06-24 15:15 ` Sebastian Andrzej Siewior
  2024-06-25 13:42 ` [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Marco Elver
  6 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-24 15:15 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel
  Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Frederic Weisbecker, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Sebastian Andrzej Siewior, Arnaldo Carvalho de Melo

perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up
while the latter disables perf-events.

While perf_pending_irq() is an irq_work, on PREEMPT_RT its callback is
invoked in thread context. This is needed because all the waking
functions (wake_up_all(), kill_fasync()) acquire sleeping locks which
must not be used with disabled interrupts.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.

Split the functionality based on perf_event::pending_disable into a
separate irq_work named `pending_disable_irq' and invoke it in hardirq
context on PREEMPT_RT. Rename the split-out callback to
perf_pending_disable().

Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 31 +++++++++++++++++++++++--------
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 99a7ea1d29ed5..65ece0d5b4b6d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -783,6 +783,7 @@ struct perf_event {
 	unsigned int			pending_disable;
 	unsigned long			pending_addr;	/* SIGTRAP */
 	struct irq_work			pending_irq;
+	struct irq_work			pending_disable_irq;
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
 	struct rcuwait			pending_work_wait;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f75aa9f14c979..8bba63ea9c686 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2451,7 +2451,7 @@ static void __perf_event_disable(struct perf_event *event,
  * hold the top-level event's child_mutex, so any descendant that
  * goes to exit will block in perf_event_exit_event().
  *
- * When called from perf_pending_irq it's OK because event->ctx
+ * When called from perf_pending_disable it's OK because event->ctx
  * is the current context on this CPU and preemption is disabled,
  * hence we can't get into perf_event_task_sched_out for this context.
  */
@@ -2491,7 +2491,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
 void perf_event_disable_inatomic(struct perf_event *event)
 {
 	event->pending_disable = 1;
-	irq_work_queue(&event->pending_irq);
+	irq_work_queue(&event->pending_disable_irq);
 }
 
 #define MAX_INTERRUPTS (~0ULL)
@@ -5218,6 +5218,7 @@ static void perf_pending_task_sync(struct perf_event *event)
 static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending_irq);
+	irq_work_sync(&event->pending_disable_irq);
 	perf_pending_task_sync(event);
 
 	unaccount_event(event);
@@ -6760,7 +6761,7 @@ static void perf_sigtrap(struct perf_event *event)
 /*
  * Deliver the pending work in-event-context or follow the context.
  */
-static void __perf_pending_irq(struct perf_event *event)
+static void __perf_pending_disable(struct perf_event *event)
 {
 	int cpu = READ_ONCE(event->oncpu);
 
@@ -6798,11 +6799,26 @@ static void __perf_pending_irq(struct perf_event *event)
 	 *				  irq_work_queue(); // FAILS
 	 *
 	 *  irq_work_run()
-	 *    perf_pending_irq()
+	 *    perf_pending_disable()
 	 *
 	 * But the event runs on CPU-B and wants disabling there.
 	 */
-	irq_work_queue_on(&event->pending_irq, cpu);
+	irq_work_queue_on(&event->pending_disable_irq, cpu);
+}
+
+static void perf_pending_disable(struct irq_work *entry)
+{
+	struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
+	int rctx;
+
+	/*
+	 * If we 'fail' here, that's OK, it means recursion is already disabled
+	 * and we won't recurse 'further'.
+	 */
+	rctx = perf_swevent_get_recursion_context();
+	__perf_pending_disable(event);
+	if (rctx >= 0)
+		perf_swevent_put_recursion_context(rctx);
 }
 
 static void perf_pending_irq(struct irq_work *entry)
@@ -6825,8 +6841,6 @@ static void perf_pending_irq(struct irq_work *entry)
 		perf_event_wakeup(event);
 	}
 
-	__perf_pending_irq(event);
-
 	if (rctx >= 0)
 		perf_swevent_put_recursion_context(rctx);
 }
@@ -9734,7 +9748,7 @@ static int __perf_event_overflow(struct perf_event *event,
 			 * is processed.
 			 */
 			if (in_nmi())
-				irq_work_queue(&event->pending_irq);
+				irq_work_queue(&event->pending_disable_irq);
 
 		} else if (event->attr.exclude_kernel && valid_sample) {
 			/*
@@ -11972,6 +11986,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	init_waitqueue_head(&event->waitq);
 	init_irq_work(&event->pending_irq, perf_pending_irq);
+	event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
 	init_task_work(&event->pending_task, perf_pending_task);
 	rcuwait_init(&event->pending_work_wait);
 
-- 
2.45.2



* Re: [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
  2024-06-24 15:15 [PATCH v4 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
                   ` (5 preceding siblings ...)
  2024-06-24 15:15 ` [PATCH v4 6/6] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
@ 2024-06-25 13:42 ` Marco Elver
  6 siblings, 0 replies; 32+ messages in thread
From: Marco Elver @ 2024-06-25 13:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Mark Rutland, Namhyung Kim, Peter Zijlstra,
	Thomas Gleixner

On Mon, 24 Jun 2024 at 17:27, Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> Hi,
>
> Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> the signal gets delayed until event_sched_out() which then uses
> task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> signal is delivered with disabled preemption.
>
> While looking at this, I also stumbled upon __perf_pending_irq() which
> requires disabled interrupts but this is not the case on PREEMPT_RT.
>
> This series aim to address both issues while not introducing a new issue
> at the same time ;)
> Any testing is appreciated.

Tested-by: Marco Elver <elver@google.com>

Ran lots of concurrent copies of the "sigtrap_threads" and
"remove_on_exec" tests (with lockdep on), and it all survived. Fuzzer
is still running but hasn't found anything relevant yet.


> v3…v4: https://lore.kernel.org/all/20240322065208.60456-1-bigeasy@linutronix.de/
>    - Rebased on top of Frederic's series
>       (https://lore.kernel.org/all/20240621091601.18227-1-frederic@kernel.org)
>    - Frederick pointed out that perf_pending_task() needs to
>      perf_swevent_get_recursion_context() in order not to recurse if
>      something within perf_swevent_.*_recursion_context() triggers a
>      software event. To address this, the counters have been moved to
>      the task_struct (#3 + #4) and preemt_disable() has been replaced
>      with a RCU-read lock (#5).
>    - The remaning logic same that means the event is pushed to task-work
>      instead of delivering from IRQ-work. The series was tested with
>      remove_on_exec as suggested by Marco Elver: On PREEMPT_RT a single
>      invocation passes, 100 parallel invocations report (for some)
>      unexpected SIGTRAPs and timeouts. This also observed on !RT
>      (without the series) with a higher task-count.
>
> v2…v3: https://lore.kernel.org/all/20240312180814.3373778-1-bigeasy@linutronix.de/
>     - Marco suggested to add a few comments
>       - Added a comment to __perf_event_overflow() to explain why irq_work
>         is raised in the in_nmi() case.
>       - Added a comment to perf_event_exit_event() to explain why the
>         pending event is deleted.
>
> v1…v2: https://lore.kernel.org/all/20240308175810.2894694-1-bigeasy@linutronix.de/
>     - Marco pointed me to the testsuite that showed two problems:
>       - Delayed task_work from NMI / missing events.
>         Fixed by triggering dummy irq_work to enforce an interrupt for
>         the exit-to-userland path which checks task_work
>       - Increased ref-count on clean up/ during exec.
>         Mostly addressed by the former change. There is still a window
>         if the NMI occurs during execve(). This is addressed by removing
>         the task_work before free_event().
>       The testsuite (remove_on_exec) fails sometimes if the event/
>       SIGTRAP is sent before the sighandler is installed.
>
> Sebastian
>


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-06-24 15:15 ` [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
@ 2024-07-01 12:28   ` Peter Zijlstra
  2024-07-01 13:27     ` Sebastian Andrzej Siewior
  2024-10-28  8:30   ` Lai, Yi
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2024-07-01 12:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Thomas Gleixner, Arnaldo Carvalho de Melo

On Mon, Jun 24, 2024 at 05:15:15PM +0200, Sebastian Andrzej Siewior wrote:
> @@ -9735,18 +9719,28 @@ static int __perf_event_overflow(struct perf_event *event,
>  
>  		if (regs)
>  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> -		if (!event->pending_sigtrap) {
> -			event->pending_sigtrap = pending_id;
> +
> +		if (!event->pending_work &&
> +		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
> +			event->pending_work = pending_id;
>  			local_inc(&event->ctx->nr_pending);

task_work_add() isn't NMI safe, at the very least the
kasan_record_aux_stack() thing needs to go. But the whole
set_notify_resume() thing also needs an audit.
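
For reference, the relevant shape of task_work_add() (heavily
simplified and paraphrased, not verbatim):

        int task_work_add(struct task_struct *task, struct callback_head *work,
                          enum task_work_notify_mode notify)
        {
                /* records a stack trace for KASAN; not NMI safe */
                kasan_record_aux_stack(work);

                /* ... cmpxchg() loop linking @work onto task->task_works ... */

                switch (notify) {
                case TWA_RESUME:
                        set_notify_resume(task);  /* the part needing the audit */
                        break;
                /* ... other notify modes ... */
                }
                return 0;
        }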


* Re: [PATCH v4 3/6] perf: Shrink the size of the recursion counter.
  2024-06-24 15:15 ` [PATCH v4 3/6] perf: Shrink the size of the recursion counter Sebastian Andrzej Siewior
@ 2024-07-01 12:31   ` Peter Zijlstra
  2024-07-01 12:56     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2024-07-01 12:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Thomas Gleixner

On Mon, Jun 24, 2024 at 05:15:16PM +0200, Sebastian Andrzej Siewior wrote:
> There are four recursion counter, one for each context. The type of the
> counter is `int' but the counter is used as `bool' since it is only
> incremented if zero.
> 
> Reduce the type of the recursion counter to an unsigned char, keep the
> increment/ decrement operation.

Does this actually matter? Aren't u8 memops encoded by longer
instructions etc..


* Re: [PATCH v4 3/6] perf: Shrink the size of the recursion counter.
  2024-07-01 12:31   ` Peter Zijlstra
@ 2024-07-01 12:56     ` Sebastian Andrzej Siewior
  2024-07-01 13:10       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-01 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Thomas Gleixner

On 2024-07-01 14:31:37 [+0200], Peter Zijlstra wrote:
> On Mon, Jun 24, 2024 at 05:15:16PM +0200, Sebastian Andrzej Siewior wrote:
> > There are four recursion counter, one for each context. The type of the
> > counter is `int' but the counter is used as `bool' since it is only
> > incremented if zero.
> > 
> > Reduce the type of the recursion counter to an unsigned char, keep the
> > increment/ decrement operation.
> 
> Does this actually matter? Aren't u8 memops encoded by longer
> instructions etc..

The goal here isn't to reduce the opcodes but to add it to task_struct
without making it larger by filling a hole.

But since you made me look at assembly:
old:
     316b:       65 48 8b 15 00 00 00    mov    %gs:0x0(%rip),%rdx        # 3173 <perf_swevent_get_recursion_context+0x33>
     3173:       1c ff                   sbb    $0xff,%al
     3175:       48 0f be c8             movsbq %al,%rcx
     3179:       48 8d 94 8a 00 00 00    lea    0x0(%rdx,%rcx,4),%rdx
     3180:       00
                         317d: R_X86_64_32S      .data..percpu+0x4c
     3181:       8b 0a                   mov    (%rdx),%ecx
     3183:       85 c9                   test   %ecx,%ecx
     3185:       75 0e                   jne    3195 <perf_swevent_get_recursion_context+0x55>
     3187:       c7 02 01 00 00 00       movl   $0x1,(%rdx)
^^^
     318d:       0f be c0                movsbl %al,%eax

new:
     2ff8:       1c ff                   sbb    $0xff,%al
     2ffa:       81 e2 00 01 ff 00       and    $0xff0100,%edx
     3000:       83 fa 01                cmp    $0x1,%edx
     3003:       1c ff                   sbb    $0xff,%al
     3005:       48 0f be d0             movsbq %al,%rdx
     3009:       48 8d 94 11 00 00 00    lea    0x0(%rcx,%rdx,1),%rdx
     3010:       00
                         300d: R_X86_64_32S      .data..percpu+0x4c
 
     3011:       80 3a 00                cmpb   $0x0,(%rdx)
     3014:       75 0b                   jne    3021 <perf_swevent_get_recursion_context+0x51>
     3016:       c6 02 01                movb   $0x1,(%rdx)
^^^
     3019:       0f be c0                movsbl %al,%eax
     301c:       e9 00 00 00 00          jmp    3021 <perf_swevent_get_recursion_context+0x51>

So we even save a few bytes. We could avoid the "movsbl" at 3019 by
making the return type `unsigned char' ;)
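
i.e. roughly this signature change (untested sketch; the callers that
keep the return value in an int would need a look as well):

        -static inline int get_recursion_context(u8 *recursion)
        +static inline unsigned char get_recursion_context(u8 *recursion)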

Sebastian


* Re: [PATCH v4 3/6] perf: Shrink the size of the recursion counter.
  2024-07-01 12:56     ` Sebastian Andrzej Siewior
@ 2024-07-01 13:10       ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2024-07-01 13:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Thomas Gleixner

On Mon, Jul 01, 2024 at 02:56:43PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-07-01 14:31:37 [+0200], Peter Zijlstra wrote:
> > On Mon, Jun 24, 2024 at 05:15:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > There are four recursion counter, one for each context. The type of the
> > > counter is `int' but the counter is used as `bool' since it is only
> > > incremented if zero.
> > > 
> > > Reduce the type of the recursion counter to an unsigned char, keep the
> > > increment/ decrement operation.
> > 
> > Does this actually matter? Aren't u8 memops encoded by longer
> > instructions etc..
> 
> The goal here isn't to reduce the opcodes but to add it to task_struct
> without making it larger by filling a hole.

Changelog failed to mention this crucial fact.


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-07-01 12:28   ` Peter Zijlstra
@ 2024-07-01 13:27     ` Sebastian Andrzej Siewior
  2024-07-02  9:01       ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-01 13:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Thomas Gleixner, Arnaldo Carvalho de Melo

On 2024-07-01 14:28:29 [+0200], Peter Zijlstra wrote:
> On Mon, Jun 24, 2024 at 05:15:15PM +0200, Sebastian Andrzej Siewior wrote:
> > @@ -9735,18 +9719,28 @@ static int __perf_event_overflow(struct perf_event *event,
> >  
> >  		if (regs)
> >  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > -		if (!event->pending_sigtrap) {
> > -			event->pending_sigtrap = pending_id;
> > +
> > +		if (!event->pending_work &&
> > +		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
> > +			event->pending_work = pending_id;
> >  			local_inc(&event->ctx->nr_pending);
> 
> task_work_add() isn't NMI safe, at the very least the
> kasan_record_aux_stack() thing needs to go. But the whole
> set_notify_resume() thing also needs an audit.

oh. I just peeked and set_notify_resume() wouldn't survive the audit.
Would you mind adding task_work_add_nmi(), which is task_work_add(...,
TWA_NONE) without kasan_record_aux_stack()? The signal part would need
to be moved to the irq_work() below in the NMI case.

Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-07-01 13:27     ` Sebastian Andrzej Siewior
@ 2024-07-02  9:01       ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2024-07-02  9:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Thomas Gleixner, Arnaldo Carvalho de Melo

On Mon, Jul 01, 2024 at 03:27:25PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-07-01 14:28:29 [+0200], Peter Zijlstra wrote:
> > On Mon, Jun 24, 2024 at 05:15:15PM +0200, Sebastian Andrzej Siewior wrote:
> > > @@ -9735,18 +9719,28 @@ static int __perf_event_overflow(struct perf_event *event,
> > >  
> > >  		if (regs)
> > >  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > > -		if (!event->pending_sigtrap) {
> > > -			event->pending_sigtrap = pending_id;
> > > +
> > > +		if (!event->pending_work &&
> > > +		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
> > > +			event->pending_work = pending_id;
> > >  			local_inc(&event->ctx->nr_pending);
> > 
> > task_work_add() isn't NMI safe, at the very least the
> > kasan_record_aux_stack() thing needs to go. But the whole
> > set_notify_resume() thing also needs an audit.
> 
> oh. I just peeked and set_notify_resume() wouldn't survive the audit.
> Would you mind adding task_work_add_nmi() which is task_work_add(,
> TWA_NONE) without kasan_record_aux_stack()? The signal part would need
> to be moved the irq_work() below in the NMI case.

Perhaps add TWA_NMI_CURRENT or somesuch. That would be descriptive and
the value can be used to elide the call to kasan*() and possibly also
include that self-IPI for the NMI-from-user case where we don't have a
return-to-user path.

It can also verify task == current, which ensures people don't try and
use it for other things.
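
Something along these lines, perhaps (hypothetical sketch of the notify
switch in task_work_add(); the flag name, the per-CPU irq_work and the
conditional kasan call are all assumptions, nothing here is settled):

        switch (notify) {
        case TWA_NMI_CURRENT:
                /*
                 * Hypothetical mode: only valid for task == current, may
                 * be called from NMI. kasan_record_aux_stack() would be
                 * skipped for this mode, and a self-IPI (e.g. an irq_work)
                 * makes sure the NMI-from-user case still goes through the
                 * return-to-user path that runs the task_work.
                 */
                if (WARN_ON_ONCE(task != current))
                        return -EINVAL;
                irq_work_queue(this_cpu_ptr(&nmi_resume_work)); /* placeholder */
                break;
        case TWA_RESUME:
                set_notify_resume(task);
                break;
        /* ... */
        }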


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-06-24 15:15 ` [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
  2024-07-01 12:28   ` Peter Zijlstra
@ 2024-10-28  8:30   ` Lai, Yi
  2024-10-28 12:21     ` Frederic Weisbecker
  1 sibling, 1 reply; 32+ messages in thread
From: Lai, Yi @ 2024-10-28  8:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira,
	Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

Hi Sebastian Andrzej Siewior,

Greetings!

I used Syzkaller and found that there is an "INFO: task hung in _free_event" issue in v6.12-rc4.

After bisection and the first bad commit is:
"
c5d93d23a260 perf: Enqueue SIGTRAP always via task_work.
"

All detailed info can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/241027_012744__free_event
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/241027_012744__free_event/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/241027_012744__free_event/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/241027_012744__free_event/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/241027_012744__free_event/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/241027_012744__free_event/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/241027_012744__free_event/bzImage_42f7652d3eb527d03665b09edac47f85fb600924
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/241027_012744__free_event/42f7652d3eb527d03665b09edac47f85fb600924_dmesg.log

"
[  300.651268] INFO: task repro:671 blocked for more than 147 seconds.
[  300.651706]       Not tainted 6.12.0-rc4-42f7652d3eb5+ #1
[  300.652006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  300.652430] task:repro           state:D stack:0     pid:671   tgid:671   ppid:670    flags:0x00004002
[  300.652939] Call Trace:
[  300.653088]  <TASK>
[  300.653221]  __schedule+0xe13/0x33a0
[  300.653474]  ? __pfx___schedule+0x10/0x10
[  300.653704]  ? lock_release+0x441/0x870
[  300.653946]  ? __pfx_lock_release+0x10/0x10
[  300.654184]  ? trace_lock_acquire+0x139/0x1b0
[  300.654439]  ? lock_acquire+0x80/0xb0
[  300.654651]  ? schedule+0x216/0x3f0
[  300.654859]  schedule+0xf6/0x3f0
[  300.655083]  _free_event+0x531/0x14c0
[  300.655317]  perf_event_release_kernel+0x648/0x870
[  300.655597]  ? __pfx_perf_event_release_kernel+0x10/0x10
[  300.655899]  ? trace_hardirqs_on+0x51/0x60
[  300.656176]  ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30
[  300.656474]  ? __pfx_perf_release+0x10/0x10
[  300.656697]  perf_release+0x3a/0x50
[  300.656916]  __fput+0x414/0xb60
[  300.657163]  ____fput+0x22/0x30
[  300.657335]  task_work_run+0x19c/0x2b0
[  300.657548]  ? __pfx_task_work_run+0x10/0x10
[  300.657780]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  300.658101]  ? switch_task_namespaces+0xc6/0x110
[  300.658390]  do_exit+0xb19/0x2a30
[  300.658577]  ? lockdep_hardirqs_on+0x89/0x110
[  300.658822]  ? __pfx_do_exit+0x10/0x10
[  300.659088]  do_group_exit+0xe4/0x2c0
[  300.659338]  get_signal+0x2279/0x24c0
[  300.659554]  ? __pfx_get_signal+0x10/0x10
[  300.659767]  ? __might_fault+0xf1/0x1b0
[  300.659984]  arch_do_signal_or_restart+0x8e/0x7d0
[  300.660239]  ? __pfx_arch_do_signal_or_restart+0x10/0x10
[  300.660524]  ? __this_cpu_preempt_check+0x21/0x30
[  300.660772]  ? syscall_exit_to_user_mode+0x10f/0x200
[  300.661035]  syscall_exit_to_user_mode+0x144/0x200
[  300.661288]  do_syscall_64+0x79/0x140
[  300.661489]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  300.661760] RIP: 0033:0x7f374dc3ee5d
[  300.661953] RSP: 002b:00007ffd55371398 EFLAGS: 00000206 ORIG_RAX: 000000000000012a
[  300.662341] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 00007f374dc3ee5d
[  300.662702] RDX: 00000000ffffffff RSI: 0000000000000000 RDI: 0000000020000240
[  300.663087] RBP: 00007ffd553713a0 R08: 0000000000000008 R09: 00007ffd553713d0
[  300.663455] R10: 00000000ffffffff R11: 0000000000000206 R12: 00007ffd553714f8
[  300.663827] R13: 0000000000401b49 R14: 0000000000403e08 R15: 00007f374de9e000
[  300.664212]  </TASK>
[  300.664338]
[  300.664338] Showing all locks held in the system:
[  300.664670] 1 lock held by khungtaskd/33:
[  300.664887]  #0: ffffffff8705ca00 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
[  300.665390]
[  300.665481] =============================================
"

I hope you find it useful.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem, or if you
already have a reproduction environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After logging in to the vm (virtual machine) successfully, you can transfer the
reproducer binary to the vm as below and reproduce the problem in the vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           // x should be equal to or less than the number of CPUs your PC has

Put the bzImage file into the above start3.sh to load the target kernel in the vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 

On Mon, Jun 24, 2024 at 05:15:15PM +0200, Sebastian Andrzej Siewior wrote:
> A signal is delivered by raising irq_work() which works from any context
> including NMI. irq_work() can be delayed if the architecture does not
> provide an interrupt vector. In order not to lose a signal, the signal
> is injected via task_work during event_sched_out().
> 
> Instead going via irq_work, the signal could be added directly via
> task_work. The signal is sent to current and can be enqueued on its
> return path to userland instead of triggering irq_work. A dummy IRQ is
> required in the NMI case to ensure the task_work is handled before
> returning to user land. For this irq_work is used. An alternative would
> be just raising an interrupt like arch_send_call_function_single_ipi().
> 
> During testing with `remove_on_exec' it become visible that the event
> can be enqueued via NMI during execve(). The task_work must not be kept
> because free_event() will complain later. Also the new task will not
> have a sighandler installed.
> 
> Queue signal via task_work. Remove perf_event::pending_sigtrap and
> and use perf_event::pending_work instead. Raise irq_work in the NMI case
> for a dummy interrupt. Remove the task_work if the event is freed.
> 
> Tested-by: Marco Elver <elver@google.com>
> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/perf_event.h |  3 +--
>  kernel/events/core.c       | 36 +++++++++++++++---------------------
>  2 files changed, 16 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 393fb13733b02..ea0d82418d854 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -781,7 +781,6 @@ struct perf_event {
>  	unsigned int			pending_wakeup;
>  	unsigned int			pending_kill;
>  	unsigned int			pending_disable;
> -	unsigned int			pending_sigtrap;
>  	unsigned long			pending_addr;	/* SIGTRAP */
>  	struct irq_work			pending_irq;
>  	struct callback_head		pending_task;
> @@ -963,7 +962,7 @@ struct perf_event_context {
>  	struct rcu_head			rcu_head;
>  
>  	/*
> -	 * Sum (event->pending_sigtrap + event->pending_work)
> +	 * Sum (event->pending_work + event->pending_work)
>  	 *
>  	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
>  	 * that until the signal is delivered.
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 647abeeaeeb02..6256a9593c3da 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2283,17 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  
> -	if (event->pending_sigtrap) {
> -		event->pending_sigtrap = 0;
> -		if (state != PERF_EVENT_STATE_OFF &&
> -		    !event->pending_work &&
> -		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
> -			event->pending_work = 1;
> -		} else {
> -			local_dec(&event->ctx->nr_pending);
> -		}
> -	}
> -
>  	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
> @@ -6787,11 +6776,6 @@ static void __perf_pending_irq(struct perf_event *event)
>  	 * Yay, we hit home and are in the context of the event.
>  	 */
>  	if (cpu == smp_processor_id()) {
> -		if (event->pending_sigtrap) {
> -			event->pending_sigtrap = 0;
> -			perf_sigtrap(event);
> -			local_dec(&event->ctx->nr_pending);
> -		}
>  		if (event->pending_disable) {
>  			event->pending_disable = 0;
>  			perf_event_disable_local(event);
> @@ -9735,18 +9719,28 @@ static int __perf_event_overflow(struct perf_event *event,
>  
>  		if (regs)
>  			pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> -		if (!event->pending_sigtrap) {
> -			event->pending_sigtrap = pending_id;
> +
> +		if (!event->pending_work &&
> +		    !task_work_add(current, &event->pending_task, TWA_RESUME)) {
> +			event->pending_work = pending_id;
>  			local_inc(&event->ctx->nr_pending);
>  
>  			event->pending_addr = 0;
>  			if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
>  				event->pending_addr = data->addr;
> -			irq_work_queue(&event->pending_irq);
> +			/*
> +			 * The NMI path returns directly to userland. The
> +			 * irq_work is raised as a dummy interrupt to ensure
> +			 * regular return path to user is taken and task_work
> +			 * is processed.
> +			 */
> +			if (in_nmi())
> +				irq_work_queue(&event->pending_irq);
> +
>  		} else if (event->attr.exclude_kernel && valid_sample) {
>  			/*
>  			 * Should not be able to return to user space without
> -			 * consuming pending_sigtrap; with exceptions:
> +			 * consuming pending_work; with exceptions:
>  			 *
>  			 *  1. Where !exclude_kernel, events can overflow again
>  			 *     in the kernel without returning to user space.
> @@ -9756,7 +9750,7 @@ static int __perf_event_overflow(struct perf_event *event,
>  			 *     To approximate progress (with false negatives),
>  			 *     check 32-bit hash of the current IP.
>  			 */
> -			WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> +			WARN_ON_ONCE(event->pending_work != pending_id);
>  		}
>  	}
>  
> -- 
> 2.45.2
> 


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-10-28  8:30   ` Lai, Yi
@ 2024-10-28 12:21     ` Frederic Weisbecker
  2024-10-29 17:21       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-10-28 12:21 UTC (permalink / raw)
  To: Lai, Yi
  Cc: Sebastian Andrzej Siewior, linux-perf-users, linux-kernel,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

Le Mon, Oct 28, 2024 at 04:30:26PM +0800, Lai, Yi a écrit :
> [  300.651268] INFO: task repro:671 blocked for more than 147 seconds.
> [  300.651706]       Not tainted 6.12.0-rc4-42f7652d3eb5+ #1
> [  300.652006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  300.652430] task:repro           state:D stack:0     pid:671   tgid:671   ppid:670    flags:0x00004002
> [  300.652939] Call Trace:
> [  300.653088]  <TASK>
> [  300.653221]  __schedule+0xe13/0x33a0
> [  300.653474]  ? __pfx___schedule+0x10/0x10
> [  300.653704]  ? lock_release+0x441/0x870
> [  300.653946]  ? __pfx_lock_release+0x10/0x10
> [  300.654184]  ? trace_lock_acquire+0x139/0x1b0
> [  300.654439]  ? lock_acquire+0x80/0xb0
> [  300.654651]  ? schedule+0x216/0x3f0
> [  300.654859]  schedule+0xf6/0x3f0
> [  300.655083]  _free_event+0x531/0x14c0
> [  300.655317]  perf_event_release_kernel+0x648/0x870
> [  300.655597]  ? __pfx_perf_event_release_kernel+0x10/0x10
> [  300.655899]  ? trace_hardirqs_on+0x51/0x60
> [  300.656176]  ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30
> [  300.656474]  ? __pfx_perf_release+0x10/0x10
> [  300.656697]  perf_release+0x3a/0x50
> [  300.656916]  __fput+0x414/0xb60
> [  300.657163]  ____fput+0x22/0x30
> [  300.657335]  task_work_run+0x19c/0x2b0

Ah the perf_pending_task work is pending but perf_pending_task_sync()
fails to cancel there:

	/*
	 * If the task is queued to the current task's queue, we
	 * obviously can't wait for it to complete. Simply cancel it.
	 */
	if (task_work_cancel(current, head)) {
		event->pending_work = 0;
		local_dec(&event->ctx->nr_no_switch_fast);
		return;
	}

And that's because the work is no longer on the task work list in
task->task_works. Instead it's on the executing list inside
task_work_run(). That is a blind spot for task_work_cancel() when the
current task is already running its task works, which it is here
since it's running the delayed fput work.

Something like this untested?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 449dd64ed9ac..035580fa2c81 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1163,6 +1163,7 @@ struct task_struct {
 	unsigned int			sas_ss_flags;
 
 	struct callback_head		*task_works;
+	struct callback_head		*task_works_running;
 
 #ifdef CONFIG_AUDIT
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index cf5e7e891a77..fdd70f09a7f0 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -33,6 +33,7 @@ struct callback_head *task_work_cancel_match(struct task_struct *task,
 	bool (*match)(struct callback_head *, void *data), void *data);
 struct callback_head *task_work_cancel_func(struct task_struct *, task_work_func_t);
 bool task_work_cancel(struct task_struct *task, struct callback_head *cb);
+bool task_work_cancel_current(struct callback_head *cb);
 void task_work_run(void);
 
 static inline void exit_task_work(struct task_struct *task)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e3589c4287cb..1b15f3c83595 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5305,7 +5305,7 @@ static void perf_pending_task_sync(struct perf_event *event)
 	 * If the task is queued to the current task's queue, we
 	 * obviously can't wait for it to complete. Simply cancel it.
 	 */
-	if (task_work_cancel(current, head)) {
+	if (task_work_cancel_current(head)) {
 		event->pending_work = 0;
 		local_dec(&event->ctx->nr_no_switch_fast);
 		return;
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..1b898701d888 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2450,6 +2450,7 @@ __latent_entropy struct task_struct *copy_process(
 
 	p->pdeath_signal = 0;
 	p->task_works = NULL;
+	p->task_works_running = NULL;
 	clear_posix_cputimers_work(p);
 
 #ifdef CONFIG_KRETPROBES
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 5d14d639ac71..2efa81a6cbf6 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -184,6 +184,26 @@ bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
 	return ret == cb;
 }
 
+bool task_work_cancel_current(struct callback_head *cb)
+{
+	struct callback_head **running;
+
+	if (task_work_cancel(current, cb))
+		return true;
+
+	running = &current->task_works_running;
+	while (*running) {
+		if (*running == cb) {
+			*running = cb->next;
+			return true;
+		}
+		running = &(*running)->next;
+	}
+
+	return false;
+}
+
+
 /**
  * task_work_run - execute the works added by task_work_add()
  *
@@ -195,7 +215,7 @@ bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *work, *head, *next;
+	struct callback_head *work, *head;
 
 	for (;;) {
 		/*
@@ -223,10 +243,11 @@ void task_work_run(void)
 		raw_spin_lock_irq(&task->pi_lock);
 		raw_spin_unlock_irq(&task->pi_lock);
 
+		WARN_ON_ONCE(task->task_works_running);
 		do {
-			next = work->next;
+			task->task_works_running = work->next;
 			work->func(work);
-			work = next;
+			work = task->task_works_running;
 			cond_resched();
 		} while (work);
 	}



* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-10-28 12:21     ` Frederic Weisbecker
@ 2024-10-29 17:21       ` Sebastian Andrzej Siewior
  2024-10-30 14:07         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-29 17:21 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Lai, Yi, linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

On 2024-10-28 13:21:39 [+0100], Frederic Weisbecker wrote:
> Ah the perf_pending_task work is pending but perf_pending_task_sync()
> fails to cancel there:
> 
> 	/*
> 	 * If the task is queued to the current task's queue, we
> 	 * obviously can't wait for it to complete. Simply cancel it.
> 	 */
> 	if (task_work_cancel(current, head)) {
> 		event->pending_work = 0;
> 		local_dec(&event->ctx->nr_no_switch_fast);
> 		return;
> 	}
> 
> And that's because the work is not anymore on the task work
> list in task->task_works. Instead it's in the executing list
> in task_work_run(). It's a blind spot for task_work_cancel()
> if the current task is already running the task works. And it
> does since it's running the fput delayed work.
> 
> Something like this untested?

Tested. Not sure if this is a good idea.
Couldn't we take the ->next pointer and add it to current::task_works
instead?
That patch in ZtYyXG4fYbUdoBpk@pavilion.home gets rid of that
rcuwait_wait_event().

Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-10-29 17:21       ` Sebastian Andrzej Siewior
@ 2024-10-30 14:07         ` Sebastian Andrzej Siewior
  2024-10-30 15:46           ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-30 14:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Lai, Yi, linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

On 2024-10-29 18:21:31 [+0100], To Frederic Weisbecker wrote:
> On 2024-10-28 13:21:39 [+0100], Frederic Weisbecker wrote:
> > Ah the perf_pending_task work is pending but perf_pending_task_sync()
> > fails to cancel there:
> > 
> > 	/*
> > 	 * If the task is queued to the current task's queue, we
> > 	 * obviously can't wait for it to complete. Simply cancel it.
> > 	 */
> > 	if (task_work_cancel(current, head)) {
> > 		event->pending_work = 0;
> > 		local_dec(&event->ctx->nr_no_switch_fast);
> > 		return;
> > 	}
> > 
> > And that's because the work is not anymore on the task work
> > list in task->task_works. Instead it's in the executing list
> > in task_work_run(). It's a blind spot for task_work_cancel()
> > if the current task is already running the task works. And it
> > does since it's running the fput delayed work.
> > 
> > Something like this untested?
> 
> Tested. Not sure if this is a good idea.
> Couldn't we take the ->next pointer and add it to current::task_works
> instead?
> That patch in ZtYyXG4fYbUdoBpk@pavilion.home gets rid of that
> rcuwait_wait_event().

Just tested. That patch from ZtYyXG4fYbUdoBpk@pavilion.home also solves
that problem. Could we take that one instead?
 
Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-10-30 14:07         ` Sebastian Andrzej Siewior
@ 2024-10-30 15:46           ` Frederic Weisbecker
  2024-11-07 14:46             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-10-30 15:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Lai, Yi, linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

On Wed, Oct 30, 2024 at 03:07:21PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-10-29 18:21:31 [+0100], To Frederic Weisbecker wrote:
> > On 2024-10-28 13:21:39 [+0100], Frederic Weisbecker wrote:
> > > Ah the perf_pending_task work is pending but perf_pending_task_sync()
> > > fails to cancel there:
> > > 
> > > 	/*
> > > 	 * If the task is queued to the current task's queue, we
> > > 	 * obviously can't wait for it to complete. Simply cancel it.
> > > 	 */
> > > 	if (task_work_cancel(current, head)) {
> > > 		event->pending_work = 0;
> > > 		local_dec(&event->ctx->nr_no_switch_fast);
> > > 		return;
> > > 	}
> > > 
> > > And that's because the work is not anymore on the task work
> > > list in task->task_works. Instead it's in the executing list
> > > in task_work_run(). It's a blind spot for task_work_cancel()
> > > if the current task is already running the task works. And it
> > > does since it's running the fput delayed work.
> > > 
> > > Something like this untested?
> > 
> > Tested. Not sure if this is a good idea.
> > Couldn't we take the ->next pointer and add it to current::task_works
> > instead?
> > That patch in ZtYyXG4fYbUdoBpk@pavilion.home gets rid of that
> > rcuwait_wait_event().
> 
> Just tested. That patch from ZtYyXG4fYbUdoBpk@pavilion.home also solves
> that problem. Could we take that one instead?

This needs more thought. We must make sure that the parent is put _after_
the child because it's dereferenced on release, for example:

put_event()
   free_event()
      irq_work_sync(&event->pending_irq);
      ====> IRQ or irq_workd
      perf_event_wakeup()
         ring_buffer_wakeup()
	    event = event->parent;
	    rcu_dereference(event->rb);

And now after this patch it's possible that this happens after
the parent has been released.

We could put the parent from the child's free_event() but some
places (inherit_event()) may call free_event() on a child without
having held a reference to the parent.

Also note that with this patch the task may receive late irrelevant
signals after the event is removed. It's probably not that bad but
still... This could be a concern for exec(): is there a missing
task_work_run() there before flush_signal_handlers()?

Anyway here is an updated version:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e3589c4287cb..4031d0dbc47b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5448,10 +5448,17 @@ static void perf_remove_from_owner(struct perf_event *event)
 
 static void put_event(struct perf_event *event)
 {
+	struct perf_event *parent;
+
 	if (!atomic_long_dec_and_test(&event->refcount))
 		return;
 
+	parent = event->parent;
 	_free_event(event);
+
+	/* Matches the refcount bump in inherit_event() */
+	if (parent)
+		put_event(parent);
 }
 
 /*
@@ -5535,11 +5542,6 @@ int perf_event_release_kernel(struct perf_event *event)
 		if (tmp == child) {
 			perf_remove_from_context(child, DETACH_GROUP);
 			list_move(&child->child_list, &free_list);
-			/*
-			 * This matches the refcount bump in inherit_event();
-			 * this can't be the last reference.
-			 */
-			put_event(event);
 		} else {
 			var = &ctx->refcount;
 		}
@@ -5565,7 +5567,7 @@ int perf_event_release_kernel(struct perf_event *event)
 		void *var = &child->ctx->refcount;
 
 		list_del(&child->child_list);
-		free_event(child);
+		put_event(child);
 
 		/*
 		 * Wake any perf_event_free_task() waiting for this event to be
@@ -13325,8 +13327,7 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 		 * Kick perf_poll() for is_event_hup();
 		 */
 		perf_event_wakeup(parent_event);
-		free_event(event);
-		put_event(parent_event);
+		put_event(event);
 		return;
 	}
 
@@ -13444,13 +13445,11 @@ static void perf_free_event(struct perf_event *event,
 	list_del_init(&event->child_list);
 	mutex_unlock(&parent->child_mutex);
 
-	put_event(parent);
-
 	raw_spin_lock_irq(&ctx->lock);
 	perf_group_detach(event);
 	list_del_event(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
-	free_event(event);
+	put_event(event);
 }
 
 /*


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-10-30 15:46           ` Frederic Weisbecker
@ 2024-11-07 14:46             ` Sebastian Andrzej Siewior
  2024-11-08 13:11               ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-07 14:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Lai, Yi, linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

On 2024-10-30 16:46:22 [+0100], Frederic Weisbecker wrote:
> This needs more thoughts. We must make sure that the parent is put _after_
> the child because it's dereferenced on release, for example:
> put_event()
>    free_event()
>       irq_work_sync(&event->pending_irq);
>       ====> IRQ or irq_workd
>       perf_event_wakeup()
>          ring_buffer_wakeup()
> 	    event = event->parent;
> 	    rcu_dereference(event->rb);
> 
> And now after this patch it's possible that this happens after
> the parent has been released.
> 
> We could put the parent from the child's free_event() but some
> places (inherit_event()) may call free_event() on a child without
> having held a reference to the parent.
> 
> Also note that with this patch the task may receive late irrelevant
> signals after the event is removed. It's probably not that bad but
> still... This could be a concern for exec(), is there a missing
> task_work_run() there before flush_signal_handlers()?

So if this causes so much pain, what about taking only one item at a
time? The following passes the test, too:

diff --git a/kernel/task_work.c b/kernel/task_work.c
index c969f1f26be58..fc796ffddfc74 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -206,7 +206,7 @@ bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
 void task_work_run(void)
 {
 	struct task_struct *task = current;
-	struct callback_head *work, *head, *next;
+	struct callback_head *work, *head;
 
 	for (;;) {
 		/*
@@ -214,17 +214,7 @@ void task_work_run(void)
 		 * work_exited unless the list is empty.
 		 */
 		work = READ_ONCE(task->task_works);
-		do {
-			head = NULL;
-			if (!work) {
-				if (task->flags & PF_EXITING)
-					head = &work_exited;
-				else
-					break;
-			}
-		} while (!try_cmpxchg(&task->task_works, &work, head));
-
-		if (!work)
+		if (!work && !(task->flags & PF_EXITING))
 			break;
 		/*
 		 * Synchronize with task_work_cancel_match(). It can not remove
@@ -232,13 +222,24 @@ void task_work_run(void)
 		 * But it can remove another entry from the ->next list.
 		 */
 		raw_spin_lock_irq(&task->pi_lock);
+		do {
+			head = NULL;
+			if (work) {
+				head = READ_ONCE(work->next);
+			} else {
+				if (task->flags & PF_EXITING)
+					head = &work_exited;
+				else
+					break;
+			}
+		} while (!try_cmpxchg(&task->task_works, &work, head));
 		raw_spin_unlock_irq(&task->pi_lock);
 
-		do {
-			next = work->next;
-			work->func(work);
-			work = next;
+		if (!work)
+			break;
+		work->func(work);
+
+		if (head)
 			cond_resched();
-		} while (work);
 	}
 }

Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-11-07 14:46             ` Sebastian Andrzej Siewior
@ 2024-11-08 13:11               ` Frederic Weisbecker
  2024-11-08 19:08                 ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-11-08 13:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Oleg Nesterov
  Cc: Lai, Yi, linux-perf-users, linux-kernel, Adrian Hunter,
	Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

+Cc Oleg.

On Thu, Nov 07, 2024 at 03:46:17PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-10-30 16:46:22 [+0100], Frederic Weisbecker wrote:
> > This needs more thoughts. We must make sure that the parent is put _after_
> > the child because it's dereferenced on release, for example:
> …
> > put_event()
> >    free_event()
> >       irq_work_sync(&event->pending_irq);
> >       ====> IRQ or irq_workd
> >       perf_event_wakeup()
> >          ring_buffer_wakeup()
> > 	    event = event->parent;
> > 	    rcu_dereference(event->rb);
> > 
> > And now after this patch it's possible that this happens after
> > the parent has been released.
> > 
> > We could put the parent from the child's free_event() but some
> > places (inherit_event()) may call free_event() on a child without
> > having held a reference to the parent.
> > 
> > Also note that with this patch the task may receive late irrelevant
> > signals after the event is removed. It's probably not that bad but
> > still... This could be a concern for exec(), is there a missing
> > task_work_run() there before flush_signal_handlers()?
> 
> So if this causes so much pain, what about taking only one item at a
> item? The following passes the test, too:
> 
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index c969f1f26be58..fc796ffddfc74 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -206,7 +206,7 @@ bool task_work_cancel(struct task_struct *task, struct callback_head *cb)
>  void task_work_run(void)
>  {
>  	struct task_struct *task = current;
> -	struct callback_head *work, *head, *next;
> +	struct callback_head *work, *head;
>  
>  	for (;;) {
>  		/*
> @@ -214,17 +214,7 @@ void task_work_run(void)
>  		 * work_exited unless the list is empty.
>  		 */
>  		work = READ_ONCE(task->task_works);
> -		do {
> -			head = NULL;
> -			if (!work) {
> -				if (task->flags & PF_EXITING)
> -					head = &work_exited;
> -				else
> -					break;
> -			}
> -		} while (!try_cmpxchg(&task->task_works, &work, head));
> -
> -		if (!work)
> +		if (!work && !(task->flags & PF_EXITING))
>  			break;
>  		/*
>  		 * Synchronize with task_work_cancel_match(). It can not remove
> @@ -232,13 +222,24 @@ void task_work_run(void)
>  		 * But it can remove another entry from the ->next list.
>  		 */
>  		raw_spin_lock_irq(&task->pi_lock);
> +		do {
> +			head = NULL;
> +			if (work) {
> +				head = READ_ONCE(work->next);
> +			} else {
> +				if (task->flags & PF_EXITING)
> +					head = &work_exited;
> +				else
> +					break;
> +			}
> +		} while (!try_cmpxchg(&task->task_works, &work, head));
>  		raw_spin_unlock_irq(&task->pi_lock);

And having more than one pending task work should be sufficiently
rare that we don't care about doing the locking + cmpxchg() for each
of them.

I like it!

Thanks.

>  
> -		do {
> -			next = work->next;
> -			work->func(work);
> -			work = next;
> +		if (!work)
> +			break;
> +		work->func(work);
> +
> +		if (head)
>  			cond_resched();
> -		} while (work);
>  	}
>  }
> 
> Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-11-08 13:11               ` Frederic Weisbecker
@ 2024-11-08 19:08                 ` Oleg Nesterov
  2024-11-08 22:26                   ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-11-08 19:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

Sorry, currently I don't have time to even read the emails from lkml.

Plus I wasn't cc'ed so I don't understand the intent at all, but ...

On 11/08, Frederic Weisbecker wrote:
>
> > @@ -232,13 +222,24 @@ void task_work_run(void)
> >  		 * But it can remove another entry from the ->next list.
> >  		 */
> >  		raw_spin_lock_irq(&task->pi_lock);
> > +		do {
> > +			head = NULL;
> > +			if (work) {
> > +				head = READ_ONCE(work->next);
> > +			} else {
> > +				if (task->flags & PF_EXITING)
> > +					head = &work_exited;
> > +				else
> > +					break;
> > +			}
> > +		} while (!try_cmpxchg(&task->task_works, &work, head));
> >  		raw_spin_unlock_irq(&task->pi_lock);
>
> And having more than one task work should be sufficiently rare
> that we don't care about doing the locking + cmpxchg() for each
> of them pending.

Please see
https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/
and the whole thread.

I don't think raw_spin_lock_irq + cmpxchg for each work is a good
idea, but quite possibly I misunderstood this change.

Oleg.



* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-11-08 19:08                 ` Oleg Nesterov
@ 2024-11-08 22:26                   ` Frederic Weisbecker
  2024-11-11 12:08                     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-11-08 22:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On Fri, Nov 08, 2024 at 08:08:36PM +0100, Oleg Nesterov wrote:
> Sorry, currently I don't have time to even read the emails from lkml.
> 
> Plus I wasn't cc'ed so I don't understand the intent at all, but ...
> 
> On 11/08, Frederic Weisbecker wrote:
> >
> > > @@ -232,13 +222,24 @@ void task_work_run(void)
> > >  		 * But it can remove another entry from the ->next list.
> > >  		 */
> > >  		raw_spin_lock_irq(&task->pi_lock);
> > > +		do {
> > > +			head = NULL;
> > > +			if (work) {
> > > +				head = READ_ONCE(work->next);
> > > +			} else {
> > > +				if (task->flags & PF_EXITING)
> > > +					head = &work_exited;
> > > +				else
> > > +					break;
> > > +			}
> > > +		} while (!try_cmpxchg(&task->task_works, &work, head));
> > >  		raw_spin_unlock_irq(&task->pi_lock);
> >
> > And having more than one task work should be sufficiently rare
> > that we don't care about doing the locking + cmpxchg() for each
> > of them pending.
> 
> Please see
> https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/
> and the whole thread.
> 
> I don't think raw_spin_lock_irq + cmpxchg for each work is a good
> idea, but quite possibly I misunderstood this change.

I did not realize there could be a gazillion files released in a row. So there
could be noticeable performance issues, I guess...

Thanks.
> 
> Oleg.
> 


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-11-08 22:26                   ` Frederic Weisbecker
@ 2024-11-11 12:08                     ` Sebastian Andrzej Siewior
  2024-12-04  3:02                       ` Lai, Yi
  2024-12-04 13:48                       ` Oleg Nesterov
  0 siblings, 2 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-11 12:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, Lai, Yi, linux-perf-users, linux-kernel,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

On 2024-11-08 23:26:36 [+0100], Frederic Weisbecker wrote:
> > Please see
> > https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/
> > and the whole thread.

Thank you for this Oleg.

> > I don't think raw_spin_lock_irq + cmpxchg for each work is a good
> > idea, but quite possibly I misunderstood this change.
> 
> I did not realize there could be gazillion files released in a row. So there
> could be noticeable performance issues I guess...

I made a testcase to open 2M (2 * 10^6) files and then exit. This led
task_work_run() to run 2M + 3 callbacks (+ stdin/out/err) for the task.
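
Roughly, the testcase boils down to something like this (illustrative
sketch only; it assumes RLIMIT_NOFILE and fs.nr_open have been raised
so that ~2M descriptors can be open at once):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
	/* Hypothetical limits; needs fs.nr_open and the hard limit raised. */
	struct rlimit rl = { .rlim_cur = 2100000, .rlim_max = 2100000 };

	if (setrlimit(RLIMIT_NOFILE, &rl))
		perror("setrlimit");

	for (long i = 0; i < 2000000; i++) {
		if (open("/dev/null", O_RDONLY) < 0) {
			perror("open");
			exit(1);
		}
	}
	/*
	 * Exit without close(): each fd is released via the delayed fput()
	 * path, i.e. one task_work callback per file at exit time.
	 */
	return 0;
}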

Running 70 samples on the "orig" kernel:
- avg callback time 1.156.470,3 us
- 63 samples are starting with 11 (1,1 sec) avg: 1.128.046,7 us
- 6 samples are starting with 14 (1,4 sec) avg: 1.435.294,8us

Running 70 samples on the "patched" kernel:
- avg callback time 1.278.938,8 us
- 59 samples are starting with 12 (1,2 sec) avg: 1.230.189,1 us
- 10 samples are starting with 15 (1,5sec) avg: 1.555.934,5 us

With the extra lock the task_work_run() runtime extends by approximately
122ms for the 2M invoked callbacks.
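That is roughly 122ms / 2M ~= 61ns of extra overhead per invoked callback.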
The spike from 1,1sec to 1,4sec or from 1,2sec to 1,5sec is due to context
switching (there are a few cond_resched()/might_sleep() calls).

It is not that bad, is it?

> Thanks.
> > 
> > Oleg.

Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-11-11 12:08                     ` Sebastian Andrzej Siewior
@ 2024-12-04  3:02                       ` Lai, Yi
  2024-12-04 13:48                       ` Oleg Nesterov
  1 sibling, 0 replies; 32+ messages in thread
From: Lai, Yi @ 2024-12-04  3:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Frederic Weisbecker, Oleg Nesterov, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On Mon, Nov 11, 2024 at 01:08:57PM +0100, Sebastian Andrzej Siewior wrote:
> On 2024-11-08 23:26:36 [+0100], Frederic Weisbecker wrote:
> > > Please see
> > > https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/
> > > and the whole thread.
> 
> Thank you for this Oleg.
> 
> > > I don't think raw_spin_lock_irq + cmpxchg for each work is a good
> > > idea, but quite possibly I misunderstood this change.
> > 
> > I did not realize there could be gazillion files released in a row. So there
> > could be noticeable performance issues I guess...
> 
> I made a testcase to open 2M (2 *10^6) files and then exit. This led
> task_work_run() run 2M + 3 callbacks (+ stdin/out/err) for the task.
> 
> Running 70 samples on the "orig" kernel:
> - avg callback time 1.156.470,3 us
> - 63 samples are starting with 11 (1,1 sec) avg: 1.128.046,7 us
> - 6 samples are starting with 14 (1,4 sec) avg: 1.435.294,8us
> 
> Running 70 samples on the "patched" kernel:
> - avg callback time 1.278.938,8 us
> - 59 samples are starting with 12 (1,2 sec) avg: 1.230.189,1 us
> - 10 samples are starting with 15 (1,5sec) avg: 1.555.934,5 us
> 
> With the extra lock the task_work_run() runtime extends by approximately
> ~122ms for the 2M invoked callbacks. 
> The spike 1,1sec -> 1,4sec or 1,2sec -> 1,5 sec is due to context
> switching (there are few cond_resched()/ might_sleep()).
> 
> It is not that bad, is it?
>

Hi,

Have we reached consensus on the fix? The issue can still be reproduced
with the v6.13-rc1 kernel. Call trace:

[  300.429498]  ? lock_acquire+0x80/0xb0
[  300.429695]  ? schedule+0x216/0x3f0
[  300.429888]  schedule+0xf6/0x3f0
[  300.430068]  _free_event+0x531/0x14c0
[  300.430277]  perf_event_release_kernel+0x648/0x870
[  300.430680]  ? __pfx_perf_event_release_kernel+0x10/0x10
[  300.430969]  ? __this_cpu_preempt_check+0x21/0x30
[  300.431228]  ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30
[  300.431519]  ? __pfx_perf_release+0x10/0x10
[  300.431746]  perf_release+0x3a/0x50
[  300.431940]  __fput+0x414/0xb60
[  300.432127]  ____fput+0x22/0x30
[  300.432303]  task_work_run+0x19c/0x2b0
[  300.432518]  ? __pfx_task_work_run+0x10/0x10
[  300.432752]  ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[  300.433038]  ? switch_task_namespaces+0xc6/0x110
[  300.433295]  do_exit+0xb19/0x2a30
[  300.433484]  ? lockdep_hardirqs_on+0x89/0x110
[  300.433728]  ? __pfx_do_exit+0x10/0x10
[  300.433942]  do_group_exit+0xe4/0x2c0
[  300.434147]  get_signal+0x2279/0x24c0
[  300.434360]  ? __pfx_get_signal+0x10/0x10
[  300.434601]  ? __might_fault+0xf1/0x1b0
[  300.434824]  arch_do_signal_or_restart+0x8e/0x7d0
[  300.435083]  ? __pfx_arch_do_signal_or_restart+0x10/0x10
[  300.435375]  ? __this_cpu_preempt_check+0x21/0x30
[  300.435629]  ? syscall_exit_to_user_mode+0x10f/0x200
[  300.435901]  syscall_exit_to_user_mode+0x144/0x200
[  300.436161]  do_syscall_64+0x79/0x140
[  300.436366]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  300.436643] RIP: 0033:0x7f5add83ee5d
[  300.436843] RSP: 002b:00007fff21a70e68 EFLAGS: 00000206 ORIG_RAX: 000000000000012a
[  300.437240] RAX: 0000000000000003 RBX: 0000000000000000 RCX: 00007f5add83ee5d
[  300.437610] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000080
[  300.437969] RBP: 00007fff21a70e70 R08: 000000000000000b R09: 00007fff21a70ea0
[  300.438329] R10: 00000000ffffffff R11: 0000000000000206 R12: 00007fff21a70fc8
[  300.438703] R13: 0000000000401b4f R14: 0000000000403e08 R15: 00007f5addc0e000
[  300.439075]  </TASK>
[  300.439197]
[  300.439197] Showing all locks held in the system:
[  300.439513] 1 lock held by khungtaskd/33:
[  300.439723]  #0: ffffffff8705ca40 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x73/0x3c0
[  300.440199]
[  300.440288] =============================================
[  300.440288]

Regards,
Yi Lai

> > Thanks.
> > > 
> > > Oleg.
> 
> Sebastian


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-11-11 12:08                     ` Sebastian Andrzej Siewior
  2024-12-04  3:02                       ` Lai, Yi
@ 2024-12-04 13:48                       ` Oleg Nesterov
  2024-12-05  0:19                         ` Frederic Weisbecker
  1 sibling, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-12-04 13:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Frederic Weisbecker, Lai, Yi, linux-perf-users, linux-kernel,
	Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Daniel Bristot de Oliveira, Ian Rogers, Ingo Molnar, Jiri Olsa,
	Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
	Peter Zijlstra, Thomas Gleixner, Arnaldo Carvalho de Melo,
	yi1.lai, syzkaller-bugs

On 11/11, Sebastian Andrzej Siewior wrote:
>
> On 2024-11-08 23:26:36 [+0100], Frederic Weisbecker wrote:
> > > Please see
> > > https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/
> > > and the whole thread.
>
> Thank you for this Oleg.
>
> > > I don't think raw_spin_lock_irq + cmpxchg for each work is a good
> > > idea, but quite possibly I misunderstood this change.
> >
> > I did not realize there could be gazillion files released in a row. So there
> > could be noticeable performance issues I guess...
>
> I made a testcase to open 2M (2 *10^6) files and then exit. This led
> task_work_run() run 2M + 3 callbacks (+ stdin/out/err) for the task.
>
> Running 70 samples on the "orig" kernel:
> - avg callback time 1.156.470,3 us
> - 63 samples are starting with 11 (1,1 sec) avg: 1.128.046,7 us
> - 6 samples are starting with 14 (1,4 sec) avg: 1.435.294,8us
>
> Running 70 samples on the "patched" kernel:
> - avg callback time 1.278.938,8 us
> - 59 samples are starting with 12 (1,2 sec) avg: 1.230.189,1 us
> - 10 samples are starting with 15 (1,5sec) avg: 1.555.934,5 us
>
> With the extra lock the task_work_run() runtime extends by approximately
> ~122ms for the 2M invoked callbacks.
> The spike 1,1sec -> 1,4sec or 1,2sec -> 1,5 sec is due to context
> switching (there are few cond_resched()/ might_sleep()).
>
> It is not that bad, is it?

Not that bad, but I personally dislike this patch for other reasons.
But let's forget it for the moment.

The numbers in

	[PATCH] task_work: remove fifo ordering guarantee
	https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/

didn't look too bad either, yet they convinced Linus and other reviewers.

I still think that fifo makes much more sense. The main (only?) offender
is fput(), so perhaps we can do something like
https://lore.kernel.org/all/20150907134924.GA24254@redhat.com/
but when I look at this change now I see it is racy.

Stupid question. What if we revert this "task_work: remove fifo ordering guarantee"
patch above? Can this help?

I don't understand this code and the problem. But when I (try to) read the
previous discussion on lore.kernel.org it seems that perf_pending_task_sync()
fails to cancel event->pending_task because it is called from task_work_run()
and then rcuwait_wait_event() obviously hangs.

Your patch can only help if task_work_add(current, &event->pending_task) was
called before fput()->task_work_add(task, &file->f_task_work), right?
So perhaps, if we restore the fifo ordering, we can rely on the fact that
current should call perf_pending_task() before it calls perf_release/free_event?

Sorry in advance, I am sure I have missed something...

Oleg.



* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-12-04 13:48                       ` Oleg Nesterov
@ 2024-12-05  0:19                         ` Frederic Weisbecker
  2024-12-05  9:20                           ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-12-05  0:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On Wed, Dec 04, 2024 at 02:48:27PM +0100, Oleg Nesterov wrote:
> On 11/11, Sebastian Andrzej Siewior wrote:
> Not that bad, but I personally dislike this patch for other reasons.
> But lets forget it for the moment.
> 
> The numbers in
> 
> 	PATCH] task_work: remove fifo ordering guarantee
> 	https://lore.kernel.org/all/1440816150.8932.123.camel@edumazet-glaptop2.roam.corp.google.com/
> 
> didn't look too bad too, yet they convinced Linus and other reviewers.
> 
> I still think that fifo makes much more sense. The main (only?) offender
> is fput(), so perhaps we can do something like
> https://lore.kernel.org/all/20150907134924.GA24254@redhat.com/
> but when I look at this change now I see it is racy.
> 
> Stupid question. What if we revert this "task_work: remove fifo ordering guarantee"
> patch above? Can this help?
> 
> I don't understand this code and the problem. But when I (try to) read the
> previous discussion on lore.kernel.org it seems that perf_pending_task_sync()
> fails to cancel event->pending_task because it is called from task_work_run()
> and then rcuwait_wait_event() obviously hangs.
> 
> Your patch can only help if task_work_add(current, &event->pending_task) was
> called before fput()->task_work_add(task, &file->f_task_work(), right?

Right, IIUC if &event->pending_task was added after, then perf_pending_task()
would be called before perf_release() and we wouldn't have the problem.

> So perhaps, if we restore the fifo ordering, we can rely on the fact that
> current should call perf_pending_task() before it calls perf_release/free_event ?

Hmm but a perf event can still fire between the task_work_add() on fput and the
actual call to task_work_run() that will run the queue. So &event->pending_task
can be set either before or after. And then whether fifo or lifo, that would
still be buggy. Or am I missing something?

Looking at task_work, it seems that most enqueues happen to the current task.
AFAICT, only io_uring() does remote enqueue. Would it make sense to have a light
version of task_work that is only ever used by current? This would be a very
simple flavour with easy queue and cancellation without locking/atomics/RmW
operations. We would just need to be extra careful about NMIs. And cancellation
on the current queue would be more deterministic...

Of course we would then lose the advantage of a solution that works for both
remote and current enqueue...
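
To make that concrete, a very rough sketch (purely illustrative,
nothing like this exists today; the task_works_local field and the
function names below are made up, and the NMI problem mentioned above
is ignored here):

	/* Hypothetical current-task-only list, touched only from task context. */
	static void task_work_add_current(struct callback_head *work)
	{
		work->next = current->task_works_local;	/* made-up field */
		current->task_works_local = work;
	}

	static bool task_work_cancel_current_only(struct callback_head *work)
	{
		struct callback_head **pprev = &current->task_works_local;

		/* Unlink the work if it is still queued; no atomics needed. */
		for (; *pprev; pprev = &(*pprev)->next) {
			if (*pprev == work) {
				*pprev = work->next;
				return true;
			}
		}
		return false;
	}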

Thanks.


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-12-05  0:19                         ` Frederic Weisbecker
@ 2024-12-05  9:20                           ` Oleg Nesterov
  2024-12-05 10:05                             ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-12-05  9:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On 12/05, Frederic Weisbecker wrote:
>
> On Wed, Dec 04, 2024 at 02:48:27PM +0100, Oleg Nesterov wrote:
>
> > So perhaps, if we restore the fifo ordering, we can rely on the fact that
> > current should call perf_pending_task() before it calls perf_release/free_event ?
>
> Hmm but a perf event can still fire between the task_work_add() on fput and the
> actual call to task_work_run() that will run the queue. So &event->pending_task
> can be set either before or after. And then whether fifo or lifo, that would
> still be buggy.

Ah, indeed,

> Or am I missing something?

No, it is me, thanks for correcting me!

> Looking at task_work, it seems that most enqueues happen to the current task.
> AFAICT, only io_uring() does remote enqueue. Would it make sense to have a light
> version of task_work that is only ever used by current? This would be a very
> simple flavour with easy queue and cancellation without locking/atomics/RmW
> operations.

Perhaps, but we also need to avoid the races with task_work_cancel() from
another task. I mean, if a task T does task_work_add_light(work), it can race
with task_work_cancel(T, ...) which can change T->task_works on another CPU.


OK, I can't suggest a good solution for this problem, so I won't argue with
the patch from Sebastian. Unfortunately, with this change we can never restore
the fifo ordering. And note that the current ordering is not lifo, it is
"unpredictable". Plus the extra lock/xchg for each work...

Hmm. I just noticed that task_work_run() needs a simple fix:

	--- x/kernel/task_work.c
	+++ x/kernel/task_work.c
	@@ -235,7 +235,7 @@
			raw_spin_unlock_irq(&task->pi_lock);
	 
			do {
	-			next = work->next;
	+			next = READ_ONCE(work->next);
				work->func(work);
				work = next;
				cond_resched();

Perhaps it makes sense before the patch from Sebastian even if that patch
removes this do/while loop?

Oleg.



* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-12-05  9:20                           ` Oleg Nesterov
@ 2024-12-05 10:05                             ` Frederic Weisbecker
  2024-12-05 10:28                               ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-12-05 10:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On Thu, Dec 05, 2024 at 10:20:16AM +0100, Oleg Nesterov wrote:
> > Looking at task_work, it seems that most enqueues happen to the current task.
> > AFAICT, only io_uring() does remote enqueue. Would it make sense to have a light
> > version of task_work that is only ever used by current? This would be a very
> > simple flavour with easy queue and cancellation without locking/atomics/RmW
> > operations.
> 
> Perhaps, but we also need to avoid the races with task_work_cancel() from
> another task. I mean, if a task T does task_work_add_light(work), it can race
> with task_work_cancel(T, ...) which can change T->task_works on another CPU.

I was thinking about two different lists.

> 
> 
> OK, I can't suggest a good solution for this problem, so I won't argue with
> the patch from Sebastian. Unfortunately, with this change we can never restore
> the fifo ordering. And note that the current ordering is not lifo, it is
> "unpredictable". Plus the extra lock/xchg for each work...

Good point, it's unpredictable due to extraction before execution.

Another alternative is to maintain another head that points to the
head of the executing list. This way we can have task_work_cancel_current()
that completely cancels the work. That was my initial proposal here and it
avoids the lock/xchg for each work:

https://lore.kernel.org/all/Zx-B0wK3xqRQsCOS@localhost.localdomain/

> 
> Hmm. I just noticed that task_work_run() needs a simple fix:
> 
> 	--- x/kernel/task_work.c
> 	+++ x/kernel/task_work.c
> 	@@ -235,7 +235,7 @@
> 			raw_spin_unlock_irq(&task->pi_lock);
> 	 
> 			do {
> 	-			next = work->next;
> 	+			next = READ_ONCE(work->next);
> 				work->func(work);
> 				work = next;
> 				cond_resched();
> 
> Perhaps it makes sense before the patch from Sebastian even if that patch
> removes this do/while loop ?

Hmm, can work->next be modified concurrently here?

Thanks.

> 
> Oleg.
> 


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-12-05 10:05                             ` Frederic Weisbecker
@ 2024-12-05 10:28                               ` Oleg Nesterov
  2024-12-13 22:52                                 ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2024-12-05 10:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On 12/05, Frederic Weisbecker wrote:
>
> On Thu, Dec 05, 2024 at 10:20:16AM +0100, Oleg Nesterov wrote:
>
> > > Looking at task_work, it seems that most enqueues happen to the current task.
> > > AFAICT, only io_uring() does remote enqueue. Would it make sense to have a light
> > > version of task_work that is only ever used by current? This would be a very
> > > simple flavour with easy queue and cancellation without locking/atomics/RmW
> > > operations.
> >
> > Perhaps, but we also need to avoid the races with task_work_cancel() from
> > another task. I mean, if a task T does task_work_add_light(work), it can race
> > with task_work_cancel(T, ...) which can change T->task_works on another CPU.
>
> I was thinking about two different lists.

OK... but this needs more thinking/discussion.

> Another alternative is to maintain another head that points to the
> head of the executing list. This way we can have task_work_cancel_current()
> that completely cancels the work. That was my initial proposal here and it
> avoids the lock/xchg for each work:
>
> https://lore.kernel.org/all/Zx-B0wK3xqRQsCOS@localhost.localdomain/

Thanks... Heh, I thought about something like this too ;) Although I thought
that we need a bit more to implement task_work_cancel_sync(). But this is
another story.

> > Hmm. I just noticed that task_work_run() needs a simple fix:
> >
> > 	--- x/kernel/task_work.c
> > 	+++ x/kernel/task_work.c
> > 	@@ -235,7 +235,7 @@
> > 			raw_spin_unlock_irq(&task->pi_lock);
> >
> > 			do {
> > 	-			next = work->next;
> > 	+			next = READ_ONCE(work->next);
> > 				work->func(work);
> > 				work = next;
> > 				cond_resched();
> >
> > Perhaps it makes sense before the patch from Sebastian even if that patch
> > removes this do/while loop ?
>
> Hmm, can work->next be modified concurrently here?

work->func(work) can, say, do kfree(work) or do another task_work_add(X, work).
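
For example, a hypothetical callback along these lines (just to
illustrate why ->next has to be loaded before the call):

	struct my_work {
		struct callback_head cb;
		/* ... payload ... */
	};

	static void my_work_func(struct callback_head *cb)
	{
		struct my_work *w = container_of(cb, struct my_work, cb);

		/* After this, cb (and thus cb->next) points into freed memory. */
		kfree(w);
	}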

Oleg.



* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-12-05 10:28                               ` Oleg Nesterov
@ 2024-12-13 22:52                                 ` Frederic Weisbecker
  2024-12-16 19:19                                   ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-12-13 22:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On Thu, Dec 05, 2024 at 11:28:41AM +0100, Oleg Nesterov wrote:
> On 12/05, Frederic Weisbecker wrote:
> >
> > On Thu, Dec 05, 2024 at 10:20:16AM +0100, Oleg Nesterov wrote:
> >
> > > > Looking at task_work, it seems that most enqueues happen to the current task.
> > > > AFAICT, only io_uring() does remote enqueue. Would it make sense to have a light
> > > > version of task_work that is only ever used by current? This would be a very
> > > > simple flavour with easy queue and cancellation without locking/atomics/RmW
> > > > operations.
> > >
> > > Perhaps, but we also need to avoid the races with task_work_cancel() from
> > > another task. I mean, if a task T does task_work_add_light(work), it can race
> > > with task_work_cancel(T, ...) which can change T->task_works on another CPU.
> >
> > I was thinking about two different lists.
> 
> OK... but this needs more thinking/discussion.

Sure.

> 
> > Another alternative is to maintain another head that points to the
> > head of the executing list. This way we can have task_work_cancel_current()
> > that completely cancels the work. That was my initial proposal here and it
> > avoids the lock/xchg for each work:
> >
> > https://lore.kernel.org/all/Zx-B0wK3xqRQsCOS@localhost.localdomain/
> 
> Thanks... Heh, I thought about something like this too ;) Although I thought
> that we need a bit more to implement task_work_cancel_sync(). But this is
> another story.

So which way do you prefer to solve the initial problem?

> 
> > > Hmm. I just noticed that task_work_run() needs a simple fix:
> > >
> > > 	--- x/kernel/task_work.c
> > > 	+++ x/kernel/task_work.c
> > > 	@@ -235,7 +235,7 @@
> > > 			raw_spin_unlock_irq(&task->pi_lock);
> > >
> > > 			do {
> > > 	-			next = work->next;
> > > 	+			next = READ_ONCE(work->next);
> > > 				work->func(work);
> > > 				work = next;
> > > 				cond_resched();
> > >
> > > Perhaps it makes sense before the patch from Sebastian even if that patch
> > > removes this do/while loop ?
> >
> > Hmm, can work->next be modified concurrently here?
> 
> work->func(work) can, say, do kfree(work) or do another task_work_add(X,
> work).

Right, but then isn't it serialized in program order, from the compiler's point of view?

Thanks.

> 
> Oleg.
> 


* Re: [PATCH v4 2/6] perf: Enqueue SIGTRAP always via task_work.
  2024-12-13 22:52                                 ` Frederic Weisbecker
@ 2024-12-16 19:19                                   ` Oleg Nesterov
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2024-12-16 19:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Sebastian Andrzej Siewior, Lai, Yi, linux-perf-users,
	linux-kernel, Adrian Hunter, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Daniel Bristot de Oliveira, Ian Rogers,
	Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
	Namhyung Kim, Peter Zijlstra, Thomas Gleixner,
	Arnaldo Carvalho de Melo, yi1.lai, syzkaller-bugs

On 12/13, Frederic Weisbecker wrote:
>
> So which way do you prefer do solve the initial problem?

Ah, please do what you think/feel is right.

I can't suggest a better fix anyway.

> > > > 			do {
> > > > 	-			next = work->next;
> > > > 	+			next = READ_ONCE(work->next);
> > > > 				work->func(work);
> > > > 				work = next;
> > > > 				cond_resched();
> > > >
> > > > Perhaps it makes sense before the patch from Sebastian even if that patch
> > > > removes this do/while loop ?
> > >
> > > Hmm, can work->next be modified concurrently here?
> >
> > work->func(work) can, say, do kfree(work) or do another task_work_add(X,
> > work).
>
> Right but then isn't it serialized program order, from the compiler point of view?

Hmm, indeed, you are right. In this case the compiler can't assume it can
"defer" the load of work->next. Thanks for correcting me!

Oleg.


