* [PATCH v5 1/7] perf: Move irq_work_queue() where the event is prepared.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode Sebastian Andrzej Siewior
` (7 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
Arnaldo Carvalho de Melo
Only if perf_event::pending_sigtrap is zero is the irq_work accounted
for by incrementing perf_event::nr_pending. The member
perf_event::pending_addr might be overwritten by a subsequent event if
the signal was not yet delivered but is still expected. The irq_work
will not be enqueued again because it has a check to be enqueued only
once.
Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set to make it more obvious that the
irq_work is scheduled once.
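The "queued exactly once" invariant can be sketched in plain userland C
(all names here are stand-ins for illustration, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Miniature of the pattern the patch makes obvious by moving
 * irq_work_queue() next to where pending_sigtrap is armed: the work
 * is queued only on the 0 -> id transition of the pending flag. */
struct fake_event {
	unsigned int pending_sigtrap;	/* 0 = no signal pending */
	unsigned int queue_count;	/* times the irq_work was queued */
};

/* Overflow path: returns true if the work item was newly queued. */
static bool overflow(struct fake_event *ev, unsigned int pending_id)
{
	if (ev->pending_sigtrap)
		return false;		/* still pending: nothing to queue */
	ev->pending_sigtrap = pending_id;
	ev->queue_count++;		/* irq_work_queue() stand-in */
	return true;
}

/* Delivery side: consuming the signal clears the flag, so the next
 * overflow may queue again. */
static void deliver(struct fake_event *ev)
{
	ev->pending_sigtrap = 0;
}
```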
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/events/core.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 586d4f3676240..647abeeaeeb02 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9738,6 +9738,11 @@ static int __perf_event_overflow(struct perf_event *event,
if (!event->pending_sigtrap) {
event->pending_sigtrap = pending_id;
local_inc(&event->ctx->nr_pending);
+
+ event->pending_addr = 0;
+ if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
+ event->pending_addr = data->addr;
+ irq_work_queue(&event->pending_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
@@ -9753,11 +9758,6 @@ static int __perf_event_overflow(struct perf_event *event,
*/
WARN_ON_ONCE(event->pending_sigtrap != pending_id);
}
-
- event->pending_addr = 0;
- if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
- event->pending_addr = data->addr;
- irq_work_queue(&event->pending_irq);
}
READ_ONCE(event->overflow_handler)(event, data, regs);
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: perf/core] perf: Move irq_work_queue() where the event is prepared.
2024-07-04 17:03 ` [PATCH v5 1/7] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), Marco Elver,
Arnaldo Carvalho de Melo, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 058244c683111d44a5de16ac74f19a1754dd7a0c
Gitweb: https://git.kernel.org/tip/058244c683111d44a5de16ac74f19a1754dd7a0c
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:35 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:34 +02:00
perf: Move irq_work_queue() where the event is prepared.
Only if perf_event::pending_sigtrap is zero is the irq_work accounted
for by incrementing perf_event::nr_pending. The member
perf_event::pending_addr might be overwritten by a subsequent event if
the signal was not yet delivered but is still expected. The irq_work
will not be enqueued again because it has a check to be enqueued only
once.
Move irq_work_queue() to where the counter is incremented and
perf_event::pending_sigtrap is set to make it more obvious that the
irq_work is scheduled once.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-2-bigeasy@linutronix.de
---
kernel/events/core.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 32c7996..c54da50 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9727,6 +9727,11 @@ static int __perf_event_overflow(struct perf_event *event,
if (!event->pending_sigtrap) {
event->pending_sigtrap = pending_id;
local_inc(&event->ctx->nr_pending);
+
+ event->pending_addr = 0;
+ if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
+ event->pending_addr = data->addr;
+ irq_work_queue(&event->pending_irq);
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
@@ -9742,11 +9747,6 @@ static int __perf_event_overflow(struct perf_event *event,
*/
WARN_ON_ONCE(event->pending_sigtrap != pending_id);
}
-
- event->pending_addr = 0;
- if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
- event->pending_addr = data->addr;
- irq_work_queue(&event->pending_irq);
}
READ_ONCE(event->overflow_handler)(event, data, regs);
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 1/7] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-05 14:00 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 3/7] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
` (6 subsequent siblings)
8 siblings, 2 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior
Adding task_work from NMI context requires the following:
- kasan_record_aux_stack() is not NMI safe and must be avoided.
- Using TWA_RESUME is not NMI safe. If the NMI occurs while the CPU is in
userland then it will continue in userland and not invoke the `work'
callback.
Add TWA_NMI_CURRENT as an additional notify mode. In this mode, skip
KASAN and use an irq_work in hardirq mode to issue the needed interrupt.
Set TIF_NOTIFY_RESUME within the irq_work callback because the k[ac]san
instrumentation in test_and_set_bit() does not look NMI safe in case of
a report.
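The two-stage notification can be sketched in plain userland C (the
names below mirror the patch but are stand-ins, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* In NMI context, task_work_add() only queues a per-CPU irq_work; the
 * callback then runs in hardirq context, where the (possibly
 * instrumented) bit operation setting TIF_NOTIFY_RESUME is safe. */
static bool irq_work_queued;	/* per-CPU irq_work, queued state */
static bool tif_notify_resume;	/* TIF_NOTIFY_RESUME stand-in */

/* TWA_NMI_CURRENT path inside task_work_add(): no KASAN bookkeeping,
 * no thread-flag write, just queue the hard irq_work. */
static void notify_nmi_current(void)
{
	irq_work_queued = true;
}

/* irq_work callback (task_work_set_notify_irq() in the patch), runs
 * once the NMI has returned. */
static void irq_work_callback(void)
{
	if (irq_work_queued) {
		irq_work_queued = false;
		tif_notify_resume = true; /* test_and_set_tsk_thread_flag() */
	}
}
```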
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/task_work.h | 1 +
kernel/task_work.c | 25 ++++++++++++++++++++++---
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 26b8a47f41fca..cf5e7e891a776 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -18,6 +18,7 @@ enum task_work_notify_mode {
TWA_RESUME,
TWA_SIGNAL,
TWA_SIGNAL_NO_IPI,
+ TWA_NMI_CURRENT,
};
static inline bool task_work_pending(struct task_struct *task)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 2134ac8057a94..05fb41fe09f5d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -1,10 +1,19 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/irq_work.h>
#include <linux/spinlock.h>
#include <linux/task_work.h>
#include <linux/resume_user_mode.h>
+#include <trace/events/ipi.h>
static struct callback_head work_exited; /* all we need is ->next == NULL */
+static void task_work_set_notify_irq(struct irq_work *entry)
+{
+ test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
+ IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
+
/**
* task_work_add - ask the @task to execute @work->func()
* @task: the task which should run the callback
@@ -12,7 +21,7 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* @notify: how to notify the targeted task
*
* Queue @work for task_work_run() below and notify the @task if @notify
- * is @TWA_RESUME, @TWA_SIGNAL, or @TWA_SIGNAL_NO_IPI.
+ * is @TWA_RESUME, @TWA_SIGNAL, @TWA_SIGNAL_NO_IPI or @TWA_NMI_CURRENT.
*
* @TWA_SIGNAL works like signals, in that the it will interrupt the targeted
* task and run the task_work, regardless of whether the task is currently
@@ -24,6 +33,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* kernel anyway.
* @TWA_RESUME work is run only when the task exits the kernel and returns to
* user mode, or before entering guest mode.
+ * @TWA_NMI_CURRENT works like @TWA_RESUME, except it can only be used for the
+ * current @task and if the current context is NMI.
*
* Fails if the @task is exiting/exited and thus it can't process this @work.
* Otherwise @work->func() will be called when the @task goes through one of
@@ -44,8 +55,13 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
{
struct callback_head *head;
- /* record the work call stack in order to print it in KASAN reports */
- kasan_record_aux_stack(work);
+ if (notify == TWA_NMI_CURRENT) {
+ if (WARN_ON_ONCE(task != current))
+ return -EINVAL;
+ } else {
+ /* record the work call stack in order to print it in KASAN reports */
+ kasan_record_aux_stack(work);
+ }
head = READ_ONCE(task->task_works);
do {
@@ -66,6 +82,9 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
case TWA_SIGNAL_NO_IPI:
__set_notify_signal(task);
break;
+ case TWA_NMI_CURRENT:
+ irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
+ break;
default:
WARN_ON_ONCE(1);
break;
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode.
2024-07-04 17:03 ` [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode Sebastian Andrzej Siewior
@ 2024-07-05 14:00 ` Sebastian Andrzej Siewior
2024-07-05 14:58 ` Peter Zijlstra
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
1 sibling, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-05 14:00 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner
On 2024-07-04 19:03:36 [+0200], To linux-perf-users@vger.kernel.org wrote:
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 2134ac8057a94..05fb41fe09f5d 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -1,10 +1,19 @@
> // SPDX-License-Identifier: GPL-2.0
> +#include <linux/irq_work.h>
> #include <linux/spinlock.h>
> #include <linux/task_work.h>
> #include <linux/resume_user_mode.h>
> +#include <trace/events/ipi.h>
Just noticed that this trace include does not belong here. It is a
leftover from an earlier iteration where I intended to use
smp_send_reschedule() instead of irq_work…
Should I repost the series, or just this one?
Sebastian
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode.
2024-07-05 14:00 ` Sebastian Andrzej Siewior
@ 2024-07-05 14:58 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2024-07-05 14:58 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Frederic Weisbecker, Ian Rogers,
Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
Namhyung Kim, Thomas Gleixner
On Fri, Jul 05, 2024 at 04:00:20PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-07-04 19:03:36 [+0200], To linux-perf-users@vger.kernel.org wrote:
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 2134ac8057a94..05fb41fe09f5d 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -1,10 +1,19 @@
> > // SPDX-License-Identifier: GPL-2.0
> > +#include <linux/irq_work.h>
> > #include <linux/spinlock.h>
> > #include <linux/task_work.h>
> > #include <linux/resume_user_mode.h>
> > +#include <trace/events/ipi.h>
>
> Just noticed that this trace include does not belong here. It is a
> leftover from an earlier iteration where I intended to use
> smp_send_reschedule() instead of irq_work…
>
> Should I repost the series, or just this one?
I'll zap it. no worries.
^ permalink raw reply [flat|nested] 22+ messages in thread
* [tip: perf/core] task_work: Add TWA_NMI_CURRENT as an additional notify mode.
2024-07-04 17:03 ` [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode Sebastian Andrzej Siewior
2024-07-05 14:00 ` Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
1 sibling, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra, Sebastian Andrzej Siewior, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 466e4d801cd438a1ab2c8a2cce1bef6b65c31bbb
Gitweb: https://git.kernel.org/tip/466e4d801cd438a1ab2c8a2cce1bef6b65c31bbb
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:36 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:34 +02:00
task_work: Add TWA_NMI_CURRENT as an additional notify mode.
Adding task_work from NMI context requires the following:
- kasan_record_aux_stack() is not NMI safe and must be avoided.
- Using TWA_RESUME is not NMI safe. If the NMI occurs while the CPU is in
userland then it will continue in userland and not invoke the `work'
callback.
Add TWA_NMI_CURRENT as an additional notify mode. In this mode, skip
KASAN and use an irq_work in hardirq mode to issue the needed interrupt.
Set TIF_NOTIFY_RESUME within the irq_work callback because the k[ac]san
instrumentation in test_and_set_bit() does not look NMI safe in case of
a report.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240704170424.1466941-3-bigeasy@linutronix.de
---
include/linux/task_work.h | 1 +
kernel/task_work.c | 24 +++++++++++++++++++++---
2 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 26b8a47..cf5e7e8 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -18,6 +18,7 @@ enum task_work_notify_mode {
TWA_RESUME,
TWA_SIGNAL,
TWA_SIGNAL_NO_IPI,
+ TWA_NMI_CURRENT,
};
static inline bool task_work_pending(struct task_struct *task)
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 2134ac8..5c2daa7 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -1,10 +1,18 @@
// SPDX-License-Identifier: GPL-2.0
+#include <linux/irq_work.h>
#include <linux/spinlock.h>
#include <linux/task_work.h>
#include <linux/resume_user_mode.h>
static struct callback_head work_exited; /* all we need is ->next == NULL */
+static void task_work_set_notify_irq(struct irq_work *entry)
+{
+ test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+static DEFINE_PER_CPU(struct irq_work, irq_work_NMI_resume) =
+ IRQ_WORK_INIT_HARD(task_work_set_notify_irq);
+
/**
* task_work_add - ask the @task to execute @work->func()
* @task: the task which should run the callback
@@ -12,7 +20,7 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* @notify: how to notify the targeted task
*
* Queue @work for task_work_run() below and notify the @task if @notify
- * is @TWA_RESUME, @TWA_SIGNAL, or @TWA_SIGNAL_NO_IPI.
+ * is @TWA_RESUME, @TWA_SIGNAL, @TWA_SIGNAL_NO_IPI or @TWA_NMI_CURRENT.
*
* @TWA_SIGNAL works like signals, in that the it will interrupt the targeted
* task and run the task_work, regardless of whether the task is currently
@@ -24,6 +32,8 @@ static struct callback_head work_exited; /* all we need is ->next == NULL */
* kernel anyway.
* @TWA_RESUME work is run only when the task exits the kernel and returns to
* user mode, or before entering guest mode.
+ * @TWA_NMI_CURRENT works like @TWA_RESUME, except it can only be used for the
+ * current @task and if the current context is NMI.
*
* Fails if the @task is exiting/exited and thus it can't process this @work.
* Otherwise @work->func() will be called when the @task goes through one of
@@ -44,8 +54,13 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
{
struct callback_head *head;
- /* record the work call stack in order to print it in KASAN reports */
- kasan_record_aux_stack(work);
+ if (notify == TWA_NMI_CURRENT) {
+ if (WARN_ON_ONCE(task != current))
+ return -EINVAL;
+ } else {
+ /* record the work call stack in order to print it in KASAN reports */
+ kasan_record_aux_stack(work);
+ }
head = READ_ONCE(task->task_works);
do {
@@ -66,6 +81,9 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
case TWA_SIGNAL_NO_IPI:
__set_notify_signal(task);
break;
+ case TWA_NMI_CURRENT:
+ irq_work_queue(this_cpu_ptr(&irq_work_NMI_resume));
+ break;
default:
WARN_ON_ONCE(1);
break;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 3/7] perf: Enqueue SIGTRAP always via task_work.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 1/7] perf: Move irq_work_queue() where the event is prepared Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 2/7] task_work: Add TWA_NMI_CURRENT as an additional notify mode Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 4/7] perf: Shrink the size of the recursion counter Sebastian Andrzej Siewior
` (5 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
Arnaldo Carvalho de Melo
A signal is delivered by raising irq_work(), which works from any
context including NMI. irq_work() can be delayed if the architecture
does not provide an interrupt vector. In order not to lose a signal,
the signal is injected via task_work during event_sched_out().
Instead of going via irq_work, the signal can be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland.
Queue the signal via task_work and consider possible NMI context.
Remove perf_event::pending_sigtrap and use perf_event::pending_work
instead.
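The reworked overflow path can be sketched in plain userland C
(stand-in names, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* The task_work is added at most once, guarded by pending_work == 0,
 * and pending_work is only set if task_work_add() succeeded, i.e.
 * returned 0 as in the kernel. */
static int task_work_add_ret;	/* 0 = success, nonzero = task exiting */

static int fake_task_work_add(void)
{
	return task_work_add_ret;
}

struct fake_event {
	unsigned int pending_work;	/* replaces pending_sigtrap */
};

/* Returns true if the signal was armed for delivery via task_work. */
static bool overflow(struct fake_event *ev, unsigned int pending_id)
{
	if (!ev->pending_work && !fake_task_work_add()) {
		ev->pending_work = pending_id;
		return true;
	}
	return false;	/* already pending, or add failed */
}
```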
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/all/ZMAtZ2t43GXoF6tM@kernel.org/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/perf_event.h | 3 +--
kernel/events/core.c | 31 ++++++++++---------------------
2 files changed, 11 insertions(+), 23 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 393fb13733b02..ea0d82418d854 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
unsigned int pending_wakeup;
unsigned int pending_kill;
unsigned int pending_disable;
- unsigned int pending_sigtrap;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
struct callback_head pending_task;
@@ -963,7 +962,7 @@ struct perf_event_context {
struct rcu_head rcu_head;
/*
- * Sum (event->pending_sigtrap + event->pending_work)
+ * Sum (event->pending_work + event->pending_work)
*
* The SIGTRAP is targeted at ctx->task, as such it won't do changing
* that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 647abeeaeeb02..c278aefa94e76 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,17 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
state = PERF_EVENT_STATE_OFF;
}
- if (event->pending_sigtrap) {
- event->pending_sigtrap = 0;
- if (state != PERF_EVENT_STATE_OFF &&
- !event->pending_work &&
- !task_work_add(current, &event->pending_task, TWA_RESUME)) {
- event->pending_work = 1;
- } else {
- local_dec(&event->ctx->nr_pending);
- }
- }
-
perf_event_set_state(event, state);
if (!is_software_event(event))
@@ -6787,11 +6776,6 @@ static void __perf_pending_irq(struct perf_event *event)
* Yay, we hit home and are in the context of the event.
*/
if (cpu == smp_processor_id()) {
- if (event->pending_sigtrap) {
- event->pending_sigtrap = 0;
- perf_sigtrap(event);
- local_dec(&event->ctx->nr_pending);
- }
if (event->pending_disable) {
event->pending_disable = 0;
perf_event_disable_local(event);
@@ -9732,21 +9716,26 @@ static int __perf_event_overflow(struct perf_event *event,
*/
bool valid_sample = sample_is_allowed(event, regs);
unsigned int pending_id = 1;
+ enum task_work_notify_mode notify_mode;
if (regs)
pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
- if (!event->pending_sigtrap) {
- event->pending_sigtrap = pending_id;
+
+ notify_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
+
+ if (!event->pending_work &&
+ !task_work_add(current, &event->pending_task, notify_mode)) {
+ event->pending_work = pending_id;
local_inc(&event->ctx->nr_pending);
event->pending_addr = 0;
if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
event->pending_addr = data->addr;
- irq_work_queue(&event->pending_irq);
+
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
- * consuming pending_sigtrap; with exceptions:
+ * consuming pending_work; with exceptions:
*
* 1. Where !exclude_kernel, events can overflow again
* in the kernel without returning to user space.
@@ -9756,7 +9745,7 @@ static int __perf_event_overflow(struct perf_event *event,
* To approximate progress (with false negatives),
* check 32-bit hash of the current IP.
*/
- WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+ WARN_ON_ONCE(event->pending_work != pending_id);
}
}
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: perf/core] perf: Enqueue SIGTRAP always via task_work.
2024-07-04 17:03 ` [PATCH v5 3/7] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior,
Peter Zijlstra (Intel), Marco Elver, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: c5d93d23a26012a98b77f9e354ab9b3afd420059
Gitweb: https://git.kernel.org/tip/c5d93d23a26012a98b77f9e354ab9b3afd420059
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:37 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:35 +02:00
perf: Enqueue SIGTRAP always via task_work.
A signal is delivered by raising irq_work(), which works from any
context including NMI. irq_work() can be delayed if the architecture
does not provide an interrupt vector. In order not to lose a signal,
the signal is injected via task_work during event_sched_out().
Instead of going via irq_work, the signal can be added directly via
task_work. The signal is sent to current and can be enqueued on its
return path to userland.
Queue the signal via task_work and consider possible NMI context.
Remove perf_event::pending_sigtrap and use perf_event::pending_work
instead.
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-4-bigeasy@linutronix.de
---
include/linux/perf_event.h | 3 +--
kernel/events/core.c | 31 ++++++++++---------------------
2 files changed, 11 insertions(+), 23 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 393fb13..ea0d824 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -781,7 +781,6 @@ struct perf_event {
unsigned int pending_wakeup;
unsigned int pending_kill;
unsigned int pending_disable;
- unsigned int pending_sigtrap;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
struct callback_head pending_task;
@@ -963,7 +962,7 @@ struct perf_event_context {
struct rcu_head rcu_head;
/*
- * Sum (event->pending_sigtrap + event->pending_work)
+ * Sum (event->pending_work + event->pending_work)
*
* The SIGTRAP is targeted at ctx->task, as such it won't do changing
* that until the signal is delivered.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c54da50..73e1b02 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2283,17 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
state = PERF_EVENT_STATE_OFF;
}
- if (event->pending_sigtrap) {
- event->pending_sigtrap = 0;
- if (state != PERF_EVENT_STATE_OFF &&
- !event->pending_work &&
- !task_work_add(current, &event->pending_task, TWA_RESUME)) {
- event->pending_work = 1;
- } else {
- local_dec(&event->ctx->nr_pending);
- }
- }
-
perf_event_set_state(event, state);
if (!is_software_event(event))
@@ -6776,11 +6765,6 @@ static void __perf_pending_irq(struct perf_event *event)
* Yay, we hit home and are in the context of the event.
*/
if (cpu == smp_processor_id()) {
- if (event->pending_sigtrap) {
- event->pending_sigtrap = 0;
- perf_sigtrap(event);
- local_dec(&event->ctx->nr_pending);
- }
if (event->pending_disable) {
event->pending_disable = 0;
perf_event_disable_local(event);
@@ -9721,21 +9705,26 @@ static int __perf_event_overflow(struct perf_event *event,
*/
bool valid_sample = sample_is_allowed(event, regs);
unsigned int pending_id = 1;
+ enum task_work_notify_mode notify_mode;
if (regs)
pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
- if (!event->pending_sigtrap) {
- event->pending_sigtrap = pending_id;
+
+ notify_mode = in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
+
+ if (!event->pending_work &&
+ !task_work_add(current, &event->pending_task, notify_mode)) {
+ event->pending_work = pending_id;
local_inc(&event->ctx->nr_pending);
event->pending_addr = 0;
if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
event->pending_addr = data->addr;
- irq_work_queue(&event->pending_irq);
+
} else if (event->attr.exclude_kernel && valid_sample) {
/*
* Should not be able to return to user space without
- * consuming pending_sigtrap; with exceptions:
+ * consuming pending_work; with exceptions:
*
* 1. Where !exclude_kernel, events can overflow again
* in the kernel without returning to user space.
@@ -9745,7 +9734,7 @@ static int __perf_event_overflow(struct perf_event *event,
* To approximate progress (with false negatives),
* check 32-bit hash of the current IP.
*/
- WARN_ON_ONCE(event->pending_sigtrap != pending_id);
+ WARN_ON_ONCE(event->pending_work != pending_id);
}
}
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 4/7] perf: Shrink the size of the recursion counter.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2024-07-04 17:03 ` [PATCH v5 3/7] perf: Enqueue SIGTRAP always via task_work Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 5/7] perf: Move swevent_htable::recursion into task_struct Sebastian Andrzej Siewior
` (4 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior
There are four recursion counters, one for each context. The type of
the counter is `int' but the counter is used as `bool' since it is
only incremented if zero.
The main goal here is to shrink the whole struct into a 32-bit int
which can later be added to task_struct in an existing hole.
Reduce the type of the recursion counter to an unsigned char, keep the
increment/decrement operation.
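The size effect can be sketched in plain userland C (stand-in names;
the kernel computes the context index via interrupt_context_level(),
here it is passed in directly):

```c
#include <assert.h>
#include <stdint.h>

/* Four u8 counters, one per context, used as booleans: the whole
 * array now fits in a single 32-bit word and can later be placed in
 * an existing task_struct hole. */
#define NR_CONTEXTS 4	/* PERF_NR_CONTEXTS in the kernel */

static uint8_t recursion[NR_CONTEXTS];

/* Returns rctx on success, -1 if this context is already active. */
static int get_recursion_context_u8(int rctx)
{
	if (recursion[rctx])
		return -1;
	recursion[rctx]++;
	return rctx;
}

static void put_recursion_context_u8(int rctx)
{
	recursion[rctx]--;
}
```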
Tested-by: Marco Elver <elver@google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/events/callchain.c | 2 +-
kernel/events/core.c | 2 +-
kernel/events/internal.h | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392cf..ad57944b6c40e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -29,7 +29,7 @@ static inline size_t perf_callchain_entry__sizeof(void)
sysctl_perf_event_max_contexts_per_stack));
}
-static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
+static DEFINE_PER_CPU(u8, callchain_recursion[PERF_NR_CONTEXTS]);
static atomic_t nr_callchain_events;
static DEFINE_MUTEX(callchain_mutex);
static struct callchain_cpus_entries *callchain_cpus_entries;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c278aefa94e76..bd4b81bf63b6d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9776,7 +9776,7 @@ struct swevent_htable {
int hlist_refcount;
/* Recursion avoidance in each contexts */
- int recursion[PERF_NR_CONTEXTS];
+ u8 recursion[PERF_NR_CONTEXTS];
};
static DEFINE_PER_CPU(struct swevent_htable, swevent_htable);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 5150d5f84c033..f9a3244206b20 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,7 +208,7 @@ arch_perf_out_copy_user(void *dst, const void *src, unsigned long n)
DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
-static inline int get_recursion_context(int *recursion)
+static inline int get_recursion_context(u8 *recursion)
{
unsigned char rctx = interrupt_context_level();
@@ -221,7 +221,7 @@ static inline int get_recursion_context(int *recursion)
return rctx;
}
-static inline void put_recursion_context(int *recursion, int rctx)
+static inline void put_recursion_context(u8 *recursion, int rctx)
{
barrier();
recursion[rctx]--;
--
2.45.2
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [tip: perf/core] perf: Shrink the size of the recursion counter.
2024-07-04 17:03 ` [PATCH v5 4/7] perf: Shrink the size of the recursion counter Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), Marco Elver,
x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 5af42f928f3ac555c228740fb4a92d05b19fdd49
Gitweb: https://git.kernel.org/tip/5af42f928f3ac555c228740fb4a92d05b19fdd49
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:38 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:35 +02:00
perf: Shrink the size of the recursion counter.
There are four recursion counters, one for each context. The type of each
counter is `int' but it is used as a `bool' since it is only incremented
if zero.
The main goal here is to shrink the whole array to a single 32bit word
which can later be placed into an existing hole in task_struct.
Reduce the type of the recursion counter to an unsigned char and keep the
increment/decrement operations.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-5-bigeasy@linutronix.de
---
kernel/events/callchain.c | 2 +-
kernel/events/core.c | 2 +-
kernel/events/internal.h | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be8..ad57944 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -29,7 +29,7 @@ static inline size_t perf_callchain_entry__sizeof(void)
sysctl_perf_event_max_contexts_per_stack));
}
-static DEFINE_PER_CPU(int, callchain_recursion[PERF_NR_CONTEXTS]);
+static DEFINE_PER_CPU(u8, callchain_recursion[PERF_NR_CONTEXTS]);
static atomic_t nr_callchain_events;
static DEFINE_MUTEX(callchain_mutex);
static struct callchain_cpus_entries *callchain_cpus_entries;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 73e1b02..53e2750 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9765,7 +9765,7 @@ struct swevent_htable {
int hlist_refcount;
/* Recursion avoidance in each contexts */
- int recursion[PERF_NR_CONTEXTS];
+ u8 recursion[PERF_NR_CONTEXTS];
};
static DEFINE_PER_CPU(struct swevent_htable, swevent_htable);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 386d21c..7f06b79 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -208,7 +208,7 @@ arch_perf_out_copy_user(void *dst, const void *src, unsigned long n)
DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
-static inline int get_recursion_context(int *recursion)
+static inline int get_recursion_context(u8 *recursion)
{
unsigned char rctx = interrupt_context_level();
@@ -221,7 +221,7 @@ static inline int get_recursion_context(int *recursion)
return rctx;
}
-static inline void put_recursion_context(int *recursion, int rctx)
+static inline void put_recursion_context(u8 *recursion, int rctx)
{
barrier();
recursion[rctx]--;
* [PATCH v5 5/7] perf: Move swevent_htable::recursion into task_struct.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2024-07-04 17:03 ` [PATCH v5 4/7] perf: Shrink the size of the recursion counter Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 6/7] perf: Don't disable preemption in perf_pending_task() Sebastian Andrzej Siewior
` (3 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior
The swevent_htable::recursion counter is used to avoid recursion: it
prevents creating a swevent while another event is being processed. The
counter is per-CPU and preemption must be disabled to have a stable
counter.
perf_pending_task() disables preemption to access the counter and then to
send the signal. This is problematic on PREEMPT_RT because sending a
signal uses a spinlock_t which must not be acquired in atomic context on
PREEMPT_RT because it becomes a sleeping lock.
The atomic context can be avoided by moving the counter into the
task_struct. There is a 4 byte hole between futex_state (usually always
enabled) and the following perf pointer (perf_event_ctxp). After the
recursion counter lost some weight, it fits there perfectly.
Move swevent_htable::recursion into task_struct.
Tested-by: Marco Elver <elver@google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/perf_event.h | 6 ------
include/linux/sched.h | 7 +++++++
kernel/events/core.c | 13 +++----------
kernel/events/internal.h | 2 +-
4 files changed, 11 insertions(+), 17 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ea0d82418d854..99a7ea1d29ed5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -970,12 +970,6 @@ struct perf_event_context {
local_t nr_pending;
};
-/*
- * Number of contexts where an event can trigger:
- * task, softirq, hardirq, nmi.
- */
-#define PERF_NR_CONTEXTS 4
-
struct perf_cpu_pmu_context {
struct perf_event_pmu_context epc;
struct perf_event_pmu_context *task_epc;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6d..afb1087f5831b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -734,6 +734,12 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+/*
+ * Number of contexts where an event can trigger:
+ * task, softirq, hardirq, nmi.
+ */
+#define PERF_NR_CONTEXTS 4
+
struct wake_q_node {
struct wake_q_node *next;
};
@@ -1256,6 +1262,7 @@ struct task_struct {
unsigned int futex_state;
#endif
#ifdef CONFIG_PERF_EVENTS
+ u8 perf_recursion[PERF_NR_CONTEXTS];
struct perf_event_context *perf_event_ctxp;
struct mutex perf_event_mutex;
struct list_head perf_event_list;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bd4b81bf63b6d..1a26a9c33306a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9774,11 +9774,7 @@ struct swevent_htable {
struct swevent_hlist *swevent_hlist;
struct mutex hlist_mutex;
int hlist_refcount;
-
- /* Recursion avoidance in each contexts */
- u8 recursion[PERF_NR_CONTEXTS];
};
-
static DEFINE_PER_CPU(struct swevent_htable, swevent_htable);
/*
@@ -9976,17 +9972,13 @@ DEFINE_PER_CPU(struct pt_regs, __perf_regs[4]);
int perf_swevent_get_recursion_context(void)
{
- struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
-
- return get_recursion_context(swhash->recursion);
+ return get_recursion_context(current->perf_recursion);
}
EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);
void perf_swevent_put_recursion_context(int rctx)
{
- struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
-
- put_recursion_context(swhash->recursion, rctx);
+ put_recursion_context(current->perf_recursion, rctx);
}
void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
@@ -13653,6 +13645,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
{
int ret;
+ memset(child->perf_recursion, 0, sizeof(child->perf_recursion));
child->perf_event_ctxp = NULL;
mutex_init(&child->perf_event_mutex);
INIT_LIST_HEAD(&child->perf_event_list);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index f9a3244206b20..f0daaa6f2a33b 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -221,7 +221,7 @@ static inline int get_recursion_context(u8 *recursion)
return rctx;
}
-static inline void put_recursion_context(u8 *recursion, int rctx)
+static inline void put_recursion_context(u8 *recursion, unsigned char rctx)
{
barrier();
recursion[rctx]--;
--
2.45.2
* [tip: perf/core] perf: Move swevent_htable::recursion into task_struct.
2024-07-04 17:03 ` [PATCH v5 5/7] perf: Move swevent_htable::recursion into task_struct Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), Marco Elver,
x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 0d40a6d83e3e6751f1107ba33587262d937c969f
Gitweb: https://git.kernel.org/tip/0d40a6d83e3e6751f1107ba33587262d937c969f
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:39 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:36 +02:00
perf: Move swevent_htable::recursion into task_struct.
The swevent_htable::recursion counter is used to avoid recursion: it
prevents creating a swevent while another event is being processed. The
counter is per-CPU and preemption must be disabled to have a stable
counter.
perf_pending_task() disables preemption to access the counter and then to
send the signal. This is problematic on PREEMPT_RT because sending a
signal uses a spinlock_t which must not be acquired in atomic context on
PREEMPT_RT because it becomes a sleeping lock.
The atomic context can be avoided by moving the counter into the
task_struct. There is a 4 byte hole between futex_state (usually always
enabled) and the following perf pointer (perf_event_ctxp). After the
recursion counter lost some weight, it fits there perfectly.
Move swevent_htable::recursion into task_struct.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-6-bigeasy@linutronix.de
---
include/linux/perf_event.h | 6 ------
include/linux/sched.h | 7 +++++++
kernel/events/core.c | 13 +++----------
kernel/events/internal.h | 2 +-
4 files changed, 11 insertions(+), 17 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ea0d824..99a7ea1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -970,12 +970,6 @@ struct perf_event_context {
local_t nr_pending;
};
-/*
- * Number of contexts where an event can trigger:
- * task, softirq, hardirq, nmi.
- */
-#define PERF_NR_CONTEXTS 4
-
struct perf_cpu_pmu_context {
struct perf_event_pmu_context epc;
struct perf_event_pmu_context *task_epc;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac..afb1087 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -734,6 +734,12 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+/*
+ * Number of contexts where an event can trigger:
+ * task, softirq, hardirq, nmi.
+ */
+#define PERF_NR_CONTEXTS 4
+
struct wake_q_node {
struct wake_q_node *next;
};
@@ -1256,6 +1262,7 @@ struct task_struct {
unsigned int futex_state;
#endif
#ifdef CONFIG_PERF_EVENTS
+ u8 perf_recursion[PERF_NR_CONTEXTS];
struct perf_event_context *perf_event_ctxp;
struct mutex perf_event_mutex;
struct list_head perf_event_list;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 53e2750..b523225 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9763,11 +9763,7 @@ struct swevent_htable {
struct swevent_hlist *swevent_hlist;
struct mutex hlist_mutex;
int hlist_refcount;
-
- /* Recursion avoidance in each contexts */
- u8 recursion[PERF_NR_CONTEXTS];
};
-
static DEFINE_PER_CPU(struct swevent_htable, swevent_htable);
/*
@@ -9965,17 +9961,13 @@ DEFINE_PER_CPU(struct pt_regs, __perf_regs[4]);
int perf_swevent_get_recursion_context(void)
{
- struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
-
- return get_recursion_context(swhash->recursion);
+ return get_recursion_context(current->perf_recursion);
}
EXPORT_SYMBOL_GPL(perf_swevent_get_recursion_context);
void perf_swevent_put_recursion_context(int rctx)
{
- struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
-
- put_recursion_context(swhash->recursion, rctx);
+ put_recursion_context(current->perf_recursion, rctx);
}
void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
@@ -13642,6 +13634,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
{
int ret;
+ memset(child->perf_recursion, 0, sizeof(child->perf_recursion));
child->perf_event_ctxp = NULL;
mutex_init(&child->perf_event_mutex);
INIT_LIST_HEAD(&child->perf_event_list);
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 7f06b79..4515144 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -221,7 +221,7 @@ static inline int get_recursion_context(u8 *recursion)
return rctx;
}
-static inline void put_recursion_context(u8 *recursion, int rctx)
+static inline void put_recursion_context(u8 *recursion, unsigned char rctx)
{
barrier();
recursion[rctx]--;
* [PATCH v5 6/7] perf: Don't disable preemption in perf_pending_task().
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
` (4 preceding siblings ...)
2024-07-04 17:03 ` [PATCH v5 5/7] perf: Move swevent_htable::recursion into task_struct Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 17:03 ` [PATCH v5 7/7] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
` (2 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior
perf_pending_task() is invoked in task context and disables preemption
because perf_swevent_get_recursion_context() used to access per-CPU
variables. The other reason is to create an RCU read section while
accessing the perf_event.
The recursion counter is no longer per-CPU, so disabling preemption is no
longer required. The RCU read section is still needed and must be created
explicitly.
Replace the preemption-disable section with an explicit RCU read section.
Tested-by: Marco Elver <elver@google.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/events/core.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1a26a9c33306a..67f5aab933c81 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5208,10 +5208,9 @@ static void perf_pending_task_sync(struct perf_event *event)
}
/*
- * All accesses related to the event are within the same
- * non-preemptible section in perf_pending_task(). The RCU
- * grace period before the event is freed will make sure all
- * those accesses are complete by then.
+ * All accesses related to the event are within the same RCU section in
+ * perf_pending_task(). The RCU grace period before the event is freed
+ * will make sure all those accesses are complete by then.
*/
rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
}
@@ -6842,7 +6841,7 @@ static void perf_pending_task(struct callback_head *head)
* critical section as the ->pending_work reset. See comment in
* perf_pending_task_sync().
*/
- preempt_disable_notrace();
+ rcu_read_lock();
/*
* If we 'fail' here, that's OK, it means recursion is already disabled
* and we won't recurse 'further'.
@@ -6855,10 +6854,10 @@ static void perf_pending_task(struct callback_head *head)
local_dec(&event->ctx->nr_pending);
rcuwait_wake_up(&event->pending_work_wait);
}
+ rcu_read_unlock();
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
- preempt_enable_notrace();
}
#ifdef CONFIG_GUEST_PERF_EVENTS
--
2.45.2
* [tip: perf/core] perf: Don't disable preemption in perf_pending_task().
2024-07-04 17:03 ` [PATCH v5 6/7] perf: Don't disable preemption in perf_pending_task() Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Sebastian Andrzej Siewior, Peter Zijlstra (Intel), Marco Elver,
x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 16b9569df9d2ab07eeee075cb7895e9d3e08e8f0
Gitweb: https://git.kernel.org/tip/16b9569df9d2ab07eeee075cb7895e9d3e08e8f0
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:40 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:36 +02:00
perf: Don't disable preemption in perf_pending_task().
perf_pending_task() is invoked in task context and disables preemption
because perf_swevent_get_recursion_context() used to access per-CPU
variables. The other reason is to create an RCU read section while
accessing the perf_event.
The recursion counter is no longer per-CPU, so disabling preemption is no
longer required. The RCU read section is still needed and must be created
explicitly.
Replace the preemption-disable section with an explicit RCU read section.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-7-bigeasy@linutronix.de
---
kernel/events/core.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b523225..96e03d6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5208,10 +5208,9 @@ static void perf_pending_task_sync(struct perf_event *event)
}
/*
- * All accesses related to the event are within the same
- * non-preemptible section in perf_pending_task(). The RCU
- * grace period before the event is freed will make sure all
- * those accesses are complete by then.
+ * All accesses related to the event are within the same RCU section in
+ * perf_pending_task(). The RCU grace period before the event is freed
+ * will make sure all those accesses are complete by then.
*/
rcuwait_wait_event(&event->pending_work_wait, !event->pending_work, TASK_UNINTERRUPTIBLE);
}
@@ -6831,7 +6830,7 @@ static void perf_pending_task(struct callback_head *head)
* critical section as the ->pending_work reset. See comment in
* perf_pending_task_sync().
*/
- preempt_disable_notrace();
+ rcu_read_lock();
/*
* If we 'fail' here, that's OK, it means recursion is already disabled
* and we won't recurse 'further'.
@@ -6844,10 +6843,10 @@ static void perf_pending_task(struct callback_head *head)
local_dec(&event->ctx->nr_pending);
rcuwait_wake_up(&event->pending_work_wait);
}
+ rcu_read_unlock();
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
- preempt_enable_notrace();
}
#ifdef CONFIG_GUEST_PERF_EVENTS
* [PATCH v5 7/7] perf: Split __perf_pending_irq() out of perf_pending_irq()
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
` (5 preceding siblings ...)
2024-07-04 17:03 ` [PATCH v5 6/7] perf: Don't disable preemption in perf_pending_task() Sebastian Andrzej Siewior
@ 2024-07-04 17:03 ` Sebastian Andrzej Siewior
2024-07-09 11:41 ` [tip: perf/core] " tip-bot2 for Sebastian Andrzej Siewior
2024-07-04 19:45 ` [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Arnaldo Carvalho de Melo
2024-07-05 7:55 ` Peter Zijlstra
8 siblings, 1 reply; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-04 17:03 UTC (permalink / raw)
To: linux-perf-users, linux-kernel
Cc: Adrian Hunter, Alexander Shishkin, Arnaldo Carvalho de Melo,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner, Sebastian Andrzej Siewior,
Arnaldo Carvalho de Melo
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up
while the latter disables perf-events.
While perf_pending_irq() is an irq_work, its callback is invoked in
thread context on PREEMPT_RT. This is needed because all the waking
functions (wake_up_all(), kill_fasync()) acquire sleeping locks which
must not be used with disabled interrupts.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.
Split the functionality based on perf_event::pending_disable into a new
irq_work named `pending_disable_irq' and invoke it in hardirq context on
PREEMPT_RT. Rename the split-out callback to perf_pending_disable().
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/all/ZMAtZ2t43GXoF6tM@kernel.org/
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 29 ++++++++++++++++++++++-------
2 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 99a7ea1d29ed5..65ece0d5b4b6d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -783,6 +783,7 @@ struct perf_event {
unsigned int pending_disable;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
+ struct irq_work pending_disable_irq;
struct callback_head pending_task;
unsigned int pending_work;
struct rcuwait pending_work_wait;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 67f5aab933c81..0acf6ee4df528 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2451,7 +2451,7 @@ static void __perf_event_disable(struct perf_event *event,
* hold the top-level event's child_mutex, so any descendant that
* goes to exit will block in perf_event_exit_event().
*
- * When called from perf_pending_irq it's OK because event->ctx
+ * When called from perf_pending_disable it's OK because event->ctx
* is the current context on this CPU and preemption is disabled,
* hence we can't get into perf_event_task_sched_out for this context.
*/
@@ -2491,7 +2491,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
void perf_event_disable_inatomic(struct perf_event *event)
{
event->pending_disable = 1;
- irq_work_queue(&event->pending_irq);
+ irq_work_queue(&event->pending_disable_irq);
}
#define MAX_INTERRUPTS (~0ULL)
@@ -5218,6 +5218,7 @@ static void perf_pending_task_sync(struct perf_event *event)
static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
+ irq_work_sync(&event->pending_disable_irq);
perf_pending_task_sync(event);
unaccount_event(event);
@@ -6760,7 +6761,7 @@ static void perf_sigtrap(struct perf_event *event)
/*
* Deliver the pending work in-event-context or follow the context.
*/
-static void __perf_pending_irq(struct perf_event *event)
+static void __perf_pending_disable(struct perf_event *event)
{
int cpu = READ_ONCE(event->oncpu);
@@ -6798,11 +6799,26 @@ static void __perf_pending_irq(struct perf_event *event)
* irq_work_queue(); // FAILS
*
* irq_work_run()
- * perf_pending_irq()
+ * perf_pending_disable()
*
* But the event runs on CPU-B and wants disabling there.
*/
- irq_work_queue_on(&event->pending_irq, cpu);
+ irq_work_queue_on(&event->pending_disable_irq, cpu);
+}
+
+static void perf_pending_disable(struct irq_work *entry)
+{
+ struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
+ int rctx;
+
+ /*
+ * If we 'fail' here, that's OK, it means recursion is already disabled
+ * and we won't recurse 'further'.
+ */
+ rctx = perf_swevent_get_recursion_context();
+ __perf_pending_disable(event);
+ if (rctx >= 0)
+ perf_swevent_put_recursion_context(rctx);
}
static void perf_pending_irq(struct irq_work *entry)
@@ -6825,8 +6841,6 @@ static void perf_pending_irq(struct irq_work *entry)
perf_event_wakeup(event);
}
- __perf_pending_irq(event);
-
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
}
@@ -11967,6 +11981,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
+ event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
rcuwait_init(&event->pending_work_wait);
--
2.45.2
* [tip: perf/core] perf: Split __perf_pending_irq() out of perf_pending_irq()
2024-07-04 17:03 ` [PATCH v5 7/7] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
@ 2024-07-09 11:41 ` tip-bot2 for Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: tip-bot2 for Sebastian Andrzej Siewior @ 2024-07-09 11:41 UTC (permalink / raw)
To: linux-tip-commits
Cc: Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior,
Peter Zijlstra (Intel), Marco Elver, x86, linux-kernel
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 2b84def990d388bed4afe4f21ae383a01991046c
Gitweb: https://git.kernel.org/tip/2b84def990d388bed4afe4f21ae383a01991046c
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
AuthorDate: Thu, 04 Jul 2024 19:03:41 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 09 Jul 2024 13:26:37 +02:00
perf: Split __perf_pending_irq() out of perf_pending_irq()
perf_pending_irq() invokes perf_event_wakeup() and __perf_pending_irq().
The former is in charge of waking any tasks which wait to be woken up
while the latter disables perf-events.
While perf_pending_irq() is an irq_work, its callback is invoked in
thread context on PREEMPT_RT. This is needed because all the waking
functions (wake_up_all(), kill_fasync()) acquire sleeping locks which
must not be used with disabled interrupts.
Disabling events, as done by __perf_pending_irq(), expects a hardirq
context and disabled interrupts. This requirement is not fulfilled on
PREEMPT_RT.
Split the functionality based on perf_event::pending_disable into a new
irq_work named `pending_disable_irq' and invoke it in hardirq context on
PREEMPT_RT. Rename the split-out callback to perf_pending_disable().
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240704170424.1466941-8-bigeasy@linutronix.de
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 29 ++++++++++++++++++++++-------
2 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 99a7ea1..65ece0d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -783,6 +783,7 @@ struct perf_event {
unsigned int pending_disable;
unsigned long pending_addr; /* SIGTRAP */
struct irq_work pending_irq;
+ struct irq_work pending_disable_irq;
struct callback_head pending_task;
unsigned int pending_work;
struct rcuwait pending_work_wait;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 96e03d6..f64c30e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2451,7 +2451,7 @@ static void __perf_event_disable(struct perf_event *event,
* hold the top-level event's child_mutex, so any descendant that
* goes to exit will block in perf_event_exit_event().
*
- * When called from perf_pending_irq it's OK because event->ctx
+ * When called from perf_pending_disable it's OK because event->ctx
* is the current context on this CPU and preemption is disabled,
* hence we can't get into perf_event_task_sched_out for this context.
*/
@@ -2491,7 +2491,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
void perf_event_disable_inatomic(struct perf_event *event)
{
event->pending_disable = 1;
- irq_work_queue(&event->pending_irq);
+ irq_work_queue(&event->pending_disable_irq);
}
#define MAX_INTERRUPTS (~0ULL)
@@ -5218,6 +5218,7 @@ static void perf_pending_task_sync(struct perf_event *event)
static void _free_event(struct perf_event *event)
{
irq_work_sync(&event->pending_irq);
+ irq_work_sync(&event->pending_disable_irq);
perf_pending_task_sync(event);
unaccount_event(event);
@@ -6749,7 +6750,7 @@ static void perf_sigtrap(struct perf_event *event)
/*
* Deliver the pending work in-event-context or follow the context.
*/
-static void __perf_pending_irq(struct perf_event *event)
+static void __perf_pending_disable(struct perf_event *event)
{
int cpu = READ_ONCE(event->oncpu);
@@ -6787,11 +6788,26 @@ static void __perf_pending_irq(struct perf_event *event)
* irq_work_queue(); // FAILS
*
* irq_work_run()
- * perf_pending_irq()
+ * perf_pending_disable()
*
* But the event runs on CPU-B and wants disabling there.
*/
- irq_work_queue_on(&event->pending_irq, cpu);
+ irq_work_queue_on(&event->pending_disable_irq, cpu);
+}
+
+static void perf_pending_disable(struct irq_work *entry)
+{
+ struct perf_event *event = container_of(entry, struct perf_event, pending_disable_irq);
+ int rctx;
+
+ /*
+ * If we 'fail' here, that's OK, it means recursion is already disabled
+ * and we won't recurse 'further'.
+ */
+ rctx = perf_swevent_get_recursion_context();
+ __perf_pending_disable(event);
+ if (rctx >= 0)
+ perf_swevent_put_recursion_context(rctx);
}
static void perf_pending_irq(struct irq_work *entry)
@@ -6814,8 +6830,6 @@ static void perf_pending_irq(struct irq_work *entry)
perf_event_wakeup(event);
}
- __perf_pending_irq(event);
-
if (rctx >= 0)
perf_swevent_put_recursion_context(rctx);
}
@@ -11956,6 +11970,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending_irq, perf_pending_irq);
+ event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
init_task_work(&event->pending_task, perf_pending_task);
rcuwait_init(&event->pending_work_wait);
* Re: [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
` (6 preceding siblings ...)
2024-07-04 17:03 ` [PATCH v5 7/7] perf: Split __perf_pending_irq() out of perf_pending_irq() Sebastian Andrzej Siewior
@ 2024-07-04 19:45 ` Arnaldo Carvalho de Melo
2024-07-04 19:46 ` Arnaldo Carvalho de Melo
2024-07-05 6:16 ` Sebastian Andrzej Siewior
2024-07-05 7:55 ` Peter Zijlstra
8 siblings, 2 replies; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-07-04 19:45 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner
On Thu, Jul 04, 2024 at 07:03:34PM +0200, Sebastian Andrzej Siewior wrote:
> Hi,
>
> Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> the signal gets delayed until event_sched_out() which then uses
> task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> signal is delivered with disabled preemption.
>
> While looking at this, I also stumbled upon __perf_pending_irq() which
> requires disabled interrupts but this is not the case on PREEMPT_RT.
>
> This series aims to address both issues while not introducing a new one
> at the same time ;)
> Any testing is appreciated.
Where should I apply this series? The v4 series was applied to
linux-rt-devel/linux-6.10.y-rt IIRC
- Arnaldo
> v4…v5: https://lore.kernel.org/all/20240624152732.1231678-1-bigeasy@linutronix.de/
> - Add TWA_NMI_CURRENT as notify mode for task_work_add() and use it.
> PeterZ pointed out that the current version is not NMI safe.
>
> v3…v4: https://lore.kernel.org/all/20240322065208.60456-1-bigeasy@linutronix.de/
> - Rebased on top of Frederic's series
> (https://lore.kernel.org/all/20240621091601.18227-1-frederic@kernel.org)
> - Frederic pointed out that perf_pending_task() needs to
> perf_swevent_get_recursion_context() in order not to recurse if
> something within perf_swevent_.*_recursion_context() triggers a
> software event. To address this, the counters have been moved to
> the task_struct (#3 + #4) and preempt_disable() has been replaced
> with an RCU read lock (#5).
> - The remaining logic stays the same: the event is pushed to task-work
> instead of being delivered from IRQ-work. The series was tested with
> remove_on_exec as suggested by Marco Elver: On PREEMPT_RT a single
> invocation passes, 100 parallel invocations report (for some)
> unexpected SIGTRAPs and timeouts. This was also observed on !RT
> (without the series) with a higher task-count.
>
> v2…v3: https://lore.kernel.org/all/20240312180814.3373778-1-bigeasy@linutronix.de/
> - Marco suggested to add a few comments
> - Added a comment to __perf_event_overflow() to explain why irq_work
> is raised in the in_nmi() case.
> - Added a comment to perf_event_exit_event() to explain why the
> pending event is deleted.
>
> v1…v2: https://lore.kernel.org/all/20240308175810.2894694-1-bigeasy@linutronix.de/
> - Marco pointed me to the testsuite that showed two problems:
> - Delayed task_work from NMI / missing events.
> Fixed by triggering a dummy irq_work to enforce an interrupt for
> the exit-to-userland path, which checks task_work.
> - Increased ref-count on cleanup / during exec.
> Mostly addressed by the former change. There is still a window
> if the NMI occurs during execve(). This is addressed by removing
> the task_work before free_event().
> The testsuite (remove_on_exec) fails sometimes if the event/
> SIGTRAP is sent before the sighandler is installed.
>
> Sebastian
>
* Re: [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
2024-07-04 19:45 ` [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Arnaldo Carvalho de Melo
@ 2024-07-04 19:46 ` Arnaldo Carvalho de Melo
2024-07-05 6:16 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 22+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-07-04 19:46 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner
On Thu, Jul 04, 2024 at 04:45:22PM -0300, Arnaldo Carvalho de Melo wrote:
> On Thu, Jul 04, 2024 at 07:03:34PM +0200, Sebastian Andrzej Siewior wrote:
> > Hi,
> >
> > Arnaldo reported that "perf test sigtrap" fails on PREEMPT_RT. Sending
> > the signal gets delayed until event_sched_out() which then uses
> > task_work_add() for its delivery. This breaks on PREEMPT_RT because the
> > signal is delivered with disabled preemption.
> >
> > While looking at this, I also stumbled upon __perf_pending_irq() which
> > requires disabled interrupts but this is not the case on PREEMPT_RT.
> >
> > This series aims to address both issues while not introducing a new one
> > at the same time ;)
> > Any testing is appreciated.
>
> Where should I apply this series? The v4 series was applied to
> linux-rt-devel/linux-6.10.y-rt IIRC
Looking at linux-rt-devel/linux-6.10.y-rt I see:
commit ca8b27c51f0962f8fb59e5acb23e0af791fb5c04
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Tue Jun 25 10:56:32 2024 +0200
perf: Update the perf series
This is an all-in-one patch integrating the following changes:
- Merging Frederick's "Fix leaked sigtrap events" series as of v4 which
is a dependency.
- Update the "perf test sigtrap" fixup to v4 as posted.
Link: https://lore.kernel.org/20240621091601.18227-1-frederic@kernel.org
Link: https://lore.kernel.org/20240624152732.1231678-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
But I think that is v4, right?
- Arnaldo
* Re: [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
2024-07-04 19:45 ` [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Arnaldo Carvalho de Melo
2024-07-04 19:46 ` Arnaldo Carvalho de Melo
@ 2024-07-05 6:16 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-05 6:16 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo
Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
Frederic Weisbecker, Ian Rogers, Ingo Molnar, Jiri Olsa,
Kan Liang, Marco Elver, Mark Rutland, Namhyung Kim,
Peter Zijlstra, Thomas Gleixner
On 2024-07-04 16:45:18 [-0300], Arnaldo Carvalho de Melo wrote:
> Where should I apply this series? The v4 series was applied to
> linux-rt-devel/linux-6.10.y-rt IIRC
Revert the v4 series, apply this one. Wait a few hours and there should
be a new RT release with this done ;)
> - Arnaldo
Sebastian
* Re: [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
2024-07-04 17:03 [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Sebastian Andrzej Siewior
` (7 preceding siblings ...)
2024-07-04 19:45 ` [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT Arnaldo Carvalho de Melo
@ 2024-07-05 7:55 ` Peter Zijlstra
2024-07-05 7:58 ` Sebastian Andrzej Siewior
8 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2024-07-05 7:55 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Frederic Weisbecker, Ian Rogers,
Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
Namhyung Kim, Thomas Gleixner
Thanks, I'll go queue it for perf/core once the robot gets back to me on
the current pile.
* Re: [PATCH v5 0/6] perf: Make SIGTRAP and __perf_pending_irq() work on RT.
2024-07-05 7:55 ` Peter Zijlstra
@ 2024-07-05 7:58 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 22+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-07-05 7:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-perf-users, linux-kernel, Adrian Hunter, Alexander Shishkin,
Arnaldo Carvalho de Melo, Frederic Weisbecker, Ian Rogers,
Ingo Molnar, Jiri Olsa, Kan Liang, Marco Elver, Mark Rutland,
Namhyung Kim, Thomas Gleixner
On 2024-07-05 09:55:11 [+0200], Peter Zijlstra wrote:
>
>
> Thanks, I'll go queue it for perf/core once the robot gets back to me on
> the current pile.
Thank you.
Sebastian