* [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
@ 2011-03-24 16:44 Jiri Olsa
From: Jiri Olsa @ 2011-03-24 16:44 UTC (permalink / raw)
To: Peter Zijlstra, Paul Mackerras, Ingo Molnar; +Cc: oleg, linux-kernel
hi,
the following program triggers a panic in the current tip tree.
run it like: "while [ 1 ]; do ./ctx ; done"
--- reproducer
#define _GNU_SOURCE		/* for asprintf() */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

static int trace_event__id(const char *evname)
{
	char *filename;
	int err = -1, fd;

	if (asprintf(&filename,
		     "/sys/kernel/debug/tracing/events/syscalls/%s/id",
		     evname) < 0)
		return -1;

	fd = open(filename, O_RDONLY);
	if (fd >= 0) {
		char id[16];
		if (read(fd, id, sizeof(id)) > 0)
			err = atoi(id);
		close(fd);
	}
	free(filename);
	return err;
}

static inline int
sys_perf_event_open(struct perf_event_attr *attr,
		    pid_t pid, int cpu, int group_fd,
		    unsigned long flags)
{
	attr->size = sizeof(*attr);
	return syscall(__NR_perf_event_open, attr, pid, cpu,
		       group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
#define CPU_MAX 2
	int fd[CPU_MAX], cpu, i, l;
	int id = trace_event__id("sys_enter_open");

	if (id < 0) {
		printf("is debugfs mounted on /sys/kernel/debug?\n");
		return -1;
	}

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.config = id;

	for (l = 0; l < 5; l++) {
		for (cpu = 0; cpu < CPU_MAX; cpu++)
			fd[cpu] = sys_perf_event_open(&attr,
						      0, cpu, -1, 0);
		for (i = 0; i < 1000; i++)
			close(open("krava", 0));

		for (cpu = 0; cpu < CPU_MAX; cpu++)
			close(fd[cpu]);
	}
	return 0;
}
---
--- panic msg
[ 157.737618] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 157.737618] IP: [<ffffffff81104e9b>] perf_ctx_adjust_freq+0x5b/0x130
[ 157.737618] PGD 7835c067 PUD 7be57067 PMD 0
[ 157.737618] Oops: 0000 [#1] SMP
[ 157.737618] last sysfs file: /sys/devices/system/cpu/online
[ 157.737618] CPU 1
[ 157.737618] Modules linked in:
[ 157.737618]
[ 157.737618] Pid: 2112, comm: perf Not tainted 2.6.38-tip+ #602 Intel Corporation Montevina platform/To be filled by O.E.M.
[ 157.737618] RIP: 0010:[<ffffffff81104e9b>] [<ffffffff81104e9b>] perf_ctx_adjust_freq+0x5b/0x130
[ 157.737618] RSP: 0018:ffff88007b483d68 EFLAGS: 00010087
[ 157.737618] RAX: 0000000000000000 RBX: ffff880077a811c0 RCX: ffffffff818b32ab
[ 157.737618] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff880077a811cc
[ 157.737618] RBP: ffff88007b483da8 R08: 0000000000000000 R09: 0000000000000000
[ 157.737618] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000f41a8
[ 157.737618] R13: fffffffffffffff0 R14: ffff880077a81210 R15: ffff88007b483d70
[ 157.737618] FS: 00007fba6834e720(0000) GS:ffff88007b480000(0000) knlGS:0000000000000000
[ 157.737618] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 157.737618] CR2: 0000000000000000 CR3: 000000007bf93000 CR4: 00000000000406e0
[ 157.737618] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 157.737618] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 157.737618] Process perf (pid: 2112, threadinfo ffff88007be02000, task ffff880076088400)
[ 157.737618] Stack:
[ 157.737618] ffff88007b483da8 0000000000000000 0000000000000000 ffff88007b4940d0
[ 157.737618] ffff88007b48e578 ffff880077a811c0 ffff88007b4941a8 0000000000000000
[ 157.737618] ffff88007b483e18 ffffffff811050e7 ffff88007b491540 ffff880000000001
[ 157.737618] Call Trace:
[ 157.737618] <IRQ>
[ 157.737618] [<ffffffff811050e7>] perf_event_task_tick+0x177/0x330
[ 157.737618] [<ffffffff8107b408>] scheduler_tick+0xd8/0x300
[ 157.737618] [<ffffffff81091abe>] update_process_times+0x6e/0x90
[ 157.737618] [<ffffffff810b49a3>] tick_sched_timer+0x93/0xe0
[ 157.737618] [<ffffffff810a7ed1>] __run_hrtimer+0x121/0x250
[ 157.737618] [<ffffffff810b4910>] ? tick_sched_timer+0x0/0xe0
[ 157.737618] [<ffffffff810a832e>] hrtimer_interrupt+0xee/0x210
[ 157.737618] [<ffffffff81514f5b>] smp_apic_timer_interrupt+0x6b/0x9b
[ 157.737618] [<ffffffff81514393>] apic_timer_interrupt+0x13/0x20
[ 157.737618] <EOI>
[ 157.737618] [<ffffffff8108166b>] ? vprintk+0x22b/0x480
[ 157.737618] [<ffffffff810fe02a>] ? perf_event_read+0x11a/0x120
[ 157.737618] [<ffffffff81508226>] printk+0x41/0x43
[ 157.737618] [<ffffffff810ff5d4>] perf_release+0x34/0xe0
[ 157.737618] [<ffffffff8114e644>] fput+0x124/0x2d0
[ 157.737618] [<ffffffff8114a66b>] filp_close+0x7b/0xd0
[ 157.737618] [<ffffffff8114a75d>] sys_close+0x9d/0x130
[ 157.737618] [<ffffffff81513c02>] tracesys+0xd0/0xd5
[ 157.737618] Code: 48 8d 7b 0c e8 d7 66 40 00 48 8b 43 50 48 89 45 c8 4c 8b 6d c8 49 83 ed 10 eb 10 0f 1f 44 00 00 48 89 45 c8 4d 8b 2f 49 83 ed 10
[ 157.737618] 8b 45 10 49 8d 55 10 49 39 d6 0f 18 08 74 55 41 83 7d 58 01
[ 157.737618] RIP [<ffffffff81104e9b>] perf_ctx_adjust_freq+0x5b/0x130
[ 157.737618] RSP <ffff88007b483d68>
[ 157.737618] CR2: 0000000000000000
[ 157.737618] ---[ end trace 7175df90a55db905 ]---
---
The reason is that task_ctx (in struct perf_cpu_context) is holding an
invalid pointer inside the perf_rotate_context function.
I think I found the cause and can work around the issue with the attached
patch, but I don't think it's the proper solution, and I'm not sure how
to fix this properly ;)
Now here's what I think is happening..
- once an event is created by sys_perf_event_open, a task context
  is created and it stays even if the event is closed, until the task
  is finished ... that's what I see in the code and I assume it's correct
- when the task opens an event, the perf_sched_events jump label is
  incremented and the following callbacks are started from the scheduler:

      __perf_event_task_sched_in
      __perf_event_task_sched_out

  These *in/*out callbacks set/unset the cpuctx->task_ctx value to the
  task context.
- close is called on the event on CPU 0:
    - the task is scheduled on CPU 0
    - __perf_event_task_sched_in is called
    - cpuctx->task_ctx is set
    - perf_sched_events jump label is decremented and == 0
    - __perf_event_task_sched_out is not called
    - cpuctx->task_ctx on CPU 0 stays set

- exit is called on CPU 1:
    - the task is scheduled on CPU 1
    - perf_event_exit_task is called
    - task_ctx_sched_out unsets cpuctx->task_ctx on CPU 1
    - put_ctx destroys the context

- another call of perf_rotate_context on CPU 0 will use the invalid
  task_ctx pointer, and eventually panic
The attached workaround makes sure that the task_ctx is not set
when the context is being removed. As I said, it's not meant to be
the fix.
any ideas?
thanks,
jirka
---
kernel/perf_event.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c75925c..25baf13 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -596,6 +596,19 @@ static void free_ctx(struct rcu_head *head)
static void put_ctx(struct perf_event_context *ctx)
{
if (atomic_dec_and_test(&ctx->refcount)) {
+
+ if (!atomic_read(&perf_sched_events)) {
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct perf_cpu_context *cpuctx;
+
+ cpuctx = per_cpu_ptr(ctx->pmu->pmu_cpu_context, cpu);
+ if (cpuctx->task_ctx == ctx)
+ cpuctx->task_ctx = NULL;
+ }
+ }
+
if (ctx->parent_ctx)
put_ctx(ctx->parent_ctx);
if (ctx->task)
--
1.7.1
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-25 19:10 UTC (permalink / raw)
To: Jiri Olsa; +Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/24, Jiri Olsa wrote:
>
> - close is called on the event on CPU 0:
>     - the task is scheduled on CPU 0
>     - __perf_event_task_sched_in is called
>     - cpuctx->task_ctx is set
>     - perf_sched_events jump label is decremented and == 0
>     - __perf_event_task_sched_out is not called
>     - cpuctx->task_ctx on CPU 0 stays set

I think you are right. And this is already wrong afaics, even if we add
the workaround into free_ctx/etc.

For example, suppose that we attach another PERF_ATTACH_TASK counter to
this task later. perf_sched_events will be incremented again, but
perf_install_in_context() should hang retrying until this task runs
again, ->is_active == T.

Or, even if sys_perf_event_open() succeeds, the next
perf_event_context_sched_in() will do nothing because it checks
cpuctx->task_ctx != ctx (unless another task has a counter and
schedules in on that CPU).

This should be fixed somehow, but so far I have no ideas.

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Peter Zijlstra @ 2011-03-26 15:37 UTC (permalink / raw)
To: Jiri Olsa; +Cc: Paul Mackerras, Ingo Molnar, oleg, linux-kernel

On Thu, 2011-03-24 at 17:44 +0100, Jiri Olsa wrote:
>
> Now here's what I think is happening..
>
> - once an event is created by sys_perf_event_open, a task context
>   is created and it stays even if the event is closed, until the task
>   is finished ... that's what I see in the code and I assume it's correct

Correct, I recently spoke to someone interested in 'curing' that, but
as it stands that's how it is.

> - when the task opens an event, the perf_sched_events jump label is
>   incremented and the following callbacks are started from the
>   scheduler:
>
>       __perf_event_task_sched_in
>       __perf_event_task_sched_out
>
>   These *in/*out callbacks set/unset the cpuctx->task_ctx value to
>   the task context.

*nod*

> - close is called on the event on CPU 0:
>     - the task is scheduled on CPU 0
>     - __perf_event_task_sched_in is called
>     - cpuctx->task_ctx is set
>     - perf_sched_events jump label is decremented and == 0
>     - __perf_event_task_sched_out is not called
>     - cpuctx->task_ctx on CPU 0 stays set
>
> - exit is called on CPU 1:
>     - the task is scheduled on CPU 1
>     - perf_event_exit_task is called
>     - task_ctx_sched_out unsets cpuctx->task_ctx on CPU 1
>     - put_ctx destroys the context
>
> - another call of perf_rotate_context on CPU 0 will use the invalid
>   task_ctx pointer, and eventually panic
>
> The attached workaround makes sure that the task_ctx is not set
> when the context is being removed. As I said, it's not meant to be
> the fix.

Still having somewhat of a cold, how does the below look?

(completely untested so far, will have to bang on your testcase a bit
to make it work).
---
 kernel/perf_event.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c75925c..2a03cc4 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1112,6 +1112,8 @@ static int __perf_remove_from_context(void *info)
 	raw_spin_lock(&ctx->lock);
 	event_sched_out(event, cpuctx, ctx);
 	list_del_event(event, ctx);
+	if (cpuctx->task_ctx == event->ctx && !event->ctx->nr_active)
+		cpuctx->task_ctx = NULL;
 	raw_spin_unlock(&ctx->lock);

 	return 0;
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-26 16:13 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/26, Peter Zijlstra wrote:
>
> On Thu, 2011-03-24 at 17:44 +0100, Jiri Olsa wrote:
> >
> > - close is called on the event on CPU 0:
> >     - the task is scheduled on CPU 0
> >     - __perf_event_task_sched_in is called
> >     - cpuctx->task_ctx is set
> >     - perf_sched_events jump label is decremented and == 0
> >     - __perf_event_task_sched_out is not called
> >     - cpuctx->task_ctx on CPU 0 stays set
> >
> > - exit is called on CPU 1:
> >     - the task is scheduled on CPU 1
> >     - perf_event_exit_task is called
> >     - task_ctx_sched_out unsets cpuctx->task_ctx on CPU 1
> >     - put_ctx destroys the context
> >
> > - another call of perf_rotate_context on CPU 0 will use the invalid
> >   task_ctx pointer, and eventually panic
> >
> > The attached workaround makes sure that the task_ctx is not set
> > when the context is being removed. As I said, it's not meant to be
> > the fix.
>
> Still having somewhat of a cold, how does the below look?
>
> (completely untested so far, will have to bang on your testcase a bit
> to make it work).
>
> ---
>  kernel/perf_event.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index c75925c..2a03cc4 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -1112,6 +1112,8 @@ static int __perf_remove_from_context(void *info)
>  	raw_spin_lock(&ctx->lock);
>  	event_sched_out(event, cpuctx, ctx);
>  	list_del_event(event, ctx);
> +	if (cpuctx->task_ctx == event->ctx && !event->ctx->nr_active)
> +		cpuctx->task_ctx = NULL;

I don't think this is right.

It is too late to clear ->task_ctx when the task exits. It is simply
wrong that cpuctx->task_ctx != NULL after context_switch(). And, once
again, ->is_active is still true.

Besides, you can't trust __get_cpu_context() here: it is possible that
another CPU has cpuctx->task_ctx == event->ctx. Otherwise
task_ctx_sched_out() has already cleared cpuctx->task_ctx.

Finally, in this case there are no events attached to this context,
close(event_fd) removes the only one.

But there is one thing I can't understand. Jiri can trigger this bug
even with HAVE_JUMP_LABEL. How?

OK, jump_label_dec/jump_label_inc are obviously racy, but this
test-case can't trigger the race. So, we are doing
free_event()->jump_label_dec()->jump_label_update(DISABLE) and this
implies __stop_machine(). This means we have at least one
context_switch() from the task with the active ->task_ctx to the
migration thread. And this happens before the JUMP_LABEL() code was
actually changed, so perf_event_task_sched_out() should call
__perf_event_task_sched_out() and clear task_ctx. perf_sched_events
is already zero, but this shouldn't matter.

Confused.

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Peter Zijlstra @ 2011-03-26 16:38 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Sat, 2011-03-26 at 17:13 +0100, Oleg Nesterov wrote:
>
> I don't think this is right.
>
> It is too late to clear ->task_ctx when the task exits. It is simply
> wrong that cpuctx->task_ctx != NULL after context_switch(). And, once
> again, ->is_active is still true.

Ah, indeed. Somehow I was thinking we did remove_from_context on
close(); I thought about adding it to list_del_event() because that's
the place we empty the context, and I guess I should have done
something like that.

The problem with adding it to list_del_event() is that it isn't
necessarily called on the correct cpu.

/me ponders things for a bit ...

The alternative seems to be to put it in event_sched_out(), right
where we decrease ctx->nr_active.

Again, I might have missed something quite trivial.. let me try and
get that test-case running someplace.

---
 kernel/perf_event.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c75925c..e9e4e35 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1073,6 +1073,8 @@ event_sched_out(struct perf_event *event,
 	if (!is_software_event(event))
 		cpuctx->active_oncpu--;
 	ctx->nr_active--;
+	if (!ctx->nr_active && cpuctx->task_ctx == ctx)
+		cpuctx->task_ctx = NULL;
 	if (event->attr.exclusive || !cpuctx->active_oncpu)
 		cpuctx->exclusive = 0;
 }
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-26 17:09 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/26, Peter Zijlstra wrote:
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index c75925c..e9e4e35 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -1073,6 +1073,8 @@ event_sched_out(struct perf_event *event,
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu--;
>  	ctx->nr_active--;
> +	if (!ctx->nr_active && cpuctx->task_ctx == ctx)
> +		cpuctx->task_ctx = NULL;

If we clear cpuctx->task_ctx, we should also clear ctx->is_active.

Otherwise I can't see any problem, but I do not understand this code
enough.

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-26 17:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/26, Oleg Nesterov wrote:
>
> On 03/26, Peter Zijlstra wrote:
> >
> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > index c75925c..e9e4e35 100644
> > --- a/kernel/perf_event.c
> > +++ b/kernel/perf_event.c
> > @@ -1073,6 +1073,8 @@ event_sched_out(struct perf_event *event,
> >  	if (!is_software_event(event))
> >  		cpuctx->active_oncpu--;
> >  	ctx->nr_active--;
> > +	if (!ctx->nr_active && cpuctx->task_ctx == ctx)
> > +		cpuctx->task_ctx = NULL;
>
> If we clear cpuctx->task_ctx, we should also clear ctx->is_active.
>
> Otherwise I can't see any problem, but I do not understand this code
> enough.

but probably we also need update_context_time().

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Peter Zijlstra @ 2011-03-26 18:29 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Sat, 2011-03-26 at 18:35 +0100, Oleg Nesterov wrote:
> On 03/26, Oleg Nesterov wrote:
> >
> > On 03/26, Peter Zijlstra wrote:
> > >
> > > @@ -1073,6 +1073,8 @@ event_sched_out(struct perf_event *event,
> > >  	if (!is_software_event(event))
> > >  		cpuctx->active_oncpu--;
> > >  	ctx->nr_active--;
> > > +	if (!ctx->nr_active && cpuctx->task_ctx == ctx)
> > > +		cpuctx->task_ctx = NULL;
> >
> > If we clear cpuctx->task_ctx, we should also clear ctx->is_active.

Right.

> > Otherwise I can't see any problem, but I do not understand this code
> > enough.
>
> but probably we also need update_context_time().

It looks like event_sched_out() relies on up-to-date ctx->time through
perf_event_time(), and most call-paths leading to event_sched_out() do
indeed seem to update the ctx time, all except the move_group branch
in perf_event_open() afaict.

Sadly, the reproducer doesn't seem to trigger the issue at all; it's
still running on a plain -tip kernel.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-26 18:49 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/26, Peter Zijlstra wrote:
>
> On Sat, 2011-03-26 at 18:35 +0100, Oleg Nesterov wrote:
> >
> > but probably we also need update_context_time().
>
> It looks like event_sched_out() relies on up-to-date ctx->time through
> perf_event_time(), and most call-paths leading to event_sched_out() do
> indeed seem to update the ctx time, all except the move_group branch
> in perf_event_open() afaict.

OK, thanks...

> Sadly, the reproducer doesn't seem to trigger the issue at all; it's
> still running on a plain -tip kernel.

This test-case is not "perfect", the task should change its CPU after
it closes the last perf_event fd and before it exits...

But to me, the main question is: I do not understand how this
test-case can ever trigger the problem with HAVE_JUMP_LABEL. But it
does.

I didn't try to reproduce, my gcc is old and not-CC_HAVE_ASM_GOTO.

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-28 13:30 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/26, Peter Zijlstra wrote:
>
> On Sat, 2011-03-26 at 18:35 +0100, Oleg Nesterov wrote:
> > On 03/26, Oleg Nesterov wrote:
> > >
> > > On 03/26, Peter Zijlstra wrote:
> > > >
> > > > @@ -1073,6 +1073,8 @@ event_sched_out(struct perf_event *event,
> > > >  	if (!is_software_event(event))
> > > >  		cpuctx->active_oncpu--;
> > > >  	ctx->nr_active--;
> > > > +	if (!ctx->nr_active && cpuctx->task_ctx == ctx)
> > > > +		cpuctx->task_ctx = NULL;
> > >
> > > If we clear cpuctx->task_ctx, we should also clear ctx->is_active.
>
> Right.

Wait... Yes, we have to clear ctx->is_active, otherwise we break, say,
perf_install_in_context().

But if we clear ->is_active, we break perf_event_enable(). Suppose we
are doing ioctl(PERF_EVENT_IOC_DISABLE) + ioctl(PERF_EVENT_IOC_ENABLE).
PERF_EVENT_IOC_DISABLE can sched_out the last event, but _IOC_ENABLE
treats ctx->is_active == F as "it is not running".

Btw, why does ctx_sched_out() check nr_events under perf_pmu_disable()?

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Peter Zijlstra @ 2011-03-28 14:57 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Mon, 2011-03-28 at 15:30 +0200, Oleg Nesterov wrote:
>
> Wait... Yes, we have to clear ctx->is_active, otherwise we break, say,
> perf_install_in_context().
>
> But if we clear ->is_active, we break perf_event_enable(). Suppose we
> are doing ioctl(PERF_EVENT_IOC_DISABLE) + ioctl(PERF_EVENT_IOC_ENABLE).
> PERF_EVENT_IOC_DISABLE can sched_out the last event, but _IOC_ENABLE
> treats ctx->is_active == F as "it is not running".

Right, same for the tick: if, say, we can only schedule 1 event at a
time and we close the 1 event that is active, the tick will not rotate
a new event in.

/me goes ponder things..

> Btw, why does ctx_sched_out() check nr_events under perf_pmu_disable()?

hysterical-raisins or somesuch, how about the below:

---
Subject: perf: Optimize ctx_sched_out
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Mar 28 16:55:36 CEST 2011

Oleg noted that ctx_sched_out() disables the PMU even though it might
not actually do something, avoid needless PMU-disabling.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 kernel/perf_event.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -1767,7 +1767,6 @@ static void ctx_sched_out(struct perf_ev
 	struct perf_event *event;

 	raw_spin_lock(&ctx->lock);
-	perf_pmu_disable(ctx->pmu);
 	ctx->is_active = 0;
 	if (likely(!ctx->nr_events))
 		goto out;
@@ -1777,6 +1776,7 @@ static void ctx_sched_out(struct perf_ev
 	if (!ctx->nr_active)
 		goto out;

+	perf_pmu_disable(ctx->pmu);
 	if (event_type & EVENT_PINNED) {
 		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
 			group_sched_out(event, cpuctx, ctx);
@@ -1786,8 +1786,8 @@ static void ctx_sched_out(struct perf_ev
 		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
 			group_sched_out(event, cpuctx, ctx);
 	}
-out:
 	perf_pmu_enable(ctx->pmu);
+out:
 	raw_spin_unlock(&ctx->lock);
 }
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Peter Zijlstra @ 2011-03-28 15:00 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Mon, 2011-03-28 at 16:57 +0200, Peter Zijlstra wrote:
>
> ---
> Subject: perf: Optimize ctx_sched_out
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Mon Mar 28 16:55:36 CEST 2011
>
> Oleg noted that ctx_sched_out() disables the PMU even though it might
> not actually do something, avoid needless PMU-disabling.
>
> Reported-by: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <new-submission>
> ---
>  kernel/perf_event.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/kernel/perf_event.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_event.c
> +++ linux-2.6/kernel/perf_event.c
> @@ -1767,7 +1767,6 @@ static void ctx_sched_out(struct perf_ev
>  	struct perf_event *event;
>
>  	raw_spin_lock(&ctx->lock);
> -	perf_pmu_disable(ctx->pmu);
>  	ctx->is_active = 0;
>  	if (likely(!ctx->nr_events))
>  		goto out;
> @@ -1777,6 +1776,7 @@ static void ctx_sched_out(struct perf_ev
>  	if (!ctx->nr_active)
>  		goto out;
>
> +	perf_pmu_disable(ctx->pmu);
>  	if (event_type & EVENT_PINNED) {
>  		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>  			group_sched_out(event, cpuctx, ctx);
> @@ -1786,8 +1786,8 @@ static void ctx_sched_out(struct perf_ev
>  		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
>  			group_sched_out(event, cpuctx, ctx);
>  	}
> -out:
>  	perf_pmu_enable(ctx->pmu);
> +out:
>  	raw_spin_unlock(&ctx->lock);
>  }

We could probably remove the PMU disabling altogether, but that would
involve like actual thinking :-)
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Oleg Nesterov @ 2011-03-28 15:15 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/28, Peter Zijlstra wrote:
>
> --- linux-2.6.orig/kernel/perf_event.c
> +++ linux-2.6/kernel/perf_event.c
> @@ -1767,7 +1767,6 @@ static void ctx_sched_out(struct perf_ev
>  	struct perf_event *event;
>
>  	raw_spin_lock(&ctx->lock);
> -	perf_pmu_disable(ctx->pmu);
>  	ctx->is_active = 0;
>  	if (likely(!ctx->nr_events))
>  		goto out;
> @@ -1777,6 +1776,7 @@ static void ctx_sched_out(struct perf_ev
>  	if (!ctx->nr_active)
>  		goto out;
>
> +	perf_pmu_disable(ctx->pmu);
>  	if (event_type & EVENT_PINNED) {
>  		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
>  			group_sched_out(event, cpuctx, ctx);
> @@ -1786,8 +1786,8 @@ static void ctx_sched_out(struct perf_ev
>  		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
>  			group_sched_out(event, cpuctx, ctx);
>  	}
> -out:
>  	perf_pmu_enable(ctx->pmu);
> +out:
>  	raw_spin_unlock(&ctx->lock);

Yes, thanks.

Probably this doesn't matter from the performance pov, but imho this
makes the code more understandable. This is important for occasional
readers like me ;)

Could you answer another question? It is not immediately clear why
ctx_sched_in() does not check nr_active != 0 before doing
ctx_XXX_sched_in(). I guess the only reason is perf_rotate_context()
and the similar logic in perf_event_context_sched_in(). If we are
doing, say, cpu_ctx_sched_out(FLEXIBLE) + cpu_ctx_sched_in(FLEXIBLE),
then ->nr_active can be zero after cpu_ctx_sched_out().

Is my understanding correct? Or is there another reason?

Oleg.
* Re: [PATCH,RFC] perf: panic due to invalid cpu context task_ctx value
From: Peter Zijlstra @ 2011-03-28 16:27 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Mon, 2011-03-28 at 17:15 +0200, Oleg Nesterov wrote:
>
> Yes, thanks.
>
> Probably this doesn't matter from the performance pov, but imho this
> makes the code more understandable. This is important for occasional
> readers like me ;)

Could actually save quite a lot of cycles, pmu-disable/enable can be
very expensive on some hardware.

> Could you answer another question? It is not immediately clear why
> ctx_sched_in() does not check nr_active != 0 before doing
> ctx_XXX_sched_in(). I guess the only reason is perf_rotate_context()
> and the similar logic in perf_event_context_sched_in(). If we are
> doing, say, cpu_ctx_sched_out(FLEXIBLE) + cpu_ctx_sched_in(FLEXIBLE),
> then ->nr_active can be zero after cpu_ctx_sched_out().
>
> Is my understanding correct? Or is there another reason?

nr_active counts the number of events that have been scheduled in, so
it's perfectly fine to have either nr_active or !nr_active at that
point.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-28 16:27 ` Peter Zijlstra @ 2011-03-28 15:39 ` Oleg Nesterov 0 siblings, 0 replies; 41+ messages in thread From: Oleg Nesterov @ 2011-03-28 15:39 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On 03/28, Peter Zijlstra wrote: > > On Mon, 2011-03-28 at 17:15 +0200, Oleg Nesterov wrote: > > > Could you answer another question? It is not immediately clear why > > ctx_sched_in() does not check nr_active != 0 > > nr_active counts the number of events that have been scheduled in OOPS! indeed, can't understand what I was thinking about... Thanks. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-28 14:57 ` Peter Zijlstra 2011-03-28 15:00 ` Peter Zijlstra 2011-03-28 15:15 ` Oleg Nesterov @ 2011-03-28 15:49 ` Peter Zijlstra 2011-03-28 16:56 ` Oleg Nesterov 2 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2011-03-28 15:49 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On Mon, 2011-03-28 at 16:57 +0200, Peter Zijlstra wrote: > > > > Wait... Yes, we have to clear ctx->is_active, otherwise we break, say, > > perf_install_in_context(). > > > > But if we clear ->is_active we break perf_event_enable(). Suppose we > > are doing ioctl(PERF_EVENT_IOC_DISABLE) + ioctl(PERF_EVENT_IOC_ENABLE). > > PERF_EVENT_IOC_DISABLE can sched_out the last event, but _IOC_ENABLE > > treats ctx->is_active == F as "it is not running". > > Right, same for the tick, if say we can only schedule 1 event at a time > and we close the 1 event that is active, the tick will not rotate a new > event in. Another fun race, suppose we do properly remove task_ctx and is_active, but then the task gets scheduled back in before free_event() gets around to disabling the jump_label.. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-28 15:49 ` Peter Zijlstra @ 2011-03-28 16:56 ` Oleg Nesterov 2011-03-29 8:32 ` Peter Zijlstra 2011-03-30 13:09 ` Jiri Olsa 0 siblings, 2 replies; 41+ messages in thread From: Oleg Nesterov @ 2011-03-28 16:56 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On 03/28, Peter Zijlstra wrote: > > Another fun race, suppose we do properly remove task_ctx and is_active, > but then the task gets scheduled back in before free_event() gets around > to disabling the jump_label.. Yes, this too... Well, ignoring the HAVE_JUMP_LABEL case... perhaps we can split perf_sched_events into 2 counters? I mean, atomic_t perf_sched_events_in, perf_sched_events_out; static inline void perf_event_task_sched_in(struct task_struct *task) { COND_STMT(&perf_sched_events_in, __perf_event_task_sched_in(task)); } static inline void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next) { perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); COND_STMT(&perf_sched_events_out, __perf_event_task_sched_out(task, next)); } void perf_sched_events_inc(void) { atomic_inc(&perf_sched_events_out); smp_mb__after_atomic_inc(); atomic_inc(&perf_sched_events_in); } void perf_sched_events_dec(void) { if (atomic_dec_and_test(&perf_sched_events_in)) synchronize_sched(); atomic_dec(&perf_sched_events_out); } The last 2 helpers should be used instead of jump_label_inc/dec. Or we can remove COND_STMT() from perf_event_task_sched_out(). Say, __perf_event_task_sched_in() can set PF_PERF_XXX or something, then perf_event_task_sched_out() can check this bit. As for HAVE_JUMP_LABEL, I still can't understand why this test-case triggers the problem. But jump_label_inc/dec logic looks obviously racy. 
jump_label_dec: if (atomic_dec_and_test(key)) jump_label_disable(key); Another thread can create the PERF_ATTACH_TASK event in between and call jump_label_update(JUMP_LABEL_ENABLE) first. Looks like, jump_label_update() should ensure that "type" matches the state of the "*key" under jump_label_lock(). But perhaps I misread this code completely. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-28 16:56 ` Oleg Nesterov @ 2011-03-29 8:32 ` Peter Zijlstra 2011-03-29 10:49 ` Peter Zijlstra 2011-03-29 16:28 ` Oleg Nesterov 2011-03-30 13:09 ` Jiri Olsa 1 sibling, 2 replies; 41+ messages in thread From: Peter Zijlstra @ 2011-03-29 8:32 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On Mon, 2011-03-28 at 18:56 +0200, Oleg Nesterov wrote: > On 03/28, Peter Zijlstra wrote: > > > > Another fun race, suppose we do properly remove task_ctx and is_active, > > but then the task gets scheduled back in before free_event() gets around > > to disabling the jump_label.. > > Yes, this too... > > Well, ignoring the HAVE_JUMP_LABEL case... perhaps we can split > perf_sched_events into 2 counters? I mean, > > atomic_t perf_sched_events_in, perf_sched_events_out; > > static inline void perf_event_task_sched_in(struct task_struct *task) > { > COND_STMT(&perf_sched_events_in, __perf_event_task_sched_in(task)); > } > > static inline > void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next) > { > perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); > > COND_STMT(&perf_sched_events_out, __perf_event_task_sched_out(task, next)); > } > > void perf_sched_events_inc(void) > { > atomic_inc(&perf_sched_events_out); > smp_mb__after_atomic_inc(); > atomic_inc(&perf_sched_events_in); > } > > void perf_sched_events_dec(void) > { > if (atomic_dec_and_test(&perf_sched_events_in)) > synchronize_sched(); > atomic_dec(&perf_sched_events_out); > } > > The last 2 helpers should be used instead of jump_label_inc/dec. Very clever, my approach was to make __perf_event_task_sched_in() a NOP when !nr_events, which opens up another race against perf_install_in_context() but hey ;-) Added my current hackery below. > As for HAVE_JUMP_LABEL, I still can't understand why this test-case > triggers the problem. FWIW I tested without that.. 
> But jump_label_inc/dec logic looks obviously > racy. > > jump_label_dec: > > if (atomic_dec_and_test(key)) > jump_label_disable(key); > > Another thread can create the PERF_ATTACH_TASK event in between > and call jump_label_update(JUMP_LABEL_ENABLE) first. Looks like, > jump_label_update() should ensure that "type" matches the state > of the "*key" under jump_label_lock(). No I think you're right, and I think we fixed that but it looks like Ingo still didn't merge the new jump-label patches :/ --- Index: linux-2.6/kernel/perf_event.c =================================================================== --- linux-2.6.orig/kernel/perf_event.c +++ linux-2.6/kernel/perf_event.c @@ -1461,9 +1461,6 @@ static void add_event_to_ctx(struct perf event->tstamp_stopped = tstamp; } -static void perf_event_context_sched_in(struct perf_event_context *ctx, - struct task_struct *tsk); - /* * Cross CPU call to install and enable a performance event * @@ -1473,20 +1470,11 @@ static int __perf_install_in_context(vo { struct perf_event *event = info; struct perf_event_context *ctx = event->ctx; - struct perf_event *leader = event->group_leader; struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); - int err; - - /* - * In case we're installing a new context to an already running task, - * could also happen before perf_event_task_sched_in() on architectures - * which do context switches with IRQs enabled. 
- */ - if (ctx->task && !cpuctx->task_ctx) - perf_event_context_sched_in(ctx, ctx->task); + struct perf_event_context *task_ctx; + struct task_struct *task = NULL; raw_spin_lock(&ctx->lock); - ctx->is_active = 1; update_context_time(ctx); /* * update cgrp time only if current cgrp @@ -1497,43 +1485,48 @@ static int __perf_install_in_context(vo add_event_to_ctx(event, ctx); - if (!event_filter_match(event)) - goto unlock; + if (!event_filter_match(event)) { + raw_spin_unlock(&ctx->lock); + return; + } + raw_spin_unlock(&ctx->lock); /* - * Don't put the event on if it is disabled or if - * it is in a group and the group isn't on. + * Since both these are only set during context-switches + * and IRQs are disabled, their value is stable. */ - if (event->state != PERF_EVENT_STATE_INACTIVE || - (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE)) - goto unlock; + task_ctx = cpuctx->task_ctx; + perf_pmu_disable(ctx->pmu); /* - * An exclusive event can't go on if there are already active - * hardware events, and no hardware event can go on if there - * is already an exclusive event on. - */ - if (!group_can_go_on(event, cpuctx, 1)) - err = -EEXIST; - else - err = event_sched_in(event, cpuctx, ctx); - - if (err) { - /* - * This event couldn't go on. If it is in a group - * then we have to pull the whole group off. - * If the event group is pinned then put it in error state. - */ - if (leader != event) - group_sched_out(leader, cpuctx, ctx); - if (leader->attr.pinned) { - update_group_times(leader); - leader->state = PERF_EVENT_STATE_ERROR; - } - } + * Reschedule the PMU to possible include the fresh event, we take the + * brute force approach of unscheduling everything and then re-add the + * events in the correct order (CPU-pinned, TASK-pinned, CPU-flexible, + * TASK-flexible). + * + * It is possible we received this IPI before the scheduler called + * perf_event_task_sched_in() on platforms that context switch with + * interrupts enabled. 
In that case the below DTRT. + */ + cpu_ctx_sched_out(cpuctx, EVENT_ALL); + if (task_ctx) + ctx_sched_out(task_ctx, cpuctx, EVENT_ALL); + + if (ctx->task) { + cpuctx->task_ctx = task_ctx = ctx; + task = ctx->task + } else if (task_ctx) + task = task_ctx->task; + + cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task); + if (task_ctx) + ctx_sched_in(task_ctx, cpuctx, EVENT_PINNED, task); + cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task); + if (task_ctx) + ctx_sched_in(task_ctx, cpuctx, EVENT_FLEXIBLE, task); -unlock: - raw_spin_unlock(&ctx->lock); + perf_pmu_rotate_start(ctx->pmu); + perf_pmu_enable(ctx->pmu); return 0; } @@ -2114,8 +2107,19 @@ static void perf_event_context_sched_in( struct perf_cpu_context *cpuctx; cpuctx = __get_cpu_context(ctx); - if (cpuctx->task_ctx == ctx) + raw_spin_lock(&ctx->lock); + /* + * Serialize against perf_install_in_context(), the interesting case + * is where perf_install_in_context() finds the context inactive and + * another cpu is just about to schedule the task in. In that case + * we need to avoid observing a stale ctx->nr_events. + */ + ctx->is_active = 1; + if (cpuctx->task_ctx == ctx || !ctx->nr_events) { + raw_spin_lock(&ctx->lock); return; + } + raw_spin_lock(&ctx->lock); perf_pmu_disable(ctx->pmu); /* @@ -2125,12 +2129,12 @@ static void perf_event_context_sched_in( */ cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE); + cpuctx->task_ctx = ctx; + ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task); cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task); ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task); - cpuctx->task_ctx = ctx; - /* * Since these rotations are per-cpu, we need to ensure the * cpu-context we got scheduled on is actually rotating. 
@@ -2922,15 +2926,40 @@ static void free_event(struct perf_event call_rcu(&event->rcu_head, free_event_rcu); } -int perf_event_release_kernel(struct perf_event *event) +static int __perf_event_release(void *info) { + struct perf_event *event = info; struct perf_event_context *ctx = event->ctx; + struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); + int ret; /* - * Remove from the PMU, can't get re-enabled since we got - * here because the last ref went. + * Disable the event if its still running, we're shutting down. */ - perf_event_disable(event); + ret = __perf_event_disable(info); + if (ret) + return ret; + + raw_spin_lock_irq(&ctx->lock); + perf_group_detach(event); + list_del_event(event, ctx); + /* + * In case we removed the last event from an active task_ctx + * deactivate the task_ctx because this event being freed might + * lead to the perf_sched_events jump_label being disabled + * which avoids the task sched-out hook from being called. + */ + if (!ctx->nr_events && cpuctx->task_ctx == ctx) { + ctx->is_active = 0; + cpuctx->task_ctx = NULL; + } + raw_spin_unlock_irq(&ctx->lock); +} + +int perf_event_release_kernel(struct perf_event *event) +{ + struct perf_event_context *ctx = event->ctx; + struct task_struct *task = ctx->task; WARN_ON_ONCE(ctx->parent_ctx); /* @@ -2946,10 +2975,28 @@ int perf_event_release_kernel(struct per * to trigger the AB-BA case. 
*/ mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING); + if (!task) { + cpu_function_call(event->cpu, __perf_event_release, event); + goto unlock; + } + +retry: + if (!task_function_call(task, __perf_event_release, event)) + goto unlock; + raw_spin_lock_irq(&ctx->lock); + if (ctx->is_active) { + raw_spin_unlock_irq(&ctx->lock); + goto retry; + } + + WARN_ON_ONCE(event->state == PERF_EVENT_STATE_ACTIVE); + perf_group_detach(event); list_del_event(event, ctx); raw_spin_unlock_irq(&ctx->lock); + +unlock: mutex_unlock(&ctx->mutex); free_event(event); ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-29 8:32 ` Peter Zijlstra @ 2011-03-29 10:49 ` Peter Zijlstra 2011-03-29 16:28 ` Oleg Nesterov 1 sibling, 0 replies; 41+ messages in thread From: Peter Zijlstra @ 2011-03-29 10:49 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On Tue, 2011-03-29 at 10:32 +0200, Peter Zijlstra wrote: > @@ -2922,15 +2926,40 @@ static void free_event(struct perf_event > call_rcu(&event->rcu_head, free_event_rcu); > } > > -int perf_event_release_kernel(struct perf_event *event) > +static int __perf_event_release(void *info) > { > + struct perf_event *event = info; > struct perf_event_context *ctx = event->ctx; > + struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); > + int ret; > > /* > - * Remove from the PMU, can't get re-enabled since we got > - * here because the last ref went. > + * Disable the event if its still running, we're shutting down. > */ > - perf_event_disable(event); > + ret = __perf_event_disable(info); > + if (ret) > + return ret; > + > + raw_spin_lock_irq(&ctx->lock); > + perf_group_detach(event); > + list_del_event(event, ctx); > + /* > + * In case we removed the last event from an active task_ctx > + * deactivate the task_ctx because this event being freed might > + * lead to the perf_sched_events jump_label being disabled > + * which avoids the task sched-out hook from being called. > + */ > + if (!ctx->nr_events && cpuctx->task_ctx == ctx) { > + ctx->is_active = 0; > + cpuctx->task_ctx = NULL; > + } > + raw_spin_unlock_irq(&ctx->lock); > +} > + > +int perf_event_release_kernel(struct perf_event *event) > +{ > + struct perf_event_context *ctx = event->ctx; > + struct task_struct *task = ctx->task; > > WARN_ON_ONCE(ctx->parent_ctx); > /* > @@ -2946,10 +2975,28 @@ int perf_event_release_kernel(struct per > * to trigger the AB-BA case. 
> */ > mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING); > + if (!task) { > + cpu_function_call(event->cpu, __perf_event_release, event); > + goto unlock; > + } > + > +retry: > + if (!task_function_call(task, __perf_event_release, event)) > + goto unlock; > + > raw_spin_lock_irq(&ctx->lock); > + if (ctx->is_active) { > + raw_spin_unlock_irq(&ctx->lock); > + goto retry; > + } > + > + WARN_ON_ONCE(event->state == PERF_EVENT_STATE_ACTIVE); > + > perf_group_detach(event); > list_del_event(event, ctx); > raw_spin_unlock_irq(&ctx->lock); > + > +unlock: > mutex_unlock(&ctx->mutex); > > free_event(event); We can simplify that and use perf_remove_from_context(), except that it changes the close() semantics slightly for grouped events: the current code will, I think, deschedule the complete group when you close the leader, whereas when using perf_remove_from_context() we'll promote the siblings to individual events and let them run when you close the leader. I'm fairly sure no-one _should_ rely on that, but they _might_.. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-29 8:32 ` Peter Zijlstra 2011-03-29 10:49 ` Peter Zijlstra @ 2011-03-29 16:28 ` Oleg Nesterov 2011-03-29 19:01 ` Peter Zijlstra 1 sibling, 1 reply; 41+ messages in thread From: Oleg Nesterov @ 2011-03-29 16:28 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On 03/29, Peter Zijlstra wrote: > > On Mon, 2011-03-28 at 18:56 +0200, Oleg Nesterov wrote: > > > > jump_label_dec: > > > > if (atomic_dec_and_test(key)) > > jump_label_disable(key); > > > > Another thread can create the PERF_ATTACH_TASK event in between > > and call jump_label_update(JUMP_LABEL_ENABLE) first. Looks like, > > jump_label_update() should ensure that "type" matches the state > > of the "*key" under jump_label_lock(). > > No I think you're right, and I think we fixed that but it looks like > Ingo still didn't merge the new jump-label patches :/ OK. To remind, we have another problem, perf_install can race with exit. But lets ignore this for now... You know, I honestly tried to convince myself I can understand your patch. All I can say, I'll try to read it again ;) But the main idea is clear, we give more respect to ->nr_events and once it is zero task_ctx must not be active. > @@ -2114,8 +2107,19 @@ static void perf_event_context_sched_in( > struct perf_cpu_context *cpuctx; > > cpuctx = __get_cpu_context(ctx); > - if (cpuctx->task_ctx == ctx) > + raw_spin_lock(&ctx->lock); > + /* > + * Serialize against perf_install_in_context(), the interesting case > + * is where perf_install_in_context() finds the context inactive and > + * another cpu is just about to schedule the task in. In that case > + * we need to avoid observing a stale ctx->nr_events. I don't understand the comment... We can't race __perf_install_in_context, it can only run on the same CPU but we are called with irqs disabled. 
> + ctx->is_active = 1; > + if (cpuctx->task_ctx == ctx || !ctx->nr_events) { > + raw_spin_lock(&ctx->lock); > return; I guess you meant _unlock. But now I don't understand what ->is_active means. We make it true, but don't set cpuctx->task_ctx = ctx. Why does __perf_event_release() clear ->is_active then? This looks wrong at first glance. Suppose we have the same problem, this task misses perf_event_context_sched_out() after that. OK, ->task_ctx == NULL. But, suppose that after that this task sleeps "forever" and we create another counter and call perf_install_in_context() again. Now we hang in the "retry" loop. It seems to me, instead we should change ctx_sched_in() to check nr_events and do nothing if it is zero. > + } > + raw_spin_lock(&ctx->lock); Again, s/lock/unlock/ > + cpuctx->task_ctx = ctx; > + > ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task); But we already dropped ctx->lock, __perf_event_release() can remove the last event before ctx_sched_in() takes it again. This should be moved into ctx_sched_in() afaics, but this is not simple. So, perhaps we can take ctx->lock and check nr_events after the 2nd ctx_sched_in(). If it is zero, we should clear task_ctx/is_active. Perhaps. Right now I got lost. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-29 16:28 ` Oleg Nesterov @ 2011-03-29 19:01 ` Peter Zijlstra 0 siblings, 0 replies; 41+ messages in thread From: Peter Zijlstra @ 2011-03-29 19:01 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On Tue, 2011-03-29 at 18:28 +0200, Oleg Nesterov wrote: > On 03/29, Peter Zijlstra wrote: > > > > On Mon, 2011-03-28 at 18:56 +0200, Oleg Nesterov wrote: > > > > > > jump_label_dec: > > > > > > if (atomic_dec_and_test(key)) > > > jump_label_disable(key); > > > > > > Another thread can create the PERF_ATTACH_TASK event in between > > > and call jump_label_update(JUMP_LABEL_ENABLE) first. Looks like, > > > jump_label_update() should ensure that "type" matches the state > > > of the "*key" under jump_label_lock(). > > > > No I think you're right, and I think we fixed that but it looks like > > Ingo still didn't merge the new jump-label patches :/ > > OK. To remind, we have another problem, perf_install can race with exit. > But lets ignore this for now... Yay! ;-) > > You know, I honestly tried to convince myself I can understand your > patch. All I can say, I'll try to read it again ;) But the main idea > is clear, we give more respect to ->nr_events and once it is zero > task_ctx must not be active. Right, except I'm leaking an ->is_active and was seriously considering your clever idea of splitting the sched_in and sched_out jump-labels. > > @@ -2114,8 +2107,19 @@ static void perf_event_context_sched_in( > > struct perf_cpu_context *cpuctx; > > > > cpuctx = __get_cpu_context(ctx); > > - if (cpuctx->task_ctx == ctx) > > + raw_spin_lock(&ctx->lock); > > + /* > > + * Serialize against perf_install_in_context(), the interesting case > > + * is where perf_install_in_context() finds the context inactive and > > + * another cpu is just about to schedule the task in. In that case > > + * we need to avoid observing a stale ctx->nr_events. 
> > I don't understand the comment... We can't race __perf_install_in_context, > it can only run on the same CPU but we are called with irqs disabled. Ah, I meant a race with perf_install_in_context() where task_function_call() fails because !task_curr(), in that case we'll attempt add_event_to_ctx() from the remote cpu. If we race wrong with sched_in, nobody might schedule the event. > > + ctx->is_active = 1; > > + if (cpuctx->task_ctx == ctx || !ctx->nr_events) { > > + raw_spin_lock(&ctx->lock); > > return; > > I guess you meant _unlock. Yeah, utter fail day today :/ > But now I don't understand what ->is_active means. We make it true, > but doesn't set cpuctx->task_ctx = ctx. Why __perf_event_release() > clears ->is_active then? Yeah, that's the big problem I have with this. > It seems to me, instead we should change ctx_sched_in() to check > nr_events and do nothing if it is zero. You mean, not set ->is_active? > > + cpuctx->task_ctx = ctx; > > + > > ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task); > > But we already dropped ctx->lock, __perf_event_release() can remove > the last event before ctx_sched_in() takes it again. > > This should be moved into ctx_sched_in() afaics, but this is not simple. > > So, perhaps we can take ctx->lock and check nr_events after the 2nd > ctx_sched_in(). If it is zero, we should clear task_ctx/is_active. > > Perhaps. Right now I got lost. Yeah, you and me both.. I went to look at something else because I simply confused myself more. Hopefully tomorrow will bring some sanity. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-28 16:56 ` Oleg Nesterov 2011-03-29 8:32 ` Peter Zijlstra @ 2011-03-30 13:09 ` Jiri Olsa 2011-03-30 14:51 ` Peter Zijlstra 2011-03-30 15:32 ` Oleg Nesterov 1 sibling, 2 replies; 41+ messages in thread From: Jiri Olsa @ 2011-03-30 13:09 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel On Mon, Mar 28, 2011 at 06:56:48PM +0200, Oleg Nesterov wrote: > On 03/28, Peter Zijlstra wrote: > > > > Another fun race, suppose we do properly remove task_ctx and is_active, > > but then the task gets scheduled back in before free_event() gets around > > to disabling the jump_label.. > > Yes, this too... > > Well, ignoring the HAVE_JUMP_LABEL case... perhaps we can split > perf_sched_events into 2 counters? I mean, > > atomic_t perf_sched_events_in, perf_sched_events_out; > > static inline void perf_event_task_sched_in(struct task_struct *task) > { > COND_STMT(&perf_sched_events_in, __perf_event_task_sched_in(task)); > } > > static inline > void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next) > { > perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); > > COND_STMT(&perf_sched_events_out, __perf_event_task_sched_out(task, next)); > } > > void perf_sched_events_inc(void) > { > atomic_inc(&perf_sched_events_out); > smp_mb__after_atomic_inc(); > atomic_inc(&perf_sched_events_in); > } > > void perf_sched_events_dec(void) > { > if (atomic_dec_and_test(&perf_sched_events_in)) > synchronize_sched(); > atomic_dec(&perf_sched_events_out); > } > > The last 2 helpers should be used instead of jump_label_inc/dec. this approach seems to work, plz check attached patch > > > > Or we can remove COND_STMT() from perf_event_task_sched_out(). Say, > __perf_event_task_sched_in() can set PF_PERF_XXX or something, then > perf_event_task_sched_out() can check this bit. 
> > > As for HAVE_JUMP_LABEL, I still can't understand why this test-case > triggers the problem. But jump_label_inc/dec logic looks obviously > racy. I think the reason is that there are actually 2 places that need to be updated by the jump label code - perf_event_task_sched_in - perf_event_task_sched_out On my image it first patches perf_event_task_sched_out and then perf_event_task_sched_in. Now, when the code is updated, the migration thread is scheduled several times and the perf schedule callbacks are called. This seems to be happening in my setup: perf release: - jump_label_dec(perf_sched_events) - migration thread is scheduled for disabling perf_event_task_sched_out jump label - perf_event_task_sched_out/perf_event_task_sched_in is called - task_ctx is NULL, since migration thread does not have any events - perf_event_task_sched_out jump label got disabled - release code is rescheduled -> now only perf_event_task_sched_in jump label is called - task_ctx is set - migration thread is scheduled for disabling perf_event_task_sched_in jump label - perf_event_task_sched_in is called for migration thread, task_ctx stays, since migration thread does not have any events - perf_event_task_sched_in jump label is disabled - release code is scheduled -> no sched callback is called - task_ctx stays != NULL I'm almost 5% sure it could happen as I described ;) Back to the patch below: IIUIC, calling synchronize_sched() in perf_sched_events_dec() ensures that the final schedule back to the release code will not call perf_event_task_sched_in(), so task_ctx stays sane (NULL). The cool thing is that it also fixes the original issue, which was: wrong counters for perf builtin test #3.. 
which I started with from the very beginning :) please let me know what you think, thanks jirka --- include/linux/perf_event.h | 44 +++++++++++++++++++++++++++++++++++++++++--- kernel/perf_event.c | 11 ++++++----- 2 files changed, 47 insertions(+), 8 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 311b4dc..4b4f114 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1074,11 +1074,12 @@ have_event: __perf_sw_event(event_id, nr, nmi, regs, addr); } -extern atomic_t perf_sched_events; +extern atomic_t perf_sched_events_in; +extern atomic_t perf_sched_events_out; static inline void perf_event_task_sched_in(struct task_struct *task) { - COND_STMT(&perf_sched_events, __perf_event_task_sched_in(task)); + COND_STMT(&perf_sched_events_in, __perf_event_task_sched_in(task)); } static inline @@ -1086,9 +1087,46 @@ void perf_event_task_sched_out(struct task_struct *task, struct task_struct *nex { perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); - COND_STMT(&perf_sched_events, __perf_event_task_sched_out(task, next)); + COND_STMT(&perf_sched_events_out, __perf_event_task_sched_out(task, next)); } +#ifdef HAVE_JUMP_LABEL +static inline +void perf_sched_events_inc(void) +{ + jump_label_inc(&perf_sched_events_out); + smp_mb__after_atomic_inc(); + jump_label_inc(&perf_sched_events_in); +} + +static inline +void perf_sched_events_dec(void) +{ + if (atomic_dec_and_test(&perf_sched_events_in)) { + jump_label_disable(&perf_sched_events_in); + synchronize_sched(); + } + + jump_label_dec(&perf_sched_events_out); +} +#else +static inline +void perf_sched_events_inc(void) +{ + atomic_inc(&perf_sched_events_out); + smp_mb__after_atomic_inc(); + atomic_inc(&perf_sched_events_in); +} + +static inline +void perf_sched_events_dec(void) +{ + if (atomic_dec_and_test(&perf_sched_events_in)) + synchronize_sched(); + atomic_dec(&perf_sched_events_out); +} +#endif /* HAVE_JUMP_LABEL */ + extern void perf_event_mmap(struct 
vm_area_struct *vma); extern struct perf_guest_info_callbacks *perf_guest_cbs; extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); diff --git a/kernel/perf_event.c b/kernel/perf_event.c index c75925c..db38d6e 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -125,7 +125,8 @@ enum event_type_t { * perf_sched_events : >0 events exist * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu */ -atomic_t perf_sched_events __read_mostly; +atomic_t perf_sched_events_in __read_mostly; +atomic_t perf_sched_events_out __read_mostly; static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); static atomic_t nr_mmap_events __read_mostly; @@ -2894,7 +2895,7 @@ static void free_event(struct perf_event *event) if (!event->parent) { if (event->attach_state & PERF_ATTACH_TASK) - jump_label_dec(&perf_sched_events); + perf_sched_events_dec(); if (event->attr.mmap || event->attr.mmap_data) atomic_dec(&nr_mmap_events); if (event->attr.comm) @@ -2905,7 +2906,7 @@ static void free_event(struct perf_event *event) put_callchain_buffers(); if (is_cgroup_event(event)) { atomic_dec(&per_cpu(perf_cgroup_events, event->cpu)); - jump_label_dec(&perf_sched_events); + perf_sched_events_dec(); } } @@ -6248,7 +6249,7 @@ done: if (!event->parent) { if (event->attach_state & PERF_ATTACH_TASK) - jump_label_inc(&perf_sched_events); + perf_sched_events_inc(); if (event->attr.mmap || event->attr.mmap_data) atomic_inc(&nr_mmap_events); if (event->attr.comm) @@ -6490,7 +6491,7 @@ SYSCALL_DEFINE5(perf_event_open, * - that may need work on context switch */ atomic_inc(&per_cpu(perf_cgroup_events, event->cpu)); - jump_label_inc(&perf_sched_events); + perf_sched_events_inc(); } /* -- 1.7.1 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 13:09 ` Jiri Olsa @ 2011-03-30 14:51 ` Peter Zijlstra 2011-03-30 16:37 ` Oleg Nesterov 2011-03-31 13:28 ` [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value Oleg Nesterov 2011-03-30 15:32 ` Oleg Nesterov 1 sibling, 2 replies; 41+ messages in thread From: Peter Zijlstra @ 2011-03-30 14:51 UTC (permalink / raw) To: Jiri Olsa; +Cc: Oleg Nesterov, Paul Mackerras, Ingo Molnar, linux-kernel On Wed, 2011-03-30 at 15:09 +0200, Jiri Olsa wrote: > > > +#ifdef HAVE_JUMP_LABEL > +static inline > +void perf_sched_events_inc(void) > +{ > + jump_label_inc(&perf_sched_events_out); > + smp_mb__after_atomic_inc(); > + jump_label_inc(&perf_sched_events_in); > +} > + > +static inline > +void perf_sched_events_dec(void) > +{ > + if (atomic_dec_and_test(&perf_sched_events_in)) { > + jump_label_disable(&perf_sched_events_in); > + synchronize_sched(); > + } > + > + jump_label_dec(&perf_sched_events_out); > +} > +#else > +static inline > +void perf_sched_events_inc(void) > +{ > + atomic_inc(&perf_sched_events_out); > + smp_mb__after_atomic_inc(); > + atomic_inc(&perf_sched_events_in); > +} > + > +static inline > +void perf_sched_events_dec(void) > +{ > + if (atomic_dec_and_test(&perf_sched_events_in)) > + synchronize_sched(); > + atomic_dec(&perf_sched_events_out); > +} > +#endif /* HAVE_JUMP_LABEL */ I wrote a similar patch which is slightly smaller: --- Subject: perf: Delay disabling the perf_event_task_sched_out() hook From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Wed Mar 30 13:47:09 CEST 2011 Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/perf_event.h | 7 ++++--- kernel/perf_event.c | 27 ++++++++++++++++++++++----- 2 files changed, 26 insertions(+), 8 deletions(-) Index: linux-2.6/include/linux/perf_event.h =================================================================== --- 
linux-2.6.orig/include/linux/perf_event.h +++ linux-2.6/include/linux/perf_event.h @@ -1074,11 +1074,12 @@ perf_sw_event(u32 event_id, u64 nr, int __perf_sw_event(event_id, nr, nmi, regs, addr); } -extern atomic_t perf_sched_events; +extern atomic_t perf_sched_events_in; +extern atomic_t perf_sched_events_out; static inline void perf_event_task_sched_in(struct task_struct *task) { - COND_STMT(&perf_sched_events, __perf_event_task_sched_in(task)); + COND_STMT(&perf_sched_events_in, __perf_event_task_sched_in(task)); } static inline @@ -1086,7 +1087,7 @@ void perf_event_task_sched_out(struct ta { perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); - COND_STMT(&perf_sched_events, __perf_event_task_sched_out(task, next)); + COND_STMT(&perf_sched_events_out, __perf_event_task_sched_out(task, next)); } extern void perf_event_mmap(struct vm_area_struct *vma); Index: linux-2.6/kernel/perf_event.c =================================================================== --- linux-2.6.orig/kernel/perf_event.c +++ linux-2.6/kernel/perf_event.c @@ -125,9 +125,25 @@ enum event_type_t { * perf_sched_events : >0 events exist * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu */ -atomic_t perf_sched_events __read_mostly; +atomic_t perf_sched_events_in __read_mostly; +atomic_t perf_sched_events_out __read_mostly; static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); +static void perf_sched_events_inc(void) +{ + jump_label_inc(&perf_sched_events_out); + jump_label_inc(&perf_sched_events_in); +} + +static void perf_sched_events_dec(void) +{ + jump_label_dec(&perf_sched_events_in); + JUMP_LABEL(&perf_sched_events_in, no_sync); + synchronize_sched(); +no_sync: + jump_label_dec(&perf_sched_events_out); +} + static atomic_t nr_mmap_events __read_mostly; static atomic_t nr_comm_events __read_mostly; static atomic_t nr_task_events __read_mostly; @@ -2900,7 +2917,7 @@ static void free_event(struct perf_event if (!event->parent) { if (event->attach_state & PERF_ATTACH_TASK) 
- jump_label_dec(&perf_sched_events); + perf_sched_events_dec(); if (event->attr.mmap || event->attr.mmap_data) atomic_dec(&nr_mmap_events); if (event->attr.comm) @@ -2911,7 +2928,7 @@ static void free_event(struct perf_event put_callchain_buffers(); if (is_cgroup_event(event)) { atomic_dec(&per_cpu(perf_cgroup_events, event->cpu)); - jump_label_dec(&perf_sched_events); + perf_sched_events_dec(); } } @@ -6256,7 +6273,7 @@ perf_event_alloc(struct perf_event_attr if (!event->parent) { if (event->attach_state & PERF_ATTACH_TASK) - jump_label_inc(&perf_sched_events); + perf_sched_events_inc(); if (event->attr.mmap || event->attr.mmap_data) atomic_inc(&nr_mmap_events); if (event->attr.comm) @@ -6498,7 +6515,7 @@ SYSCALL_DEFINE5(perf_event_open, * - that may need work on context switch */ atomic_inc(&per_cpu(perf_cgroup_events, event->cpu)); - jump_label_inc(&perf_sched_events); + perf_sched_events_inc(); } /* ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 14:51 ` Peter Zijlstra @ 2011-03-30 16:37 ` Oleg Nesterov 2011-03-30 18:30 ` Paul E. McKenney 2011-03-30 21:26 ` Peter Zijlstra 2011-03-31 13:28 ` [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value Oleg Nesterov 1 sibling, 2 replies; 41+ messages in thread From: Oleg Nesterov @ 2011-03-30 16:37 UTC (permalink / raw) To: Peter Zijlstra Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney On 03/30, Peter Zijlstra wrote: > > --- linux-2.6.orig/kernel/perf_event.c > +++ linux-2.6/kernel/perf_event.c > @@ -125,9 +125,25 @@ enum event_type_t { > * perf_sched_events : >0 events exist > * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu > */ > -atomic_t perf_sched_events __read_mostly; > +atomic_t perf_sched_events_in __read_mostly; > +atomic_t perf_sched_events_out __read_mostly; > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); > > +static void perf_sched_events_inc(void) > +{ > + jump_label_inc(&perf_sched_events_out); > + jump_label_inc(&perf_sched_events_in); > +} > + > +static void perf_sched_events_dec(void) > +{ > + jump_label_dec(&perf_sched_events_in); > + JUMP_LABEL(&perf_sched_events_in, no_sync); > + synchronize_sched(); > +no_sync: > + jump_label_dec(&perf_sched_events_out); > +} Nice! I didn't realize we can simply use JUMP_LABEL() directly and then the code doesn't depend on HAVE_JUMP_LABEL. Now, the problem is, after I read the comments I am not sure I understand what synchronize_sched() actually does. Add Paul. So. synchronize_sched() above should ensure that all CPUs do context switch at least once (ignoring idle). And I _thought_ that in practice this should work. But, unless I misread the comment above synchronize_sched(), it seems that it only guarantees the end of "everything" which disables preemption, explicitly or not. 
IOW, say, in theory rcu_read_unlock_sched() could trigger ->passed_quiesc == T without reschedule. Oh, and this is not theoretical, afaics. run_ksoftirqd() does rcu_note_context_switch(). So, I think we need something else :/ Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 16:37 ` Oleg Nesterov @ 2011-03-30 18:30 ` Paul E. McKenney 2011-03-30 19:53 ` Oleg Nesterov 2011-03-30 21:26 ` Peter Zijlstra 1 sibling, 1 reply; 41+ messages in thread From: Paul E. McKenney @ 2011-03-30 18:30 UTC (permalink / raw) To: Oleg Nesterov Cc: Peter Zijlstra, Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On Wed, Mar 30, 2011 at 06:37:30PM +0200, Oleg Nesterov wrote: > On 03/30, Peter Zijlstra wrote: > > > > --- linux-2.6.orig/kernel/perf_event.c > > +++ linux-2.6/kernel/perf_event.c > > @@ -125,9 +125,25 @@ enum event_type_t { > > * perf_sched_events : >0 events exist > > * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu > > */ > > -atomic_t perf_sched_events __read_mostly; > > +atomic_t perf_sched_events_in __read_mostly; > > +atomic_t perf_sched_events_out __read_mostly; > > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); > > > > +static void perf_sched_events_inc(void) > > +{ > > + jump_label_inc(&perf_sched_events_out); > > + jump_label_inc(&perf_sched_events_in); > > +} > > + > > +static void perf_sched_events_dec(void) > > +{ > > + jump_label_dec(&perf_sched_events_in); > > + JUMP_LABEL(&perf_sched_events_in, no_sync); > > + synchronize_sched(); > > +no_sync: > > + jump_label_dec(&perf_sched_events_out); > > +} > > Nice! I didn't realize we can simply use JUMP_LABEL() directly and then > the code doesn't depend on HAVE_JUMP_LABEL. > > Now, the problem is, after I read the comments I am not sure I understand > what synchronize_sched() actually doe. Add Paul. > > So. synchronize_sched() above should ensure that all CPUs do context > switch at least once (ignoring idle). And I _thought_ that in practice > this should work. > > But, unles I misread the comment above synchronize_sched(), it seems that > it only guarantees the end of "everything" which disables preemption, > explicitly or not. 
IOW, say, in theory rcu_read_unlock_sched() could > trigger ->passed_quiesc == T without reschedule. For rcu_read_lock() in preemptible RCU, this is true. But for rcu_read_unlock_sched(), the only way rcu_note_context_switch() is called is if the code is preempted, which ends up calling schedule(). > Oh, and this is not theoretical, afaics. run_ksoftirqd() does > rcu_note_context_switch(). Interesting... Color me confused. Suppose the rcu_note_context_switch() in run_ksoftirqd() was replaced with schedule(). This has to be OK, right? But schedule() itself invokes rcu_note_context_switch(). So if it is OK to call schedule(), it should be OK to call rcu_note_context_switch() directly, right? So, what am I missing here? > So, I think we need something else :/ The thing that I would be more concerned about is the idle loop. If a CPU is in the idle loop, then rcu_sched_qs() will be invoked (and which is invoked by rcu_note_context_switch()). So is it illegal to use the above in the idle loop? BTW, if it turns out that the idle loop is a problem, I could put an explicit call to rcu_sched_qs() in the affected idle loops. But currently anything in an idle thread is a quiescent state. Thanx, Paul ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 18:30 ` Paul E. McKenney @ 2011-03-30 19:53 ` Oleg Nesterov 0 siblings, 0 replies; 41+ messages in thread From: Oleg Nesterov @ 2011-03-30 19:53 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel On 03/30, Paul E. McKenney wrote: > > On Wed, Mar 30, 2011 at 06:37:30PM +0200, Oleg Nesterov wrote: > > > > So. synchronize_sched() above should ensure that all CPUs do context > > switch at least once (ignoring idle). And I _thought_ that in practice > > this should work. > > > > But, unles I misread the comment above synchronize_sched(), it seems that > > it only guarantees the end of "everything" which disables preemption, > > explicitly or not. IOW, say, in theory rcu_read_unlock_sched() could > > trigger ->passed_quiesc == T without reschedule. > > For rcu_read_lock() in preemptible RCU, this is true. Hmm, not sure I understand... Do you mean that with the current implementation rcu_read_unlock() can imply rcu_sched_qs() without rescheduling ? > But for > rcu_read_unlock_sched(), the only way rcu_note_context_switch() is called > is if the code is preempted, which ends up calling schedule(). Indeed, that is why I thought synchronize_sched() can help in this case. I meant, according to the documentation it could in theory. But, > > Oh, and this is not theoretical, afaics. run_ksoftirqd() does > > rcu_note_context_switch(). > > Interesting... Color me confused. > > Suppose the rcu_note_context_switch() in run_ksoftirqd() was replaced > with schedule(). This has to be OK, right? But schedule() itself > invokes rcu_note_context_switch(). So if it is OK to call schedule(), > it should be OK to call rcu_note_context_switch() directly, right? > > So, what am I missing here? It is me, not you. Damn. It is even worse than I thought. Somehow I missed the simple fact, schedule() does not necessarily mean context_switch(). 
So the idea to use synchronize_sched() was simply wrong. Sorry to all for wasting your time ;) > > So, I think we need something else :/ > > The thing that I would be more concerned about is the idle loop. > If a CPU is in the idle loop, then rcu_sched_qs() will be invoked > (and which is invoked by rcu_note_context_switch()). So is it > illegal to use the above in the idle loop? Not sure I understand what you mean, but the idle loop is fine. An idle thread can't have the counters attached, we don't care about them. Thanks Paul, Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 16:37 ` Oleg Nesterov 2011-03-30 18:30 ` Paul E. McKenney @ 2011-03-30 21:26 ` Peter Zijlstra 2011-03-30 21:35 ` Oleg Nesterov ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread From: Peter Zijlstra @ 2011-03-30 21:26 UTC (permalink / raw) To: Oleg Nesterov Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney On Wed, 2011-03-30 at 18:37 +0200, Oleg Nesterov wrote: > > +static void perf_sched_events_dec(void) > > +{ > > + jump_label_dec(&perf_sched_events_in); > > + JUMP_LABEL(&perf_sched_events_in, no_sync); > > + synchronize_sched(); > > +no_sync: > > + jump_label_dec(&perf_sched_events_out); > > +} > > Nice! I didn't realize we can simply use JUMP_LABEL() directly and then > the code doesn't depend on HAVE_JUMP_LABEL. Yeah, avoids having to sprinkle tons of #ifdef goo around ;-) Anyway how about we do the partial revert below that should get us back to a working kernel and is a nice small patch to send -stable wards. After that we can try and be clever with clearing ->task_ctx from things like remove_from_context and the like. --- include/linux/perf_event.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 311b4dc..04d75a8 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1086,7 +1086,7 @@ void perf_event_task_sched_out(struct task_struct *task, struct task_struct *nex { perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); - COND_STMT(&perf_sched_events, __perf_event_task_sched_out(task, next)); + __perf_event_task_sched_out(task, next); } extern void perf_event_mmap(struct vm_area_struct *vma); ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 21:26 ` Peter Zijlstra @ 2011-03-30 21:35 ` Oleg Nesterov 2011-03-31 10:32 ` Jiri Olsa 2011-03-31 12:41 ` [tip:perf/urgent] perf: Fix task context scheduling tip-bot for Peter Zijlstra 2 siblings, 0 replies; 41+ messages in thread From: Oleg Nesterov @ 2011-03-30 21:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney On 03/30, Peter Zijlstra wrote: > > Anyway how about we do the partial revert below that should get us back > to a working kernel and is a nice small patch to send -stable wards. > > After that we can try and be clever with clearing ->task_ctx from things > like remove_from_context and the like. > > --- > include/linux/perf_event.h | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 311b4dc..04d75a8 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1086,7 +1086,7 @@ void perf_event_task_sched_out(struct task_struct *task, struct task_struct *nex > { > perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); > > - COND_STMT(&perf_sched_events, __perf_event_task_sched_out(task, next)); > + __perf_event_task_sched_out(task, next); > } I agree completely. We need the simple (and suitable for stable) fix first. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 21:26 ` Peter Zijlstra 2011-03-30 21:35 ` Oleg Nesterov @ 2011-03-31 10:32 ` Jiri Olsa 2011-03-31 12:41 ` [tip:perf/urgent] perf: Fix task context scheduling tip-bot for Peter Zijlstra 2 siblings, 0 replies; 41+ messages in thread From: Jiri Olsa @ 2011-03-31 10:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney On Wed, Mar 30, 2011 at 11:26:45PM +0200, Peter Zijlstra wrote: > On Wed, 2011-03-30 at 18:37 +0200, Oleg Nesterov wrote: > > > +static void perf_sched_events_dec(void) > > > +{ > > > + jump_label_dec(&perf_sched_events_in); > > > + JUMP_LABEL(&perf_sched_events_in, no_sync); > > > + synchronize_sched(); > > > +no_sync: > > > + jump_label_dec(&perf_sched_events_out); > > > +} > > > > Nice! I didn't realize we can simply use JUMP_LABEL() directly and then > > the code doesn't depend on HAVE_JUMP_LABEL. > > Yeah, avoids having to sprinkle tons of #ifdef goo around ;-) > > Anyway how about we do the partial revert below that should get us back > to a working kernel and is a nice small patch to send -stable wards. > > After that we can try and be clever with clearing ->task_ctx from things > like remove_from_context and the like. 
> > > --- > include/linux/perf_event.h | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 311b4dc..04d75a8 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -1086,7 +1086,7 @@ void perf_event_task_sched_out(struct task_struct *task, struct task_struct *nex > { > perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); > > - COND_STMT(&perf_sched_events, __perf_event_task_sched_out(task, next)); > + __perf_event_task_sched_out(task, next); > } > > extern void perf_event_mmap(struct vm_area_struct *vma); > works for me, passed my tests ;) jirka ^ permalink raw reply [flat|nested] 41+ messages in thread
* [tip:perf/urgent] perf: Fix task context scheduling 2011-03-30 21:26 ` Peter Zijlstra 2011-03-30 21:35 ` Oleg Nesterov 2011-03-31 10:32 ` Jiri Olsa @ 2011-03-31 12:41 ` tip-bot for Peter Zijlstra 2 siblings, 0 replies; 41+ messages in thread From: tip-bot for Peter Zijlstra @ 2011-03-31 12:41 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, a.p.zijlstra, jolsa, tglx, oleg, mingo Commit-ID: ab711fe08297de1485fff0a366e6db8828cafd6a Gitweb: http://git.kernel.org/tip/ab711fe08297de1485fff0a366e6db8828cafd6a Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Thu, 31 Mar 2011 10:29:26 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Thu, 31 Mar 2011 13:02:55 +0200 perf: Fix task context scheduling Jiri reported: | | - once an event is created by sys_perf_event_open, task context | is created and it stays even if the event is closed, until the | task is finished ... thats what I see in code and I assume it's | correct | | - when the task opens event, perf_sched_events jump label is | incremented and following callbacks are started from scheduler | | __perf_event_task_sched_in | __perf_event_task_sched_out | | These callback *in/out set/unset cpuctx->task_ctx value to the | task context. | | - close is called on event on CPU 0: | - the task is scheduled on CPU 0 | - __perf_event_task_sched_in is called | - cpuctx->task_ctx is set | - perf_sched_events jump label is decremented and == 0 | - __perf_event_task_sched_out is not called | - cpuctx->task_ctx on CPU 0 stays set | | - exit is called on CPU 1: | - the task is scheduled on CPU 1 | - perf_event_exit_task is called | - task_ctx_sched_out unsets cpuctx->task_ctx on CPU 1 | - put_ctx destroys the context | | - another call of perf_rotate_context on CPU 0 will use invalid | task_ctx pointer, and eventualy panic. | Cure this the simplest possibly way by partially reverting the jump_label optimization for the sched_out case. 
Reported-and-tested-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@kernel.org> # .37+ LKML-Reference: <1301520405.4859.213.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/perf_event.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 311b4dc..04d75a8 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1086,7 +1086,7 @@ void perf_event_task_sched_out(struct task_struct *task, struct task_struct *nex { perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0); - COND_STMT(&perf_sched_events, __perf_event_task_sched_out(task, next)); + __perf_event_task_sched_out(task, next); } extern void perf_event_mmap(struct vm_area_struct *vma); ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-30 14:51 ` Peter Zijlstra 2011-03-30 16:37 ` Oleg Nesterov @ 2011-03-31 13:28 ` Oleg Nesterov 2011-03-31 13:51 ` Peter Zijlstra 1 sibling, 1 reply; 41+ messages in thread From: Oleg Nesterov @ 2011-03-31 13:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney On 03/30, Peter Zijlstra wrote: > > -atomic_t perf_sched_events __read_mostly; > +atomic_t perf_sched_events_in __read_mostly; > +atomic_t perf_sched_events_out __read_mostly; > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); > > +static void perf_sched_events_inc(void) > +{ > + jump_label_inc(&perf_sched_events_out); > + jump_label_inc(&perf_sched_events_in); > +} > + > +static void perf_sched_events_dec(void) > +{ > + jump_label_dec(&perf_sched_events_in); > + JUMP_LABEL(&perf_sched_events_in, no_sync); > + synchronize_sched(); > +no_sync: > + jump_label_dec(&perf_sched_events_out); > +} OK, synchronize_sched() can't work. How about static int force_perf_event_task_sched_out(void *unused) { struct task_struct *curr = current; __perf_event_task_sched_out(curr, task_rq(curr)->idle); return 0; } void synchronize_perf_event_task_sched_out(void) { stop_machine(force_perf_event_task_sched_out, NULL, cpu_possible_mask); } instead? - stop_machine(cpu_possible_mask) ensures that each cpu does the context switch and calls _sched_out - force_perf_event_task_sched_out() is only needed because the migration thread can have the counters too. Note, I am not sure this is the best solution. Just in case we don't find something better. In any case, do you think this can work or I missed something again? Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value 2011-03-31 13:28 ` [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value Oleg Nesterov @ 2011-03-31 13:51 ` Peter Zijlstra 2011-03-31 14:10 ` Oleg Nesterov 2011-04-04 16:20 ` Oleg Nesterov 0 siblings, 2 replies; 41+ messages in thread From: Peter Zijlstra @ 2011-03-31 13:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney On Thu, 2011-03-31 at 15:28 +0200, Oleg Nesterov wrote: > On 03/30, Peter Zijlstra wrote: > > > > -atomic_t perf_sched_events __read_mostly; > > +atomic_t perf_sched_events_in __read_mostly; > > +atomic_t perf_sched_events_out __read_mostly; > > static DEFINE_PER_CPU(atomic_t, perf_cgroup_events); > > > > +static void perf_sched_events_inc(void) > > +{ > > + jump_label_inc(&perf_sched_events_out); > > + jump_label_inc(&perf_sched_events_in); > > +} > > + > > +static void perf_sched_events_dec(void) > > +{ > > + jump_label_dec(&perf_sched_events_in); > > + JUMP_LABEL(&perf_sched_events_in, no_sync); > > + synchronize_sched(); > > +no_sync: > > + jump_label_dec(&perf_sched_events_out); > > +} > > OK, synchronize_sched() can't work. How about > > static int force_perf_event_task_sched_out(void *unused) > { > struct task_struct *curr = current; > > __perf_event_task_sched_out(curr, task_rq(curr)->idle); I'd call task_ctx_sched_out() there, we can't actually use struct rq outside of sched.c and we know we'll not swap contexts with the idle thread. > > return 0; > } > > void synchronize_perf_event_task_sched_out(void) > { > stop_machine(force_perf_event_task_sched_out, NULL, > cpu_possible_mask); > } > > instead? > > - stop_machine(cpu_possible_mask) ensures that each cpu > does the context switch and calls _sched_out > > - force_perf_event_task_sched_out() is only needed because > the migration thread can have the counters too. > > Note, I am not sure this is the best solution. 
Just in case we don't > find something better. > > In any case, do you think this can work or I missed something again? No I think that should work fine, but I'd rather keep the perf_event_task_sched_out() call unoptimized as put in a stop_machine call for the !JUMP_LABEL case too (-rt disables jump-labels to avoid all the stop-machine noise) Anyway, I _think_ we can do better.. but then, I haven't fully gone through things yet.. compile tested only.. What the below does is rework the ctx locking, what we used to do was lock and unlock ctx->lock and flip between task and cpu contexts a lot. I changed that to hold both cpuctx->ctx.lock and cpuctx->task_ctx->lock, which made things a lot saner. I then changed ctx->is_active to be a bitmask of EVENT_PINNED| EVENT_FLEXIBLE to reflect which part is actually scheduled in, and changed lots of code (except the cgroup bits) to simply sched_out eveything (or as much as needed) and then sched_in everything in the proper order. Like said, I think this'll get us there, but I've stared at this code for too long again and might not be seeing things straight anymore (time to go work on something else for a bit). 
--- perf_event.c | 427 +++++++++++++++++++++++++++-------------------------------- 1 file changed, 197 insertions(+), 230 deletions(-) Index: linux-2.6/kernel/perf_event.c =================================================================== --- linux-2.6.orig/kernel/perf_event.c +++ linux-2.6/kernel/perf_event.c @@ -172,13 +172,6 @@ int perf_proc_update_handler(struct ctl_ static atomic64_t perf_event_id; -static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx, - enum event_type_t event_type); - -static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx, - enum event_type_t event_type, - struct task_struct *task); - static void update_context_time(struct perf_event_context *ctx); static u64 perf_event_time(struct perf_event *event); @@ -202,6 +195,13 @@ __get_cpu_context(struct perf_event_cont #ifdef CONFIG_CGROUP_PERF +static void cpuctx_sched_out(struct perf_cpu_context *cpuctx, + enum event_type_t event_type); + +static void cpuctx_sched_in(struct perf_cpu_context *cpuctx, + enum event_type_t event_type, + struct task_struct *task); + /* * Must ensure cgroup is pinned (css_get) before calling * this function. 
In other words, we cannot call this function @@ -355,7 +355,7 @@ void perf_cgroup_switch(struct task_stru if (cpuctx->ctx.nr_cgroups > 0) { if (mode & PERF_CGROUP_SWOUT) { - cpu_ctx_sched_out(cpuctx, EVENT_ALL); + cpuctx_sched_out(cpuctx, EVENT_ALL); /* * must not be done before ctxswout due * to event_filter_match() in event_sched_out() @@ -369,7 +369,7 @@ void perf_cgroup_switch(struct task_stru * have to pass task around */ cpuctx->cgrp = perf_cgroup_from_task(task); - cpu_ctx_sched_in(cpuctx, EVENT_ALL, task); + cpuctx_sched_in(cpuctx, EVENT_ALL, task); } } @@ -1112,6 +1112,10 @@ static int __perf_remove_from_context(vo raw_spin_lock(&ctx->lock); event_sched_out(event, cpuctx, ctx); list_del_event(event, ctx); + if (!ctx->nr_events && cpuctx->task_ctx == ctx) { + ctx->is_active = 0; + cpuctx->task_ctx = NULL; + } raw_spin_unlock(&ctx->lock); return 0; @@ -1461,135 +1465,6 @@ static void add_event_to_ctx(struct perf event->tstamp_stopped = tstamp; } -static void perf_event_context_sched_in(struct perf_event_context *ctx, - struct task_struct *tsk); - -/* - * Cross CPU call to install and enable a performance event - * - * Must be called with ctx->mutex held - */ -static int __perf_install_in_context(void *info) -{ - struct perf_event *event = info; - struct perf_event_context *ctx = event->ctx; - struct perf_event *leader = event->group_leader; - struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); - int err; - - /* - * In case we're installing a new context to an already running task, - * could also happen before perf_event_task_sched_in() on architectures - * which do context switches with IRQs enabled. - */ - if (ctx->task && !cpuctx->task_ctx) - perf_event_context_sched_in(ctx, ctx->task); - - raw_spin_lock(&ctx->lock); - ctx->is_active = 1; - update_context_time(ctx); - /* - * update cgrp time only if current cgrp - * matches event->cgrp. 
Must be done before - * calling add_event_to_ctx() - */ - update_cgrp_time_from_event(event); - - add_event_to_ctx(event, ctx); - - if (!event_filter_match(event)) - goto unlock; - - /* - * Don't put the event on if it is disabled or if - * it is in a group and the group isn't on. - */ - if (event->state != PERF_EVENT_STATE_INACTIVE || - (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE)) - goto unlock; - - /* - * An exclusive event can't go on if there are already active - * hardware events, and no hardware event can go on if there - * is already an exclusive event on. - */ - if (!group_can_go_on(event, cpuctx, 1)) - err = -EEXIST; - else - err = event_sched_in(event, cpuctx, ctx); - - if (err) { - /* - * This event couldn't go on. If it is in a group - * then we have to pull the whole group off. - * If the event group is pinned then put it in error state. - */ - if (leader != event) - group_sched_out(leader, cpuctx, ctx); - if (leader->attr.pinned) { - update_group_times(leader); - leader->state = PERF_EVENT_STATE_ERROR; - } - } - -unlock: - raw_spin_unlock(&ctx->lock); - - return 0; -} - -/* - * Attach a performance event to a context - * - * First we add the event to the list with the hardware enable bit - * in event->hw_config cleared. - * - * If the event is attached to a task which is on a CPU we use a smp - * call to enable it in the task context. The task might have been - * scheduled away, but we check this in the smp call again. - */ -static void -perf_install_in_context(struct perf_event_context *ctx, - struct perf_event *event, - int cpu) -{ - struct task_struct *task = ctx->task; - - lockdep_assert_held(&ctx->mutex); - - event->ctx = ctx; - - if (!task) { - /* - * Per cpu events are installed via an smp call and - * the install is always successful. 
-	 */
-	cpu_function_call(cpu, __perf_install_in_context, event);
-	return;
-	}
-
-retry:
-	if (!task_function_call(task, __perf_install_in_context, event))
-		return;
-
-	raw_spin_lock_irq(&ctx->lock);
-	/*
-	 * If we failed to find a running task, but find the context active now
-	 * that we've acquired the ctx->lock, retry.
-	 */
-	if (ctx->is_active) {
-		raw_spin_unlock_irq(&ctx->lock);
-		goto retry;
-	}
-
-	/*
-	 * Since the task isn't running, its safe to add the event, us holding
-	 * the ctx->lock ensures the task won't get scheduled in.
-	 */
-	add_event_to_ctx(event, ctx);
-	raw_spin_unlock_irq(&ctx->lock);
-}
-
 /*
  * Put a event into inactive state and update time fields.
  * Enabling the leader of a group effectively enables all
@@ -1765,30 +1640,51 @@ static void ctx_sched_out(struct perf_ev
 			  enum event_type_t event_type)
 {
 	struct perf_event *event;
+	int is_active = ctx->is_active;
 
-	raw_spin_lock(&ctx->lock);
-	perf_pmu_disable(ctx->pmu);
-	ctx->is_active = 0;
+	lockdep_assert_held(&ctx->lock);
+
+	ctx->is_active &= ~event_type;
 	if (likely(!ctx->nr_events))
-		goto out;
+		return;
+
 	update_context_time(ctx);
 	update_cgrp_time_from_cpuctx(cpuctx);
 	if (!ctx->nr_active)
-		goto out;
+		return;
 
-	if (event_type & EVENT_PINNED) {
+	perf_pmu_disable(ctx->pmu);
+	if ((event_type & EVENT_PINNED) && (is_active & EVENT_PINNED)) {
 		list_for_each_entry(event, &ctx->pinned_groups, group_entry)
 			group_sched_out(event, cpuctx, ctx);
 	}
 
-	if (event_type & EVENT_FLEXIBLE) {
+	if ((event_type & EVENT_FLEXIBLE) && (is_active & EVENT_FLEXIBLE)) {
 		list_for_each_entry(event, &ctx->flexible_groups, group_entry)
 			group_sched_out(event, cpuctx, ctx);
 	}
-out:
 	perf_pmu_enable(ctx->pmu);
-	raw_spin_unlock(&ctx->lock);
+}
+
+static void cpuctx_sched_out(struct perf_cpu_context *cpuctx,
+			     enum event_type_t event_type)
+{
+	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
+}
+
+static void task_ctx_sched_out(struct perf_event_context *ctx)
+{
+	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+
+	if (!cpuctx->task_ctx)
+		return;
+
+	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
+		return;
+
+	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
+	cpuctx->task_ctx = NULL;
 }
 
 /*
@@ -1936,8 +1832,9 @@ static void perf_event_context_sched_out
 	rcu_read_unlock();
 
 	if (do_switch) {
-		ctx_sched_out(ctx, cpuctx, EVENT_ALL);
-		cpuctx->task_ctx = NULL;
+		raw_spin_lock(&ctx->lock);
+		task_ctx_sched_out(ctx);
+		raw_spin_unlock(&ctx->lock);
 	}
 }
 
@@ -1972,30 +1869,6 @@ void __perf_event_task_sched_out(struct
 		perf_cgroup_sched_out(task);
 }
 
-static void task_ctx_sched_out(struct perf_event_context *ctx,
-			       enum event_type_t event_type)
-{
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-
-	if (!cpuctx->task_ctx)
-		return;
-
-	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
-		return;
-
-	ctx_sched_out(ctx, cpuctx, event_type);
-	cpuctx->task_ctx = NULL;
-}
-
-/*
- * Called with IRQs disabled
- */
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
-{
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
-}
-
 static void
 ctx_pinned_sched_in(struct perf_event_context *ctx,
 		    struct perf_cpu_context *cpuctx)
@@ -2062,11 +1935,13 @@ ctx_sched_in(struct perf_event_context *
 	     struct task_struct *task)
 {
 	u64 now;
+	int is_active = ctx->is_active;
 
-	raw_spin_lock(&ctx->lock);
-	ctx->is_active = 1;
+	lockdep_assert_held(&ctx->lock);
+
+	ctx->is_active |= event_type;
 	if (likely(!ctx->nr_events))
-		goto out;
+		return;
 
 	now = perf_clock();
 	ctx->timestamp = now;
@@ -2075,18 +1950,16 @@ ctx_sched_in(struct perf_event_context *
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
 	 */
-	if (event_type & EVENT_PINNED)
+	if ((event_type & EVENT_PINNED) && !(is_active & EVENT_PINNED))
 		ctx_pinned_sched_in(ctx, cpuctx);
 
 	/* Then walk through the lower prio flexible groups */
-	if (event_type & EVENT_FLEXIBLE)
+	if ((event_type & EVENT_FLEXIBLE) && !(is_active & EVENT_FLEXIBLE))
 		ctx_flexible_sched_in(ctx, cpuctx);
-
-out:
-	raw_spin_unlock(&ctx->lock);
 }
 
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
+#ifdef CONFIG_CGROUP_PERF
+static void cpuctx_sched_in(struct perf_cpu_context *cpuctx,
 			     enum event_type_t event_type,
 			     struct task_struct *task)
 {
@@ -2094,18 +1967,129 @@ static void cpu_ctx_sched_in(struct perf
 
 	ctx_sched_in(ctx, cpuctx, event_type, task);
 }
+#endif
 
-static void task_ctx_sched_in(struct perf_event_context *ctx,
-			      enum event_type_t event_type)
+static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
+				struct perf_event_context *ctx,
+				struct task_struct *task)
 {
-	struct perf_cpu_context *cpuctx;
+	struct perf_event_context *cctx = &cpuctx->ctx;
 
-	cpuctx = __get_cpu_context(ctx);
-	if (cpuctx->task_ctx == ctx)
+	/*
+	 * We want to keep the following order in events:
+	 *  CPU-pinned
+	 *  TASK-pinned
+	 *  CPU-flexible
+	 *  TASK-flexible
+	 */
+	ctx_sched_in(cctx, cpuctx, EVENT_PINNED, task);
+	if (ctx)
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, NULL);
+	ctx_sched_in(cctx, cpuctx, EVENT_FLEXIBLE, task);
+	if (ctx) {
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, NULL);
+		cpuctx->task_ctx = ctx;
+	}
+}
+
+static void perf_lock_ctx(struct perf_cpu_context *cpuctx,
+			  struct perf_event_context *ctx)
+{
+	raw_spin_lock(&cpuctx->ctx.lock);
+	if (ctx)
+		raw_spin_lock(&ctx->lock);
+}
+
+static void perf_unlock_ctx(struct perf_cpu_context *cpuctx,
+			    struct perf_event_context *ctx)
+{
+	if (ctx)
+		raw_spin_unlock(&ctx->lock);
+	raw_spin_unlock(&cpuctx->ctx.lock);
+}
+
+/*
+ * Cross CPU call to install and enable a performance event
+ *
+ * Must be called with ctx->mutex held
+ */
+static int __perf_install_in_context(void *info)
+{
+	struct perf_event *event = info;
+	struct perf_event_context *ctx = event->ctx;
+	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+
+	perf_lock_ctx(cpuctx, cpuctx->task_ctx);
+	perf_pmu_disable(cpuctx->ctx.pmu);
+	if (cpuctx->task_ctx)
+		ctx_sched_out(cpuctx->task_ctx, cpuctx, EVENT_ALL);
+
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx)
+			raw_spin_unlock(&cpuctx->task_ctx->lock);
+		cpuctx->task_ctx = ctx;
+		raw_spin_lock(&ctx->lock);
+	}
+	cpuctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+
+	add_event_to_ctx(event, ctx);
+
+	perf_event_sched_in(cpuctx, cpuctx->task_ctx, current);
+	perf_unlock_ctx(cpuctx, cpuctx->task_ctx);
+
+	return 0;
+}
+
+/*
+ * Attach a performance event to a context
+ *
+ * First we add the event to the list with the hardware enable bit
+ * in event->hw_config cleared.
+ *
+ * If the event is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ */
+static void
+perf_install_in_context(struct perf_event_context *ctx,
+			struct perf_event *event,
+			int cpu)
+{
+	struct task_struct *task = ctx->task;
+
+	lockdep_assert_held(&ctx->mutex);
+
+	event->ctx = ctx;
+
+	if (!task) {
+		/*
+		 * Per cpu events are installed via an smp call and
+		 * the install is always successful.
+		 */
+		cpu_function_call(cpu, __perf_install_in_context, event);
+		return;
+	}
+
+retry:
+	if (!task_function_call(task, __perf_install_in_context, event))
 		return;
 
-	ctx_sched_in(ctx, cpuctx, event_type, NULL);
-	cpuctx->task_ctx = ctx;
+	raw_spin_lock_irq(&ctx->lock);
+	/*
+	 * If we failed to find a running task, but find the context active now
+	 * that we've acquired the ctx->lock, retry.
+	 */
+	if (ctx->is_active) {
+		raw_spin_unlock_irq(&ctx->lock);
+		goto retry;
+	}
+
+	/*
+	 * Since the task isn't running, its safe to add the event, us holding
+	 * the ctx->lock ensures the task won't get scheduled in.
+	 */
+	add_event_to_ctx(event, ctx);
+	raw_spin_unlock_irq(&ctx->lock);
 }
 
 static void perf_event_context_sched_in(struct perf_event_context *ctx,
@@ -2117,26 +2101,18 @@ static void perf_event_context_sched_in(
 	if (cpuctx->task_ctx == ctx)
 		return;
 
+	perf_lock_ctx(cpuctx, ctx);
 	perf_pmu_disable(ctx->pmu);
-	/*
-	 * We want to keep the following priority order:
-	 * cpu pinned (that don't need to move), task pinned,
-	 * cpu flexible, task flexible.
-	 */
-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-
-	ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
-	ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
-
-	cpuctx->task_ctx = ctx;
+	cpuctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	perf_event_sched_in(cpuctx, ctx, task);
+	perf_pmu_enable(ctx->pmu);
 
 	/*
 	 * Since these rotations are per-cpu, we need to ensure the
 	 * cpu-context we got scheduled on is actually rotating.
 	 */
 	perf_pmu_rotate_start(ctx->pmu);
-	perf_pmu_enable(ctx->pmu);
+	perf_unlock_ctx(cpuctx, ctx);
 }
 
 /*
@@ -2276,7 +2252,6 @@ static void perf_ctx_adjust_freq(struct
 	u64 interrupts, now;
 	s64 delta;
 
-	raw_spin_lock(&ctx->lock);
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
@@ -2308,7 +2283,6 @@ static void perf_ctx_adjust_freq(struct
 		if (delta > 0)
 			perf_adjust_period(event, period, delta);
 	}
-	raw_spin_unlock(&ctx->lock);
 }
 
 /*
@@ -2316,16 +2290,12 @@ static void perf_ctx_adjust_freq(struct
  */
 static void rotate_ctx(struct perf_event_context *ctx)
 {
-	raw_spin_lock(&ctx->lock);
-
 	/*
	 * Rotate the first entry last of non-pinned groups. Rotation might be
	 * disabled by the inheritance code.
	 */
 	if (!ctx->rotate_disable)
 		list_rotate_left(&ctx->flexible_groups);
-
-	raw_spin_unlock(&ctx->lock);
 }
 
 /*
@@ -2352,6 +2322,7 @@ static void perf_rotate_context(struct p
 		rotate = 1;
 	}
 
+	perf_lock_ctx(cpuctx, ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
 	perf_ctx_adjust_freq(&cpuctx->ctx, interval);
 	if (ctx)
@@ -2360,23 +2331,22 @@ static void perf_rotate_context(struct p
 	if (!rotate)
 		goto done;
 
-	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	cpuctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 	if (ctx)
-		task_ctx_sched_out(ctx, EVENT_FLEXIBLE);
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
 
 	rotate_ctx(&cpuctx->ctx);
 	if (ctx)
 		rotate_ctx(ctx);
 
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, current);
-	if (ctx)
-		task_ctx_sched_in(ctx, EVENT_FLEXIBLE);
+	perf_event_sched_in(cpuctx, ctx, current);
 
 done:
 	if (remove)
 		list_del_init(&cpuctx->rotation_list);
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
+	perf_unlock_ctx(cpuctx, ctx);
 }
 
 void perf_event_task_tick(void)
@@ -2423,10 +2393,10 @@ static void perf_event_enable_on_exec(st
 	if (!ctx || !ctx->nr_events)
 		goto out;
 
-	task_ctx_sched_out(ctx, EVENT_ALL);
-
 	raw_spin_lock(&ctx->lock);
+	task_ctx_sched_out(ctx);
+
 	list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
 		ret = event_enable_on_exec(event, ctx);
 		if (ret)
@@ -2831,16 +2801,12 @@ find_get_context(struct pmu *pmu, struct
 		unclone_ctx(ctx);
 		++ctx->pin_count;
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
-	}
-
-	if (!ctx) {
+	} else {
 		ctx = alloc_perf_context(pmu, task);
 		err = -ENOMEM;
 		if (!ctx)
 			goto errout;
-		get_ctx(ctx);
-
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
 		/*
@@ -2852,14 +2818,14 @@ find_get_context(struct pmu *pmu, struct
 		else if (task->perf_event_ctxp[ctxn])
 			err = -EAGAIN;
 		else {
+			get_ctx(ctx);
 			++ctx->pin_count;
 			rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
 		}
 		mutex_unlock(&task->perf_event_mutex);
 
 		if (unlikely(err)) {
-			put_task_struct(task);
-			kfree(ctx);
+			put_ctx(ctx);
 
 			if (err == -EAGAIN)
 				goto retry;
@@ -2930,12 +2896,6 @@ int perf_event_release_kernel(struct per
 {
 	struct perf_event_context *ctx = event->ctx;
 
-	/*
-	 * Remove from the PMU, can't get re-enabled since we got
-	 * here because the last ref went.
-	 */
-	perf_event_disable(event);
-
 	WARN_ON_ONCE(ctx->parent_ctx);
 	/*
	 * There are two ways this annotation is useful:
@@ -2952,8 +2912,8 @@ int perf_event_release_kernel(struct per
 	mutex_lock_nested(&ctx->mutex, SINGLE_DEPTH_NESTING);
 	raw_spin_lock_irq(&ctx->lock);
 	perf_group_detach(event);
-	list_del_event(event, ctx);
 	raw_spin_unlock_irq(&ctx->lock);
+	perf_remove_from_context(event);
 	mutex_unlock(&ctx->mutex);
 
 	free_event(event);
@@ -5982,6 +5942,7 @@ static int pmu_dev_alloc(struct pmu *pmu
 }
 
 static struct lock_class_key cpuctx_mutex;
+static struct lock_class_key cpuctx_lock;
 
 int perf_pmu_register(struct pmu *pmu, char *name, int type)
 {
@@ -6032,6 +5993,7 @@ int perf_pmu_register(struct pmu *pmu, c
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
 		__perf_event_init_context(&cpuctx->ctx);
 		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
+		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
 		cpuctx->ctx.type = cpu_context;
 		cpuctx->ctx.pmu = pmu;
 		cpuctx->jiffies_interval = 1;
@@ -6531,6 +6493,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_alloc;
 	}
 
+	if (task) {
+		put_task_struct(task);
+		task = NULL;
+	}
+
 	/*
	 * Look up the group leader (we will attach this event to it):
	 */
@@ -6771,7 +6738,6 @@ static void perf_event_exit_task_context
 	 * our context.
	 */
 	child_ctx = rcu_dereference_raw(child->perf_event_ctxp[ctxn]);
-	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*
	 * Take the context lock here so that if find_get_context is
@@ -6779,6 +6745,7 @@ static void perf_event_exit_task_context
 	 * incremented the context's refcount before we do put_ctx below.
	 */
 	raw_spin_lock(&child_ctx->lock);
+	task_ctx_sched_out(child_ctx);
 	child->perf_event_ctxp[ctxn] = NULL;
 	/*
	 * If this context is a clone; unclone it so it can't get

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-31 13:51 ` Peter Zijlstra
@ 2011-03-31 14:10   ` Oleg Nesterov
  2011-04-04 16:20   ` Oleg Nesterov
  1 sibling, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2011-03-31 14:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney

On 03/31, Peter Zijlstra wrote:
>
> On Thu, 2011-03-31 at 15:28 +0200, Oleg Nesterov wrote:
> >
> > OK, synchronize_sched() can't work. How about
> >
> > 	static int force_perf_event_task_sched_out(void *unused)
> > 	{
> > 		struct task_struct *curr = current;
> >
> > 		__perf_event_task_sched_out(curr, task_rq(curr)->idle);
>
> I'd call task_ctx_sched_out() there,

Perhaps... but then we miss perf_cgroup_sched_out().

Anyway, I think it is better to not optimize this code. If
__perf_event_task_sched_in() was called, it should be paired with
__perf_event_task_sched_out().

> we can't actually use struct rq
> outside of sched.c

Yes, this should live in sched.c

> and we know we'll not swap contexts with the idle
> thread.

Yes, that is why it uses next = rq->idle.

> Anyway, I _think_ we can do better..

Yes, agreed. As I said, I don't like this too much.

Oleg.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-31 13:51 ` Peter Zijlstra
  2011-03-31 14:10   ` Oleg Nesterov
@ 2011-04-04 16:20   ` Oleg Nesterov
  1 sibling, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2011-04-04 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel, Paul E. McKenney

On 03/31, Peter Zijlstra wrote:
>
> Anyway, I _think_ we can do better.. but then, I haven't fully gone
> through things yet.. compile tested only..

I'll try to read (and hopefully even understand ;) this patch carefully,
but at first glance this should work.

Personally, as a _reader_ of this code, I like the fact that
perf_event_release_kernel() does perf_remove_from_context() instead of
_disable + del_event. IMHO, this makes the things more understandable.

Oleg.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-30 13:09 ` Jiri Olsa
  2011-03-30 14:51   ` Peter Zijlstra
@ 2011-03-30 15:32   ` Oleg Nesterov
  2011-03-30 15:40     ` Peter Zijlstra
  2011-03-30 16:11     ` Peter Zijlstra
  1 sibling, 2 replies; 41+ messages in thread
From: Oleg Nesterov @ 2011-03-30 15:32 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: Peter Zijlstra, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/30, Jiri Olsa wrote:
>
> On Mon, Mar 28, 2011 at 06:56:48PM +0200, Oleg Nesterov wrote:
> >
> > As for HAVE_JUMP_LABEL, I still can't understand why this test-case
> > triggers the problem. But jump_label_inc/dec logic looks obviously
> > racy.
>
> I think the reason is, that there are actually 2 places that
> need to be updated by jump label code
>   - perf_event_task_sched_in
>   - perf_event_task_sched_out
>
> As for my image it first patches perf_event_task_sched_out and then
> perf_event_task_sched_in

Indeed! Thanks a lot Jiri, now I understand.

jump_label_update() doesn't change all entries in a batch. So, as you
explained, it is quite possible that jump_label_update() disables
perf_event_task_sched_out() first. When the caller calls
arch_jump_label_transform() again to disable the next entry, it has
task_ctx != NULL which can't be cleared.

> IIUIC that calling synchronize_sched in perf_sched_events_dec
> ensures that the final schedule back to the release code will not
> call perf_event_task_sched_in, so task_ctx stays sane (NULL).

Yes. IOW, this guarantees that perf_event_task_sched_in() is always
paired with perf_event_task_sched_out().

> The cool thing is, that it also fixes the original issue,
> which was: wrong counters for perf builtin test #3..
> which I started with from the very beginning :)

Great ;)

> please let me know what you think, thanks

Yes, this is what I meant.

I'd like to look at this patch again later, perhaps we can move some
code from #ifdef/#else... But I have a headache today, just can't
understand all these ifdef's.

> +#ifdef HAVE_JUMP_LABEL
> +static inline
> +void perf_sched_events_inc(void)
> +{
> +	jump_label_inc(&perf_sched_events_out);
> +	smp_mb__after_atomic_inc();
> +	jump_label_inc(&perf_sched_events_in);
> +}
> +
> +static inline
> +void perf_sched_events_dec(void)
> +{
> +	if (atomic_dec_and_test(&perf_sched_events_in)) {
> +		jump_label_disable(&perf_sched_events_in);
> +		synchronize_sched();
> +	}
> +
> +	jump_label_dec(&perf_sched_events_out);

probably smp_mb__after_atomic_inc() needs a comment...

It is needed to avoid the race between perf_sched_events_dec() and
perf_sched_events_inc().

Suppose that we have a single event, both counters == 1. We create
another event and call perf_sched_events_inc(). Without the barrier
we could increment the counters in reverse order,

	jump_label_inc(&perf_sched_events_in);
	/* ---- WINDOW ---- */
	jump_label_inc(&perf_sched_events_out);

Now, if perf_sched_events_dec() is called in between, it can disable
_out but not _in. This means we can leak ->task_ctx again.

---------------------------------------------------------------------
HOWEVER. While I think synchronize_sched() should work in practice,
in theory (according to the documentation) it can't always help in this
case. I'll write another email tomorrow, can't think properly today.
Anyway this looks solvable, but perhaps we need stop_cpus() instead.

Oleg.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-30 15:32 ` Oleg Nesterov
@ 2011-03-30 15:40   ` Peter Zijlstra
  2011-03-30 15:52     ` Oleg Nesterov
  2011-03-30 16:11   ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2011-03-30 15:40 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Wed, 2011-03-30 at 17:32 +0200, Oleg Nesterov wrote:
> > +#ifdef HAVE_JUMP_LABEL
> > +static inline
> > +void perf_sched_events_inc(void)
> > +{
> > +	jump_label_inc(&perf_sched_events_out);
> > +	smp_mb__after_atomic_inc();
> > +	jump_label_inc(&perf_sched_events_in);
> > +}
> > +
> > +static inline
> > +void perf_sched_events_dec(void)
> > +{
> > +	if (atomic_dec_and_test(&perf_sched_events_in)) {
> > +		jump_label_disable(&perf_sched_events_in);
> > +		synchronize_sched();
> > +	}
> > +
> > +	jump_label_dec(&perf_sched_events_out);
>
> probably smp_mb__after_atomic_inc() needs a comment...

You don't need it, jump_label_inc() uses atomic_add_return() which
implies a full mb.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-30 15:40 ` Peter Zijlstra
@ 2011-03-30 15:52   ` Oleg Nesterov
  2011-03-30 15:57     ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Oleg Nesterov @ 2011-03-30 15:52 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/30, Peter Zijlstra wrote:
>
> On Wed, 2011-03-30 at 17:32 +0200, Oleg Nesterov wrote:
> > > +#ifdef HAVE_JUMP_LABEL
> > > +static inline
> > > +void perf_sched_events_inc(void)
> > > +{
> > > +	jump_label_inc(&perf_sched_events_out);
> > > +	smp_mb__after_atomic_inc();
> > > +	jump_label_inc(&perf_sched_events_in);
> > > +}
> >
> > probably smp_mb__after_atomic_inc() needs a comment...
>
> You don't need it, jump_label_inc() uses atomic_add_return() which
> implies a full mb.

Yes, but only if HAVE_JUMP_LABEL.

Oleg.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-30 15:52 ` Oleg Nesterov
@ 2011-03-30 15:57   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2011-03-30 15:57 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Wed, 2011-03-30 at 17:52 +0200, Oleg Nesterov wrote:
> On 03/30, Peter Zijlstra wrote:
> >
> > On Wed, 2011-03-30 at 17:32 +0200, Oleg Nesterov wrote:
> > > > +#ifdef HAVE_JUMP_LABEL
> > > > +static inline
> > > > +void perf_sched_events_inc(void)
> > > > +{
> > > > +	jump_label_inc(&perf_sched_events_out);
> > > > +	smp_mb__after_atomic_inc();
> > > > +	jump_label_inc(&perf_sched_events_in);
> > > > +}
> > >
> > > probably smp_mb__after_atomic_inc() needs a comment...
> >
> > You don't need it, jump_label_inc() uses atomic_add_return() which
> > implies a full mb.
>
> Yes, but only if HAVE_JUMP_LABEL.

argh, indeed.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-30 15:32 ` Oleg Nesterov
  2011-03-30 15:40   ` Peter Zijlstra
@ 2011-03-30 16:11   ` Peter Zijlstra
  2011-03-30 17:13     ` Oleg Nesterov
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2011-03-30 16:11 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Wed, 2011-03-30 at 17:32 +0200, Oleg Nesterov wrote:
> probably smp_mb__after_atomic_inc() needs a comment...
>
> It is needed to avoid the race between perf_sched_events_dec() and
> perf_sched_events_inc().
>
> Suppose that we have a single event, both counters == 1. We create
> another event and call perf_sched_events_inc(). Without the barrier
> we could increment the counters in reverse order,
>
> 	jump_label_inc(&perf_sched_events_in);
> 	/* ---- WINDOW ---- */
> 	jump_label_inc(&perf_sched_events_out);
>
> Now, if perf_sched_events_dec() is called in between, it can disable
> _out but not _in. This means we can leak ->task_ctx again.

But in that case we need an mb in perf_sched_events_dec() too, because
for the !JUMP_LABEL case that's a simple atomic_dec() and combined with
synchronize_sched() being a nop for num_online_cpus()==1 there's no
ordering there either.

Also, wouldn't this then require an smp_rmb() in the
perf_event_task_sched_{in,out} COND_STMT/JUMP_LABEL read side?
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-30 16:11 ` Peter Zijlstra
@ 2011-03-30 17:13   ` Oleg Nesterov
  0 siblings, 0 replies; 41+ messages in thread
From: Oleg Nesterov @ 2011-03-30 17:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On 03/30, Peter Zijlstra wrote:
>
> On Wed, 2011-03-30 at 17:32 +0200, Oleg Nesterov wrote:
> > probably smp_mb__after_atomic_inc() needs a comment...
> >
> > It is needed to avoid the race between perf_sched_events_dec() and
> > perf_sched_events_inc().
> >
> > Suppose that we have a single event, both counters == 1. We create
> > another event and call perf_sched_events_inc(). Without the barrier
> > we could increment the counters in reverse order,
> >
> > 	jump_label_inc(&perf_sched_events_in);
> > 	/* ---- WINDOW ---- */
> > 	jump_label_inc(&perf_sched_events_out);
> >
> > Now, if perf_sched_events_dec() is called in between, it can disable
> > _out but not _in. This means we can leak ->task_ctx again.
>
> But in that case we need an mb in perf_sched_events_dec() too, because
> for the !JUMP_LABEL case that's a simple atomic_dec() and combined with
> synchronize_sched() being a nop for num_online_cpus()==1 there's no
> ordering there either.

I think you are right... afaics we only need barrier() in this case.

> Also, wouldn't this then require an smp_rmb() in the
> perf_event_task_sched_{in,out} COND_STMT/JUMP_LABEL read side?

Oh, I don't think so, but can't prove. We don't need it in UP case.
And if synchronize_sched() worked (see another email), it should ensure
that perf_sched_events_in == 0 must be visible after it completes.

Oleg.
* Re: [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value
  2011-03-26 16:38 ` Peter Zijlstra
  2011-03-26 17:09   ` Oleg Nesterov
@ 2011-03-26 17:09   ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2011-03-26 17:09 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Jiri Olsa, Paul Mackerras, Ingo Molnar, linux-kernel

On Sat, 2011-03-26 at 17:38 +0100, Peter Zijlstra wrote:
> let me try and get that test-case running someplace.

OK, so I kicked the thing into shape and have it running for several
minutes now but no fireworks yet.. will let it run longer, but it
doesn't look to readily reproduce on that machine.
end of thread, other threads: [~2011-04-04 16:21 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-24 16:44 [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value Jiri Olsa
2011-03-25 19:10 ` Oleg Nesterov
2011-03-26 15:37 ` Peter Zijlstra
2011-03-26 16:13   ` Oleg Nesterov
2011-03-26 16:38     ` Peter Zijlstra
2011-03-26 17:09       ` Oleg Nesterov
2011-03-26 17:35         ` Oleg Nesterov
2011-03-26 18:29           ` Peter Zijlstra
2011-03-26 18:49             ` Oleg Nesterov
2011-03-28 13:30               ` Oleg Nesterov
2011-03-28 14:57                 ` Peter Zijlstra
2011-03-28 15:00                   ` Peter Zijlstra
2011-03-28 15:15                     ` Oleg Nesterov
2011-03-28 16:27                       ` Peter Zijlstra
2011-03-28 15:39                   ` Oleg Nesterov
2011-03-28 15:49                     ` Peter Zijlstra
2011-03-28 16:56                       ` Oleg Nesterov
2011-03-29  8:32                         ` Peter Zijlstra
2011-03-29 10:49                           ` Peter Zijlstra
2011-03-29 16:28                             ` Oleg Nesterov
2011-03-29 19:01                               ` Peter Zijlstra
2011-03-30 13:09                                 ` Jiri Olsa
2011-03-30 14:51                                   ` Peter Zijlstra
2011-03-30 16:37                                     ` Oleg Nesterov
2011-03-30 18:30                                       ` Paul E. McKenney
2011-03-30 19:53                                         ` Oleg Nesterov
2011-03-30 21:26                                           ` Peter Zijlstra
2011-03-30 21:35                                             ` Oleg Nesterov
2011-03-31 10:32                                               ` Jiri Olsa
2011-03-31 12:41                                                 ` [tip:perf/urgent] perf: Fix task context scheduling tip-bot for Peter Zijlstra
2011-03-31 13:28                                                 ` [PATCH,RFC] perf: panic due to inclied cpu context task_ctx value Oleg Nesterov
2011-03-31 13:51                                                   ` Peter Zijlstra
2011-03-31 14:10                                                     ` Oleg Nesterov
2011-04-04 16:20                                                     ` Oleg Nesterov
2011-03-30 15:32                                   ` Oleg Nesterov
2011-03-30 15:40                                     ` Peter Zijlstra
2011-03-30 15:52                                       ` Oleg Nesterov
2011-03-30 15:57                                         ` Peter Zijlstra
2011-03-30 16:11                                     ` Peter Zijlstra
2011-03-30 17:13                                       ` Oleg Nesterov
2011-03-26 17:09   ` Peter Zijlstra