* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Namhyung Kim @ 2025-12-22 23:36 UTC
To: Peter Zijlstra, Ingo Molnar
Cc: Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark,
	linux-perf-users, linux-kernel

Added a subject prefix and CC LKML.

Thanks,
Namhyung

On Mon, Dec 22, 2025 at 03:34:23PM -0800, Namhyung Kim wrote:
> Hello,
>
> I got a report that a task is stuck in perf_event_exit_task() waiting
> for global_ctx_data_rwsem. On large systems, it'd have performance
> issues when it grabs the lock to iterate all threads in the system to
> allocate the context data. And it'd block task exit path which is
> problematic especially under memory pressure.
>
>   perf_event_open
>     perf_event_alloc
>       attach_perf_ctx_data
>         attach_global_ctx_data
>           percpu_down_write (global_ctx_data_rwsem)
>             for_each_process_thread
>               alloc_task_ctx_data
>                                                do_exit
>                                                  perf_event_exit_task
>                                                    percpu_down_read (global_ctx_data_rwsem)
>
> I think attach_global_ctx_data() should skip tasks with PF_EXITING and
> it'd be nice if perf_event_exit_task() could release the ctx_data
> unconditionally. But I'm not sure how to synchronize them properly.
>
> Any thoughts?
>
> Thanks,
> Namhyung

* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Namhyung Kim @ 2026-01-06 22:34 UTC
To: Peter Zijlstra, Ingo Molnar
Cc: Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Ian Rogers, Adrian Hunter, James Clark,
	linux-perf-users, linux-kernel

Hello,

On Mon, Dec 22, 2025 at 03:36:53PM -0800, Namhyung Kim wrote:
> On Mon, Dec 22, 2025 at 03:34:23PM -0800, Namhyung Kim wrote:
> > Hello,
> >
> > I got a report that a task is stuck in perf_event_exit_task() waiting
> > for global_ctx_data_rwsem. On large systems, it'd have performance
> > issues when it grabs the lock to iterate all threads in the system to
> > allocate the context data. And it'd block task exit path which is
> > problematic especially under memory pressure.
> >
> >   perf_event_open
> >     perf_event_alloc
> >       attach_perf_ctx_data
> >         attach_global_ctx_data
> >           percpu_down_write (global_ctx_data_rwsem)
> >             for_each_process_thread
> >               alloc_task_ctx_data
> >                                                do_exit
> >                                                  perf_event_exit_task
> >                                                    percpu_down_read (global_ctx_data_rwsem)
> >
> > I think attach_global_ctx_data() should skip tasks with PF_EXITING and
> > it'd be nice if perf_event_exit_task() could release the ctx_data
> > unconditionally. But I'm not sure how to synchronize them properly.
> >
> > Any thoughts?

I'm curious if this makes any sense.. I feel like it needs to check the
flag again before allocation.

Thanks,
Namhyung


diff --git a/kernel/events/core.c b/kernel/events/core.c
index 376fb07d869b8b50..2a8847e95d7eb698 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5469,6 +5469,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
 	/* Allocate everything */
 	scoped_guard (rcu) {
 		for_each_process_thread(g, p) {
+			if (p->flags & PF_EXITING)
+				continue;
 			cd = rcu_dereference(p->perf_ctx_data);
 			if (cd && !cd->global) {
 				cd->global = 1;
@@ -14563,7 +14565,6 @@ void perf_event_exit_task(struct task_struct *task)

 	/*
 	 * Detach the perf_ctx_data for the system-wide event.
 	 */
-	guard(percpu_read)(&global_ctx_data_rwsem);
 	detach_task_ctx_data(task);
 }

* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Peter Zijlstra @ 2026-01-07 9:16 UTC
To: Namhyung Kim
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, linux-perf-users, linux-kernel

On Tue, Jan 06, 2026 at 02:34:40PM -0800, Namhyung Kim wrote:
> Hello,
>
> On Mon, Dec 22, 2025 at 03:36:53PM -0800, Namhyung Kim wrote:
> > On Mon, Dec 22, 2025 at 03:34:23PM -0800, Namhyung Kim wrote:
> > > Hello,
> > >
> > > I got a report that a task is stuck in perf_event_exit_task() waiting
> > > for global_ctx_data_rwsem. On large systems, it'd have performance
> > > issues when it grabs the lock to iterate all threads in the system to
> > > allocate the context data. And it'd block task exit path which is
> > > problematic especially under memory pressure.
> > >
> > >   perf_event_open
> > >     perf_event_alloc
> > >       attach_perf_ctx_data
> > >         attach_global_ctx_data
> > >           percpu_down_write (global_ctx_data_rwsem)
> > >             for_each_process_thread
> > >               alloc_task_ctx_data
> > >                                                do_exit
> > >                                                  perf_event_exit_task
> > >                                                    percpu_down_read (global_ctx_data_rwsem)
> > >
> > > I think attach_global_ctx_data() should skip tasks with PF_EXITING and
> > > it'd be nice if perf_event_exit_task() could release the ctx_data
> > > unconditionally. But I'm not sure how to synchronize them properly.
> > >
> > > Any thoughts?
>
> I'm curious if this makes any sense.. I feel like it needs to check the
> flag again before allocation.
>
> Thanks,
> Namhyung
>
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 376fb07d869b8b50..2a8847e95d7eb698 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5469,6 +5469,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
>  	/* Allocate everything */
>  	scoped_guard (rcu) {
>  		for_each_process_thread(g, p) {
> +			if (p->flags & PF_EXITING)
> +				continue;
>  			cd = rcu_dereference(p->perf_ctx_data);
>  			if (cd && !cd->global) {
>  				cd->global = 1;

I suppose this makes sense.

> @@ -14563,7 +14565,6 @@ void perf_event_exit_task(struct task_struct *task)
>
>  	/*
>  	 * Detach the perf_ctx_data for the system-wide event.
>  	 */
> -	guard(percpu_read)(&global_ctx_data_rwsem);
>  	detach_task_ctx_data(task);
>  }

This would need a comment; something like:

	/*
	 * This can be done without holding global_ctx_data_rwsem
	 * because this is done after setting PF_EXITING such that
	 * attach_global_ctx_data() will skip over this task.
	 */
	WARN_ON_ONCE(!(task->flags & PF_EXITING))

But yes, I suppose this can do. The question is however, how do you get
into this predicament to begin with? Are you creating and destroying a
lot of global LBR events or something?

Would it make sense to delay detach_global_ctx_data() for a second or
so? That is, what is your event creation pattern?

* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Namhyung Kim @ 2026-01-07 19:01 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, linux-perf-users, linux-kernel

On Wed, Jan 07, 2026 at 10:16:52AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 06, 2026 at 02:34:40PM -0800, Namhyung Kim wrote:
> > Hello,
> >
> > On Mon, Dec 22, 2025 at 03:36:53PM -0800, Namhyung Kim wrote:
> > > On Mon, Dec 22, 2025 at 03:34:23PM -0800, Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > I got a report that a task is stuck in perf_event_exit_task() waiting
> > > > for global_ctx_data_rwsem. On large systems, it'd have performance
> > > > issues when it grabs the lock to iterate all threads in the system to
> > > > allocate the context data. And it'd block task exit path which is
> > > > problematic especially under memory pressure.
> > > >
> > > >   perf_event_open
> > > >     perf_event_alloc
> > > >       attach_perf_ctx_data
> > > >         attach_global_ctx_data
> > > >           percpu_down_write (global_ctx_data_rwsem)
> > > >             for_each_process_thread
> > > >               alloc_task_ctx_data
> > > >                                                do_exit
> > > >                                                  perf_event_exit_task
> > > >                                                    percpu_down_read (global_ctx_data_rwsem)
> > > >
> > > > I think attach_global_ctx_data() should skip tasks with PF_EXITING and
> > > > it'd be nice if perf_event_exit_task() could release the ctx_data
> > > > unconditionally. But I'm not sure how to synchronize them properly.
> > > >
> > > > Any thoughts?
> >
> > I'm curious if this makes any sense.. I feel like it needs to check the
> > flag again before allocation.
> >
> > Thanks,
> > Namhyung
> >
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 376fb07d869b8b50..2a8847e95d7eb698 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -5469,6 +5469,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
> >  	/* Allocate everything */
> >  	scoped_guard (rcu) {
> >  		for_each_process_thread(g, p) {
> > +			if (p->flags & PF_EXITING)
> > +				continue;
> >  			cd = rcu_dereference(p->perf_ctx_data);
> >  			if (cd && !cd->global) {
> >  				cd->global = 1;
>
> I suppose this makes sense.
>
> > @@ -14563,7 +14565,6 @@ void perf_event_exit_task(struct task_struct *task)
> >
> >  	/*
> >  	 * Detach the perf_ctx_data for the system-wide event.
> >  	 */
> > -	guard(percpu_read)(&global_ctx_data_rwsem);
> >  	detach_task_ctx_data(task);
> >  }
>
> This would need a comment; something like:
>
> 	/*
> 	 * This can be done without holding global_ctx_data_rwsem
> 	 * because this is done after setting PF_EXITING such that
> 	 * attach_global_ctx_data() will skip over this task.
> 	 */
> 	WARN_ON_ONCE(!(task->flags & PF_EXITING))
>
> But yes, I suppose this can do. The question is however, how do you get
> into this predicament to begin with? Are you creating and destroying a
> lot of global LBR events or something?

I think it's just because there are too many tasks in the system like
O(100K). And any thread going to exit needs to wait for
attach_global_ctx_data() to finish the iteration over every task.

>
> Would it make sense to delay detach_global_ctx_data() for a second or
> so? That is, what is your event creation pattern?

I don't think it has a special pattern, but I'm curious how we can
handle a race like below.

  attach_global_ctx_data
    check p->flags & PF_EXITING
                                       do_exit
    (preemption)                         set PF_EXITING
                                         detach_task_ctx_data()
    check p->perf_ctx_data
    attach_task_ctx_data()  ---> memory leak

Thanks,
Namhyung

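The interleaving described in the message above can be replayed deterministically in user space. What follows is only a sketch under stated assumptions, not the kernel code: `struct task`, `task_exit()`, `attach_check()`, `attach_install()` and `replay_leak()` are hypothetical stand-ins, C11 atomics take the place of the kernel's try_cmpxchg(), and plain free() takes the place of perf_free_ctx_data_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define PF_EXITING 0x4	/* stand-in bit for the kernel flag (assumption) */

struct ctx_data { int global; };

struct task {
	atomic_uint flags;
	_Atomic(struct ctx_data *) ctx;
};

/* Exit side: mark the task as exiting, then drop whatever ctx is attached. */
static void task_exit(struct task *t)
{
	atomic_fetch_or(&t->flags, PF_EXITING);
	free(atomic_exchange(&t->ctx, NULL));
}

/* Attach side, step 1: the PF_EXITING check done while iterating tasks. */
static bool attach_check(struct task *t)
{
	return !(atomic_load(&t->flags) & PF_EXITING);
}

/* Attach side, step 2: allocate and install, with no second flag check. */
static void attach_install(struct task *t)
{
	struct ctx_data *cd = calloc(1, sizeof(*cd));
	struct ctx_data *old = NULL;

	assert(cd != NULL);
	if (!atomic_compare_exchange_strong(&t->ctx, &old, cd))
		free(cd);	/* somebody else attached first */
}

/* Replays the interleaving; returns true if a ctx outlives the exit. */
static bool replay_leak(void)
{
	struct task t = { 0 };
	bool leaked;

	if (!attach_check(&t))	/* attach: PF_EXITING is not set yet */
		return false;
	task_exit(&t);		/* exit runs to completion in the gap */
	attach_install(&t);	/* attach: installs after the detach */

	leaked = atomic_load(&t.ctx) != NULL;
	free(atomic_load(&t.ctx));	/* keep the demo itself leak-free */
	return leaked;
}
```

In this replay the exit path finds nothing to detach, and the attach path then installs a ctx that nobody will release, which corresponds to the "memory leak" arrow in the diagram.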
* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Peter Zijlstra @ 2026-01-07 22:28 UTC
To: Namhyung Kim
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, linux-perf-users, linux-kernel

On Wed, Jan 07, 2026 at 11:01:53AM -0800, Namhyung Kim wrote:

> > But yes, I suppose this can do. The question is however, how do you get
> > into this predicament to begin with? Are you creating and destroying a
> > lot of global LBR events or something?
>
> I think it's just because there are too many tasks in the system like
> O(100K). And any thread going to exit needs to wait for
> attach_global_ctx_data() to finish the iteration over every task.

OMG, so many tasks ...

> > Would it make sense to delay detach_global_ctx_data() for a second or
> > so? That is, what is your event creation pattern?
>
> I don't think it has a special pattern, but I'm curious how we can
> handle a race like below.
>
>   attach_global_ctx_data
>     check p->flags & PF_EXITING
>                                        do_exit
>     (preemption)                         set PF_EXITING
>                                          detach_task_ctx_data()
>     check p->perf_ctx_data
>     attach_task_ctx_data()  ---> memory leak

Oh right. Something like so perhaps?

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3c2a491200c6..e5e716420eb3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5421,9 +5421,19 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
 		return -ENOMEM;

 	for (;;) {
-		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
+		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
 			if (old)
 				perf_free_ctx_data_rcu(old);
+			/*
+			 * try_cmpxchg() pairs with try_cmpxchg() from
+			 * detach_task_ctx_data() such that
+			 * if we race with perf_event_exit_task(), we must
+			 * observe PF_EXITING.
+			 */
+			if (task->flags & PF_EXITING) {
+				task->perf_ctx_data = NULL;
+				perf_free_ctx_data_rcu(cd);
+			}
 			return 0;
 		}

@@ -5469,6 +5479,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
 	/* Allocate everything */
 	scoped_guard (rcu) {
 		for_each_process_thread(g, p) {
+			if (p->flags & PF_EXITING)
+				continue;
 			cd = rcu_dereference(p->perf_ctx_data);
 			if (cd && !cd->global) {
 				cd->global = 1;
@@ -14568,8 +14580,11 @@ void perf_event_exit_task(struct task_struct *task)

 	/*
 	 * Detach the perf_ctx_data for the system-wide event.
+	 *
+	 * Done without holding global_ctx_data_rwsem; typically
+	 * attach_global_ctx_data() will skip over this task, but otherwise
+	 * attach_task_ctx_data() will observe PF_EXITING.
 	 */
-	guard(percpu_read)(&global_ctx_data_rwsem);
 	detach_task_ctx_data(task);
 }

* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Peter Zijlstra @ 2026-01-07 22:32 UTC
To: Namhyung Kim
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, linux-perf-users, linux-kernel

On Wed, Jan 07, 2026 at 11:28:24PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 07, 2026 at 11:01:53AM -0800, Namhyung Kim wrote:
>
> > > But yes, I suppose this can do. The question is however, how do you get
> > > into this predicament to begin with? Are you creating and destroying a
> > > lot of global LBR events or something?
> >
> > I think it's just because there are too many tasks in the system like
> > O(100K). And any thread going to exit needs to wait for
> > attach_global_ctx_data() to finish the iteration over every task.
>
> OMG, so many tasks ...
>
> > > Would it make sense to delay detach_global_ctx_data() for a second or
> > > so? That is, what is your event creation pattern?
> >
> > I don't think it has a special pattern, but I'm curious how we can
> > handle a race like below.
> >
> >   attach_global_ctx_data
> >     check p->flags & PF_EXITING
> >                                        do_exit
> >     (preemption)                         set PF_EXITING
> >                                          detach_task_ctx_data()
> >     check p->perf_ctx_data
> >     attach_task_ctx_data()  ---> memory leak
>
> Oh right. Something like so perhaps?
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3c2a491200c6..e5e716420eb3 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5421,9 +5421,19 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
>  		return -ENOMEM;
>
>  	for (;;) {
> -		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
> +		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
>  			if (old)
>  				perf_free_ctx_data_rcu(old);
> +			/*
> +			 * try_cmpxchg() pairs with try_cmpxchg() from
> +			 * detach_task_ctx_data() such that
> +			 * if we race with perf_event_exit_task(), we must
> +			 * observe PF_EXITING.
> +			 */
> +			if (task->flags & PF_EXITING) {
> +				task->perf_ctx_data = NULL;
> +				perf_free_ctx_data_rcu(cd);

Ugh and now it can race and do a double free, another try_cmpxchg() is
needed here.

> +			}
>  			return 0;
>  		}
>
> @@ -5469,6 +5479,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
>  	/* Allocate everything */
>  	scoped_guard (rcu) {
>  		for_each_process_thread(g, p) {
> +			if (p->flags & PF_EXITING)
> +				continue;
>  			cd = rcu_dereference(p->perf_ctx_data);
>  			if (cd && !cd->global) {
>  				cd->global = 1;
> @@ -14568,8 +14580,11 @@ void perf_event_exit_task(struct task_struct *task)
>
>  	/*
>  	 * Detach the perf_ctx_data for the system-wide event.
> +	 *
> +	 * Done without holding global_ctx_data_rwsem; typically
> +	 * attach_global_ctx_data() will skip over this task, but otherwise
> +	 * attach_task_ctx_data() will observe PF_EXITING.
>  	 */
> -	guard(percpu_read)(&global_ctx_data_rwsem);
>  	detach_task_ctx_data(task);
>  }
>

* Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
From: Namhyung Kim @ 2026-01-08 19:56 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, linux-perf-users, linux-kernel

On Wed, Jan 07, 2026 at 11:32:56PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 07, 2026 at 11:28:24PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 07, 2026 at 11:01:53AM -0800, Namhyung Kim wrote:
> >
> > > > But yes, I suppose this can do. The question is however, how do you get
> > > > into this predicament to begin with? Are you creating and destroying a
> > > > lot of global LBR events or something?
> > >
> > > I think it's just because there are too many tasks in the system like
> > > O(100K). And any thread going to exit needs to wait for
> > > attach_global_ctx_data() to finish the iteration over every task.
> >
> > OMG, so many tasks ...
> >
> > > > Would it make sense to delay detach_global_ctx_data() for a second or
> > > > so? That is, what is your event creation pattern?
> > >
> > > I don't think it has a special pattern, but I'm curious how we can
> > > handle a race like below.
> > >
> > >   attach_global_ctx_data
> > >     check p->flags & PF_EXITING
> > >                                        do_exit
> > >     (preemption)                         set PF_EXITING
> > >                                          detach_task_ctx_data()
> > >     check p->perf_ctx_data
> > >     attach_task_ctx_data()  ---> memory leak
> >
> > Oh right. Something like so perhaps?
> >
> > ---
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 3c2a491200c6..e5e716420eb3 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -5421,9 +5421,19 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
> >  		return -ENOMEM;
> >
> >  	for (;;) {
> > -		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
> > +		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
> >  			if (old)
> >  				perf_free_ctx_data_rcu(old);
> > +			/*
> > +			 * try_cmpxchg() pairs with try_cmpxchg() from
> > +			 * detach_task_ctx_data() such that
> > +			 * if we race with perf_event_exit_task(), we must
> > +			 * observe PF_EXITING.
> > +			 */
> > +			if (task->flags & PF_EXITING) {
> > +				task->perf_ctx_data = NULL;
> > +				perf_free_ctx_data_rcu(cd);
>
> Ugh and now it can race and do a double free, another try_cmpxchg() is
> needed here.

Thanks! Something like this?

Namhyung


diff --git a/kernel/events/core.c b/kernel/events/core.c
index 376fb07d869b8b50..cf252d8f49b2b259 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5421,9 +5421,20 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
 		return -ENOMEM;

 	for (;;) {
-		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
+		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
 			if (old)
 				perf_free_ctx_data_rcu(old);
+			/*
+			 * try_cmpxchg() pairs with try_cmpxchg() from
+			 * detach_task_ctx_data() such that
+			 * if we race with perf_event_exit_task(), we must
+			 * observe PF_EXITING.
+			 */
+			if (task->flags & PF_EXITING) {
+				/* detach_task_ctx_data() may free it already */
+				if (try_cmpxchg(&task->perf_ctx_data, &cd, NULL))
+					perf_free_ctx_data_rcu(cd);
+			}
 			return 0;
 		}

@@ -5469,6 +5480,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
 	/* Allocate everything */
 	scoped_guard (rcu) {
 		for_each_process_thread(g, p) {
+			if (p->flags & PF_EXITING)
+				continue;
 			cd = rcu_dereference(p->perf_ctx_data);
 			if (cd && !cd->global) {
 				cd->global = 1;
@@ -14562,8 +14575,11 @@ void perf_event_exit_task(struct task_struct *task)

 	/*
 	 * Detach the perf_ctx_data for the system-wide event.
+	 *
+	 * Done without holding global_ctx_data_rwsem; typically
+	 * attach_global_ctx_data() will skip over this task, but otherwise
+	 * attach_task_ctx_data() will observe PF_EXITING.
 	 */
-	guard(percpu_read)(&global_ctx_data_rwsem);
 	detach_task_ctx_data(task);
 }

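The difference between the two cleanup variants discussed in this thread (clear and free unconditionally, versus clear with a second compare-and-exchange and free only on success) can be checked with a small user-space replay. This is a sketch under assumptions rather than the kernel implementation: `struct task`, `release()`, `task_exit()`, `attach_cleanup()` and `replay()` are hypothetical names, C11 atomics stand in for try_cmpxchg(), and releases are counted instead of actually freeing memory so that a double free shows up as a count of 2.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define PF_EXITING 0x4	/* stand-in bit for the kernel flag (assumption) */

struct ctx_data { int frees; };	/* counts releases instead of freeing */

struct task {
	atomic_uint flags;
	_Atomic(struct ctx_data *) ctx;
};

static void release(struct ctx_data *cd)
{
	if (cd)
		cd->frees++;
}

/* Exit side: set the flag, then take ownership of the pointer and release it. */
static void task_exit(struct task *t)
{
	atomic_fetch_or(&t->flags, PF_EXITING);
	release(atomic_exchange(&t->ctx, NULL));
}

/* Attach-side cleanup after a successful install, once PF_EXITING is seen. */
static void attach_cleanup(struct task *t, struct ctx_data *cd, bool use_cmpxchg)
{
	struct ctx_data *expected = cd;

	if (!(atomic_load(&t->flags) & PF_EXITING))
		return;
	if (!use_cmpxchg) {
		/* first variant: clear and release unconditionally */
		atomic_store(&t->ctx, NULL);
		release(cd);
	} else if (atomic_compare_exchange_strong(&t->ctx, &expected, NULL)) {
		/* second variant: release only if this side cleared the pointer */
		release(cd);
	}
}

/*
 * Replays one ordering after attach has installed cd: either the exit
 * detaches before the attach-side cleanup (exit_first), or after it.
 * Returns how many times cd was released; anything other than 1 is a bug.
 */
static int replay(bool use_cmpxchg, bool exit_first)
{
	struct ctx_data cd = { 0 };
	struct task t = { 0 };

	atomic_store(&t.ctx, &cd);	/* attach: install succeeded */
	if (exit_first) {
		task_exit(&t);
		attach_cleanup(&t, &cd, use_cmpxchg);
	} else {
		atomic_fetch_or(&t.flags, PF_EXITING);	/* exit: flag set ... */
		attach_cleanup(&t, &cd, use_cmpxchg);
		release(atomic_exchange(&t.ctx, NULL));	/* ... detach runs later */
	}
	assert(atomic_load(&t.ctx) == NULL);	/* pointer is cleared either way */
	return cd.frees;
}
```

With the unconditional variant, the ordering where the exit detaches first releases the same object twice; with the compare-and-exchange variant, exactly one side releases it in both orderings. The replay only covers these two sequential orderings and says nothing about memory ordering on real hardware.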
Thread overview: 7+ messages
[not found] <aUnVfxDtLNUDJM_v@google.com>
2025-12-22 23:36 ` [BUG] perf/core: Task stuck on global_ctx_data_rwsem Namhyung Kim
2026-01-06 22:34 ` Namhyung Kim
2026-01-07 9:16 ` Peter Zijlstra
2026-01-07 19:01 ` Namhyung Kim
2026-01-07 22:28 ` Peter Zijlstra
2026-01-07 22:32 ` Peter Zijlstra
2026-01-08 19:56 ` Namhyung Kim