From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B01BF1B81CA; Thu, 12 Feb 2026 01:10:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770858620; cv=none; b=Y41lpMsGYBN+lM+U/l9rTKO1qVvqH2pn/QJ6IFNLt47kmwK/Q6eSnGjLuF97VfIqmdKF6oj/ltQLqz7Oc/G62i1v8GDVqlWuFJP58Mq0AReAE+8SH/tm1fy5Nj9KrjzrV2KOjzZTbPVQSezJ5bnCaVZhX1fX754Tcd2Em/s5xlQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1770858620; c=relaxed/simple; bh=3EDqfbwCBf5FnZ/AoYq0Uf0EzSt1GquuJEJw+pJY3Ls=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=rJ96ljl/7Yy7rvZRvX4BhOuro3JbdQaoQsiHfkjBibuBJXMeq8hc1Zvlwgd1DsP4CpKisdB3OG8AwjHCtVteXFGUaJY9Si3eyowi7/4rZD3it5wmkjon0QsacCmIVfiOWxtRW86xQB1vzWsCtnYLSZ/wOHOKl6zHVaASBmnOfIg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Y1/2P6Kf; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Y1/2P6Kf" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 87739C4CEF7; Thu, 12 Feb 2026 01:10:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1770858618; bh=3EDqfbwCBf5FnZ/AoYq0Uf0EzSt1GquuJEJw+pJY3Ls=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=Y1/2P6KfJiq7HNX01r5yeHGoVTXA2k0kvFZIfVkgw82b97m9ldDRvKMSuf1EWxCbO WqBqv4CX0tVPivTz5u8xOaW6J5Jjj6inBLQSdpWIDh2UsEQjADznVFRX/n/BBgdKZZ rMI9U458UNV0ApXV3d4uR4IgcXhTeL9JLqk2H/H3/k9yCPJOqGqdTEMqpGR9D1N+CS o3QS4YwdfnhohNuw414zE2Zc71w/NsU6s998X9NmhRQSAKLoX8PBV/JQIqiTmsaYFb 6nFDroVxSR/jp1yYzJedg0zgoynofdzrVZhouDdfFezPEPpuO88fjHtXNb0J+m8b3k zrZoPWRAu6kCQ== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Namhyung Kim , Rosalie Fang , Peter Zijlstra , Sasha Levin , mingo@redhat.com, acme@kernel.org, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH AUTOSEL 6.19-6.18] perf/core: Fix slow perf_event_task_exit() with LBR callstacks Date: Wed, 11 Feb 2026 20:09:36 -0500 Message-ID: <20260212010955.3480391-13-sashal@kernel.org> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20260212010955.3480391-1-sashal@kernel.org> References: <20260212010955.3480391-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.19 Content-Transfer-Encoding: 8bit From: Namhyung Kim [ Upstream commit 4960626f956d63dce57f099016c2ecbe637a8229 ] I got a report that a task is stuck in perf_event_exit_task() waiting for global_ctx_data_rwsem. On large systems with lots threads, it'd have performance issues when it grabs the lock to iterate all threads in the system to allocate the context data. And it'd block task exit path which is problematic especially under memory pressure. perf_event_open perf_event_alloc attach_perf_ctx_data attach_global_ctx_data percpu_down_write (global_ctx_data_rwsem) for_each_process_thread alloc_task_ctx_data do_exit perf_event_exit_task percpu_down_read (global_ctx_data_rwsem) It should not hold the global_ctx_data_rwsem on the exit path. Let's skip allocation for exiting tasks and free the data carefully. Reported-by: Rosalie Fang Suggested-by: Peter Zijlstra Signed-off-by: Namhyung Kim Signed-off-by: Peter Zijlstra (Intel) Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: --- ## Comprehensive Analysis ### 1. COMMIT MESSAGE ANALYSIS **Subject**: "perf/core: Fix slow perf_event_task_exit() with LBR callstacks" The commit message clearly describes a **performance regression / system hang** scenario. Key indicators: - **"I got a report that a task is stuck"** - real-world report from a Google engineer (Rosalie Fang) - **"On large systems with lots threads, it'd have performance issues"** - reproducible on production systems - **"it'd block task exit path which is problematic especially under memory pressure"** - can escalate to system unresponsiveness - **Reported-by: Rosalie Fang ** - actual user report - **Suggested-by: Peter Zijlstra** and **Signed-off-by: Peter Zijlstra** - the perf subsystem maintainer suggested and approved the fix The commit message illustrates the exact deadlock scenario: 1. `perf_event_open` -> `attach_global_ctx_data` takes `global_ctx_data_rwsem` as a **writer** and iterates all threads to allocate context data 2. Simultaneously, a task calling `do_exit` -> `perf_event_exit_task` tries to take `global_ctx_data_rwsem` as a **reader** 3. On large systems with many threads, the write lock is held for a long time during the `for_each_process_thread` loop, blocking ALL task exits This is effectively a **priority inversion / livelock** scenario where task exit (a critical path) is blocked by a potentially very long operation (iterating and allocating for all threads in the system). ### 2. CODE CHANGE ANALYSIS The patch makes three coordinated changes: #### Change 1: Skip exiting tasks in `attach_global_ctx_data()` (lines 5483-5484 in the diff) ```c for_each_process_thread(g, p) { if (p->flags & PF_EXITING) continue; ``` This adds a check to skip tasks that are already exiting during the global iteration. No point allocating context data for a task that's about to die. #### Change 2: Detect and undo allocation for exiting tasks in `attach_task_ctx_data()` (lines 5427-5434 in the diff) After successfully attaching via `try_cmpxchg`, the code now checks: ```c if (task->flags & PF_EXITING) { /* detach_task_ctx_data() may free it already */ if (try_cmpxchg(&task->perf_ctx_data, &cd, NULL)) perf_free_ctx_data_rcu(cd); } ``` This handles the race where `attach_global_ctx_data()` allocates for a task that starts exiting between the `PF_EXITING` check and the `try_cmpxchg`. If we detect the task is exiting, we undo our allocation. The key insight: The `try_cmpxchg()` in `attach_task_ctx_data()` pairs with the `try_cmpxchg()` in `detach_task_ctx_data()` to provide total ordering. If `attach_task_ctx_data()` succeeds the cmpxchg first, it will see `PF_EXITING` and undo the allocation. If `detach_task_ctx_data()` (called from `perf_event_exit_task`) succeeds first, the undo cmpxchg will fail (because `cd` is no longer at `task->perf_ctx_data`), which is fine. #### Change 3: Remove lock from `perf_event_exit_task()` (lines 14558-14603 in the diff) The critical change: ```c // BEFORE: guard(percpu_read)(&global_ctx_data_rwsem); detach_task_ctx_data(task); // AFTER (no lock): detach_task_ctx_data(task); ``` The comment explains the correctness: > Done without holding global_ctx_data_rwsem; typically attach_global_ctx_data() will skip over this task, but otherwise attach_task_ctx_data() will observe PF_EXITING. **Correctness argument**: - `PF_EXITING` is set in `exit_signals()` (line 913 of exit.c) **before** `perf_event_exit_task()` is called (line 951) - The `try_cmpxchg()` operations provide atomic visibility of `task->perf_ctx_data` changes - If `attach_global_ctx_data()` races with exit: either it sees `PF_EXITING` and skips, or if it allocates, `attach_task_ctx_data()` sees `PF_EXITING` after the cmpxchg and undoes the allocation - `detach_task_ctx_data()` uses `try_cmpxchg` to atomically clear the pointer, so concurrent operations are safe ### 3. BUG CLASSIFICATION This is a **performance regression / system hang** fix. The `global_ctx_data_rwsem` write lock blocks ALL readers (task exits) while iterating ALL threads. On systems with thousands of threads: - Opening a perf event with LBR callstacks causes the write lock to be held for a long time - Every task trying to exit during this period blocks on the read lock - Under memory pressure, blocked task exits compound the problem (tasks holding memory can't release it) - This can effectively hang the system ### 4. SCOPE AND RISK ASSESSMENT **Lines changed**: ~25 lines added/changed in a single file (`kernel/events/core.c`) **Files touched**: 1 **Complexity**: Moderate - the synchronization relies on cmpxchg + PF_EXITING flag ordering **Risk**: LOW-MEDIUM - The fix is self-contained within the perf subsystem - The cmpxchg-based synchronization replaces a lock-based approach, which is more lockless but well-reasoned - Peter Zijlstra (the maintainer) both suggested and signed off on the approach - The worst case if the fix has a subtle race: a small memory leak of one `perf_ctx_data` allocation (not a crash) ### 5. USER IMPACT **Who is affected**: Anyone using perf with LBR callstacks (Intel) in system-wide mode on systems with many threads. This is common on: - Large servers doing production profiling - CI/CD systems running perf monitoring - Google's production fleet (where the bug was reported) **Severity**: HIGH - can block the task exit path, which is critical. Under memory pressure this can effectively hang the system. ### 6. STABILITY & TRUST INDICATORS - **Reported-by**: Rosalie Fang from Google - real production issue - **Suggested-by**: Peter Zijlstra (perf maintainer) - expert-designed solution - **Signed-off-by**: Both Namhyung Kim (author) and Peter Zijlstra (maintainer) - **Link**: Has a patch.msgid.link reference showing proper mailing list review ### 7. DEPENDENCY CHECK The affected code was introduced in: - `cb43691293390` "perf: Save PMU specific data in task_struct" (v6.15-rc1) - `506e64e710ff9` "perf: attach/detach PMU specific data" (v6.15-rc1) **Applicable stable trees**: 6.15.y, 6.16.y, 6.17.y, 6.18.y **NOT applicable**: 6.14.y and earlier (code doesn't exist) The patch is self-contained - it doesn't depend on any other commits. It modifies only the code introduced by the above commits and the logic is self-sufficient. ### 8. MINOR CLEANUP CONCERN The patch also removes a `(struct perf_ctx_data **)` cast from one `try_cmpxchg` call in `attach_task_ctx_data()`. This is a minor whitespace/type-correctness issue and doesn't affect functionality - `try_cmpxchg` on an `__rcu` pointer works the same way. ### CONCLUSION This commit fixes a real, reported, production-impacting performance bug that can cause task exit to hang on systems with many threads when LBR callstack profiling is active. The fix is: 1. **Obviously correct**: The synchronization relies on the well- established pattern of `PF_EXITING` flag + atomic cmpxchg operations, with clear ordering guarantees 2. **Fixes a real bug**: Reported by a Google engineer on production systems 3. **Important**: Can cause system hangs, especially under memory pressure 4. **Small and contained**: ~25 lines in a single file, touching only the perf subsystem 5. **No new features**: Just removes unnecessary lock contention 6. **Expert-reviewed**: Suggested by Peter Zijlstra, the perf subsystem maintainer The fix should be backported to all stable trees containing the affected code (6.15.y through 6.18.y). **YES** kernel/events/core.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 8cca800946248..69c56cad88a89 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5280,9 +5280,20 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache, return -ENOMEM; for (;;) { - if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) { + if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) { if (old) perf_free_ctx_data_rcu(old); + /* + * Above try_cmpxchg() pairs with try_cmpxchg() from + * detach_task_ctx_data() such that + * if we race with perf_event_exit_task(), we must + * observe PF_EXITING. + */ + if (task->flags & PF_EXITING) { + /* detach_task_ctx_data() may free it already */ + if (try_cmpxchg(&task->perf_ctx_data, &cd, NULL)) + perf_free_ctx_data_rcu(cd); + } return 0; } @@ -5328,6 +5339,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache) /* Allocate everything */ scoped_guard (rcu) { for_each_process_thread(g, p) { + if (p->flags & PF_EXITING) + continue; cd = rcu_dereference(p->perf_ctx_data); if (cd && !cd->global) { cd->global = 1; @@ -14294,8 +14307,11 @@ void perf_event_exit_task(struct task_struct *task) /* * Detach the perf_ctx_data for the system-wide event. + * + * Done without holding global_ctx_data_rwsem; typically + * attach_global_ctx_data() will skip over this task, but otherwise + * attach_task_ctx_data() will observe PF_EXITING. */ - guard(percpu_read)(&global_ctx_data_rwsem); detach_task_ctx_data(task); } -- 2.51.0