From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Gao Yingjie <gaoyingjie@uniontech.com>,
Chen Ridong <chenridong@huawei.com>, Teju Heo <tj@kernel.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.6 06/70] cgroup: split cgroup_destroy_wq into 3 workqueues
Date: Mon, 22 Sep 2025 21:29:06 +0200 [thread overview]
Message-ID: <20250922192404.646095954@linuxfoundation.org> (raw)
In-Reply-To: <20250922192404.455120315@linuxfoundation.org>
6.6-stable review patch. If anyone has any objections, please let me know.
------------------
From: Chen Ridong <chenridong@huawei.com>
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]
A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0 CPU1
mount perf_event umount net_prio
cgroup1_get_tree cgroup_kill_sb
rebind_subsystems // root destruction enqueues
// cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
css_free_rwork_fn
cgroup_destroy_root
cgroup_lock_and_drain_offline
// some perf descendants are dying
// cgroup_destroy_wq max_active = 1
// waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Teju Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
1 file changed, 36 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e8ef062f6ca05..5135838b5899f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -123,8 +123,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
+ *
+ * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
+ * 1. umount net_prio
+ * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
+ * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
+ * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
+ * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
+ * which can never complete as it's behind in the same queue and
+ * workqueue's max_active is 1.
*/
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
/* generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5435,7 +5458,7 @@ static void css_release_work_fn(struct work_struct *work)
cgroup_unlock();
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
}
static void css_release(struct percpu_ref *ref)
@@ -5444,7 +5467,7 @@ static void css_release(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_release_wq, &css->destroy_work);
}
static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5566,7 +5589,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
err_free_css:
list_del_rcu(&css->rstat_css_node);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
return ERR_PTR(err);
}
@@ -5801,7 +5824,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
if (atomic_dec_and_test(&css->online_cnt)) {
INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_offline_wq, &css->destroy_work);
}
}
@@ -6173,8 +6196,14 @@ static int __init cgroup_wq_init(void)
* We would prefer to do this in cgroup_init() above, but that
* is called before init_workqueues(): so leave this until after.
*/
- cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
- BUG_ON(!cgroup_destroy_wq);
+ cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+ BUG_ON(!cgroup_offline_wq);
+
+ cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+ BUG_ON(!cgroup_release_wq);
+
+ cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+ BUG_ON(!cgroup_free_wq);
return 0;
}
core_initcall(cgroup_wq_init);
--
2.51.0
next prev parent reply other threads:[~2025-09-22 19:34 UTC|newest]
Thread overview: 80+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-22 19:29 [PATCH 6.6 00/70] 6.6.108-rc1 review Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 01/70] wifi: wilc1000: avoid buffer overflow in WID string configuration Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 02/70] ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 03/70] wifi: mac80211: increase scan_ies_len for S1G Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 04/70] wifi: mac80211: fix incorrect type for ret Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 05/70] pcmcia: omap_cf: Mark driver struct with __refdata to prevent section mismatch Greg Kroah-Hartman
2025-09-22 19:29 ` Greg Kroah-Hartman [this message]
2025-09-22 19:29 ` [PATCH 6.6 07/70] btrfs: fix invalid extref key setup when replaying dentry Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 08/70] um: virtio_uml: Fix use-after-free after put_device in probe Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 09/70] dpaa2-switch: fix buffer pool seeding for control traffic Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 10/70] qed: Dont collect too many protection override GRC elements Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 11/70] bonding: set random address only when slaves already exist Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 12/70] mptcp: set remote_deny_join_id0 on SYN recv Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 13/70] mptcp: tfo: record deny join id0 info Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 14/70] selftests: mptcp: sockopt: fix error messages Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 15/70] net: natsemi: fix `rx_dropped` double accounting on `netif_rx()` failure Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 16/70] i40e: remove redundant memory barrier when cleaning Tx descs Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 17/70] net/mlx5e: Consider aggregated port speed during rate configuration Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 18/70] net/mlx5e: Harden uplink netdev access against device unbind Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 19/70] bonding: dont set oif to bond dev when getting NS target destination Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 20/70] tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 21/70] tls: make sure to abort the stream if headers are bogus Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 22/70] Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 23/70] net: liquidio: fix overflow in octeon_init_instr_queue() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 24/70] cnic: Fix use-after-free bugs in cnic_delete_task Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 25/70] octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 26/70] ksmbd: smbdirect: validate data_offset and data_length field of smb_direct_data_transfer Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 27/70] ksmbd: smbdirect: verify remaining_data_length respects max_fragmented_recv_size Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 28/70] nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/* Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 29/70] crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 30/70] power: supply: bq27xxx: fix error return in case of no bq27000 hdq battery Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 31/70] power: supply: bq27xxx: restrict no-battery detection to bq27000 Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 32/70] LoongArch: Update help info of ARCH_STRICT_ALIGN Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 33/70] LoongArch: Align ACPI structures if ARCH_STRICT_ALIGN enabled Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 34/70] LoongArch: Check the return value when creating kobj Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 35/70] iommu/vt-d: Fix __domain_mapping()s usage of switch_to_super_page() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 36/70] btrfs: tree-checker: fix the incorrect inode ref size check Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 37/70] ASoC: qcom: audioreach: Fix lpaif_type configuration for the I2S interface Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 38/70] ASoC: qcom: q6apm-lpass-dais: Fix NULL pointer dereference if source graph failed Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 39/70] ASoC: qcom: q6apm-lpass-dais: Fix missing set_fmt DAI op for I2S Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 40/70] mmc: mvsdio: Fix dma_unmap_sg() nents value Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 41/70] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 42/70] net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 43/70] rds: ib: Increment i_fastreg_wrs before bailing out Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 44/70] selftests: mptcp: connect: catch IO errors on listen side Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 45/70] selftests: mptcp: avoid spurious errors on TCP disconnect Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 46/70] ALSA: hda/realtek: Fix mute led for HP Laptop 15-dw4xx Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 47/70] io_uring: backport io_should_terminate_tw() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 48/70] io_uring: include dying ring in task_work "should cancel" state Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 49/70] ASoC: wm8940: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 50/70] ASoC: wm8940: Correct typo in control name Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 51/70] ASoC: wm8974: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 52/70] ASoC: SOF: Intel: hda-stream: Fix incorrect variable used in error message Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 53/70] drm: bridge: anx7625: Fix NULL pointer dereference with early IRQ Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 54/70] drm: bridge: cdns-mhdp8546: Fix missing mutex unlock on error path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 55/70] crypto: af_alg - Set merge to zero early in af_alg_sendmsg Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 56/70] smb: client: fix smbdirect_recv_io leak in smbd_negotiate() error path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 57/70] vmxnet3: unregister xdp rxq info in the reset path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 58/70] mptcp: pm: nl: announce deny-join-id0 flag Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 59/70] selftests: mptcp: userspace pm: validate " Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 60/70] phy: Use device_get_match_data() Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 61/70] phy: ti: omap-usb2: fix device leak at unbind Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 62/70] xhci: dbc: decouple endpoint allocation from initialization Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 63/70] xhci: dbc: Fix full DbC transfer ring after several reconnects Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 64/70] iommu/amd/pgtbl: Fix possible race while increase page table level Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 65/70] rtc: pcf2127: fix SPI command byte for PCF2131 backport Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 66/70] mptcp: propagate shutdown to subflows when possible Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 67/70] minmax: avoid overly complicated constant expressions in VM code Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 68/70] minmax: simplify and clarify min_t()/max_t() implementation Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 69/70] minmax: add a few more MIN_T/MAX_T users Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 70/70] Revert "loop: Avoid updating block size under exclusive owner" Greg Kroah-Hartman
2025-09-22 22:48 ` [PATCH 6.6 00/70] 6.6.108-rc1 review Florian Fainelli
2025-09-23 6:53 ` Naresh Kamboju
2025-09-23 7:26 ` Brett A C Sheffield
2025-09-23 10:16 ` [PATCH 6.6 00/70] " Peter Schneider
2025-09-23 13:10 ` Jon Hunter
2025-09-23 20:39 ` Miguel Ojeda
2025-09-24 0:29 ` Shuah Khan
2025-09-24 6:57 ` Hardik Garg
2025-09-24 8:39 ` Mark Brown
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250922192404.646095954@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=chenridong@huawei.com \
--cc=gaoyingjie@uniontech.com \
--cc=patches@lists.linux.dev \
--cc=sashal@kernel.org \
--cc=stable@vger.kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox