From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Gao Yingjie <gaoyingjie@uniontech.com>,
Chen Ridong <chenridong@huawei.com>, Teju Heo <tj@kernel.org>,
Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.4 34/81] cgroup: split cgroup_destroy_wq into 3 workqueues
Date: Tue, 30 Sep 2025 16:46:36 +0200 [thread overview]
Message-ID: <20250930143821.094175377@linuxfoundation.org> (raw)
In-Reply-To: <20250930143819.654157320@linuxfoundation.org>
5.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Chen Ridong <chenridong@huawei.com>
[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]
A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0 CPU1
mount perf_event umount net_prio
cgroup1_get_tree cgroup_kill_sb
rebind_subsystems // root destruction enqueues
// cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
css_free_rwork_fn
cgroup_destroy_root
cgroup_lock_and_drain_offline
// some perf descendants are dying
// cgroup_destroy_wq max_active = 1
// waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Teju Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
1 file changed, 36 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 801022a8899b5..b761d70eccbf4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -114,8 +114,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
+ *
+ * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
+ * 1. umount net_prio
+ * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
+ * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
+ * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
+ * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
+ * which can never complete as it's behind in the same queue and
+ * workqueue's max_active is 1.
*/
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
/* generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5233,7 +5256,7 @@ static void css_release_work_fn(struct work_struct *work)
mutex_unlock(&cgroup_mutex);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
}
static void css_release(struct percpu_ref *ref)
@@ -5242,7 +5265,7 @@ static void css_release(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_release_wq, &css->destroy_work);
}
static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5373,7 +5396,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
err_free_css:
list_del_rcu(&css->rstat_css_node);
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
return ERR_PTR(err);
}
@@ -5626,7 +5649,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
if (atomic_dec_and_test(&css->online_cnt)) {
INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_offline_wq, &css->destroy_work);
}
}
@@ -6002,8 +6025,14 @@ static int __init cgroup_wq_init(void)
* We would prefer to do this in cgroup_init() above, but that
* is called before init_workqueues(): so leave this until after.
*/
- cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
- BUG_ON(!cgroup_destroy_wq);
+ cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+ BUG_ON(!cgroup_offline_wq);
+
+ cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+ BUG_ON(!cgroup_release_wq);
+
+ cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+ BUG_ON(!cgroup_free_wq);
return 0;
}
core_initcall(cgroup_wq_init);
--
2.51.0
next prev parent reply other threads:[~2025-09-30 14:52 UTC|newest]
Thread overview: 88+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-30 14:46 [PATCH 5.4 00/81] 5.4.300-rc1 review Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 01/81] usb: hub: Fix flushing of delayed work used for post resume purposes Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 02/81] net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 03/81] NFSv4: Dont clear capabilities that wont be reset Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 04/81] tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 05/81] EDAC/altera: Delete an inappropriate dma_free_coherent() call Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 06/81] ocfs2: fix recursive semaphore deadlock in fiemap call Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 07/81] mtd: rawnand: stm32_fmc2: fix ECC overwrite Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 08/81] fuse: check if copy_file_range() returns larger than requested size Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 09/81] fuse: prevent overflow in copy_file_range return value Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 10/81] mm/khugepaged: fix the address passed to notifier on testing young Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 11/81] mtd: rawnand: stm32_fmc2: avoid overlapping mappings on ECC buffer Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 12/81] mtd: nand: raw: atmel: Fix comment in timings preparation Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 13/81] mtd: nand: raw: atmel: Respect tAR, tCLR in read setup timing Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 14/81] tty: hvc_console: Call hvc_kick in hvc_write unconditionally Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 15/81] USB: serial: option: add Telit Cinterion FN990A w/audio compositions Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 16/81] USB: serial: option: add Telit Cinterion LE910C4-WWX new compositions Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 17/81] net: fec: Fix possible NPD in fec_enet_phy_reset_after_clk_enable() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 18/81] igb: fix link test skipping when interface is admin down Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 19/81] genirq/affinity: Add irq_update_affinity_desc() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 20/81] genirq: Export affinity setter for modules Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 21/81] genirq: Provide new interfaces for affinity hints Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 22/81] i40e: Use irq_update_affinity_hint() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 23/81] i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 24/81] can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 25/81] can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 26/81] dmaengine: ti: edma: Fix memory allocation size for queue_priority_map Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 27/81] dmaengine: qcom: bam_dma: Fix DT error handling for num-channels/ees Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 28/81] phy: ti-pipe3: fix device leak at unbind Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 29/81] soc: qcom: mdt_loader: Deal with zero e_shentsize Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 30/81] mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 31/81] ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 32/81] wifi: mac80211: fix incorrect type for ret Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 33/81] pcmcia: omap_cf: Mark driver struct with __refdata to prevent section mismatch Greg Kroah-Hartman
2025-09-30 14:46 ` Greg Kroah-Hartman [this message]
2025-09-30 14:46 ` [PATCH 5.4 35/81] net: natsemi: fix `rx_dropped` double accounting on `netif_rx()` failure Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 36/81] i40e: remove redundant memory barrier when cleaning Tx descs Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 37/81] tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 38/81] Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 39/81] net: liquidio: fix overflow in octeon_init_instr_queue() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 40/81] cnic: Fix use-after-free bugs in cnic_delete_task Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 41/81] nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/* Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 42/81] power: supply: bq27xxx: fix error return in case of no bq27000 hdq battery Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 43/81] power: supply: bq27xxx: restrict no-battery detection to bq27000 Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 44/81] mmc: mvsdio: Fix dma_unmap_sg() nents value Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 45/81] rds: ib: Increment i_fastreg_wrs before bailing out Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 46/81] ASoC: wm8940: Correct typo in control name Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 47/81] ASoC: wm8974: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 48/81] ASoC: SOF: Intel: hda-stream: Fix incorrect variable used in error message Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 49/81] usb: gadget: dummy_hcd: remove usage of list iterator past the loop body Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 50/81] USB: gadget: dummy-hcd: Fix locking bug in RT-enabled kernels Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 51/81] serial: sc16is7xx: fix bug in flow control levels init Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 52/81] net: rfkill: gpio: add DT support Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 53/81] net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 54/81] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 55/81] ALSA: usb-audio: Fix block comments in mixer_quirks Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 56/81] ALSA: usb-audio: Avoid multiple assignments " Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 57/81] ALSA: usb-audio: Simplify NULL comparison " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 58/81] ALSA: usb-audio: Remove unneeded wmb() " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 59/81] ALSA: usb-audio: Add mixer quirk for Sony DualSense PS5 Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 60/81] ALSA: usb-audio: Convert comma to semicolon Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 61/81] ALSA: usb-audio: Fix build with CONFIG_INPUT=n Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 62/81] usb: core: Add 0x prefix to quirks debug output Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 63/81] IB/mlx5: Fix obj_type mismatch for SRQ event subscriptions Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 64/81] can: rcar_can: rcar_can_resume(): fix s2ram with PSCI Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 65/81] can: hi311x: populate ndo_change_mtu() to prevent buffer overflow Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 66/81] can: sun4i_can: " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 67/81] can: mcba_usb: " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 68/81] can: peak_usb: fix shift-out-of-bounds issue Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 69/81] drm/gma500: Fix null dereference in hdmi teardown Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 70/81] i40e: fix idx validation in i40e_validate_queue_map Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 71/81] i40e: fix input validation logic for action_meta Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 72/81] i40e: add max boundary check for VF filters Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 73/81] fbcon: fix integer overflow in fbcon_do_set_font Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 74/81] fbcon: Fix OOB access in font allocation Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 75/81] mm/migrate_device: dont add folio to be freed to LRU in migrate_device_finalize() Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 76/81] i40e: increase max descriptors for XL710 Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 77/81] i40e: add validation for ring_len param Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 78/81] i40e: fix idx validation in config queues msg Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 79/81] i40e: fix validation of VF state in get resources Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 80/81] i40e: add mask to apply valid bits for itr_idx Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 81/81] mm/hugetlb: fix folio is still mapped when deleted Greg Kroah-Hartman
2025-09-30 17:06 ` [PATCH 5.4 00/81] 5.4.300-rc1 review Florian Fainelli
2025-09-30 18:52 ` Brett A C Sheffield
2025-10-01 9:11 ` [PATCH 5.4 00/81] " Jon Hunter
2025-10-01 12:07 ` Naresh Kamboju
2025-10-01 13:37 ` [External] : " ALOK TIWARI
2025-10-01 16:21 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250930143821.094175377@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=chenridong@huawei.com \
--cc=gaoyingjie@uniontech.com \
--cc=patches@lists.linux.dev \
--cc=sashal@kernel.org \
--cc=stable@vger.kernel.org \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).