Linux kernel -stable discussions
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Gao Yingjie <gaoyingjie@uniontech.com>,
	Chen Ridong <chenridong@huawei.com>, Teju Heo <tj@kernel.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.6 06/70] cgroup: split cgroup_destroy_wq into 3 workqueues
Date: Mon, 22 Sep 2025 21:29:06 +0200	[thread overview]
Message-ID: <20250922192404.646095954@linuxfoundation.org> (raw)
In-Reply-To: <20250922192404.455120315@linuxfoundation.org>

6.6-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Chen Ridong <chenridong@huawei.com>

[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Teju Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e8ef062f6ca05..5135838b5899f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -123,8 +123,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * of concurrent destructions.  Use a separate workqueue so that cgroup
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
+ *
+ * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
+ * 1. umount net_prio
+ * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
+ * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
+ * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
+ * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
+ *    which can never complete as it's behind in the same queue and
+ *    workqueue's max_active is 1.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5435,7 +5458,7 @@ static void css_release_work_fn(struct work_struct *work)
 	cgroup_unlock();
 
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
-	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
 }
 
 static void css_release(struct percpu_ref *ref)
@@ -5444,7 +5467,7 @@ static void css_release(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	queue_work(cgroup_release_wq, &css->destroy_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5566,7 +5589,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_free_css:
 	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
-	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
 }
 
@@ -5801,7 +5824,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
 		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		queue_work(cgroup_offline_wq, &css->destroy_work);
 	}
 }
 
@@ -6173,8 +6196,14 @@ static int __init cgroup_wq_init(void)
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
-	BUG_ON(!cgroup_destroy_wq);
+	cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+	BUG_ON(!cgroup_offline_wq);
+
+	cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+	BUG_ON(!cgroup_release_wq);
+
+	cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+	BUG_ON(!cgroup_free_wq);
 	return 0;
 }
 core_initcall(cgroup_wq_init);
-- 
2.51.0




  parent reply	other threads:[~2025-09-22 19:34 UTC|newest]

Thread overview: 80+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-22 19:29 [PATCH 6.6 00/70] 6.6.108-rc1 review Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 01/70] wifi: wilc1000: avoid buffer overflow in WID string configuration Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 02/70] ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 03/70] wifi: mac80211: increase scan_ies_len for S1G Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 04/70] wifi: mac80211: fix incorrect type for ret Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 05/70] pcmcia: omap_cf: Mark driver struct with __refdata to prevent section mismatch Greg Kroah-Hartman
2025-09-22 19:29 ` Greg Kroah-Hartman [this message]
2025-09-22 19:29 ` [PATCH 6.6 07/70] btrfs: fix invalid extref key setup when replaying dentry Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 08/70] um: virtio_uml: Fix use-after-free after put_device in probe Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 09/70] dpaa2-switch: fix buffer pool seeding for control traffic Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 10/70] qed: Dont collect too many protection override GRC elements Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 11/70] bonding: set random address only when slaves already exist Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 12/70] mptcp: set remote_deny_join_id0 on SYN recv Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 13/70] mptcp: tfo: record deny join id0 info Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 14/70] selftests: mptcp: sockopt: fix error messages Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 15/70] net: natsemi: fix `rx_dropped` double accounting on `netif_rx()` failure Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 16/70] i40e: remove redundant memory barrier when cleaning Tx descs Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 17/70] net/mlx5e: Consider aggregated port speed during rate configuration Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 18/70] net/mlx5e: Harden uplink netdev access against device unbind Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 19/70] bonding: dont set oif to bond dev when getting NS target destination Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 20/70] tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 21/70] tls: make sure to abort the stream if headers are bogus Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 22/70] Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 23/70] net: liquidio: fix overflow in octeon_init_instr_queue() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 24/70] cnic: Fix use-after-free bugs in cnic_delete_task Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 25/70] octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 26/70] ksmbd: smbdirect: validate data_offset and data_length field of smb_direct_data_transfer Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 27/70] ksmbd: smbdirect: verify remaining_data_length respects max_fragmented_recv_size Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 28/70] nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/* Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 29/70] crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 30/70] power: supply: bq27xxx: fix error return in case of no bq27000 hdq battery Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 31/70] power: supply: bq27xxx: restrict no-battery detection to bq27000 Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 32/70] LoongArch: Update help info of ARCH_STRICT_ALIGN Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 33/70] LoongArch: Align ACPI structures if ARCH_STRICT_ALIGN enabled Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 34/70] LoongArch: Check the return value when creating kobj Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 35/70] iommu/vt-d: Fix __domain_mapping()s usage of switch_to_super_page() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 36/70] btrfs: tree-checker: fix the incorrect inode ref size check Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 37/70] ASoC: qcom: audioreach: Fix lpaif_type configuration for the I2S interface Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 38/70] ASoC: qcom: q6apm-lpass-dais: Fix NULL pointer dereference if source graph failed Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 39/70] ASoC: qcom: q6apm-lpass-dais: Fix missing set_fmt DAI op for I2S Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 40/70] mmc: mvsdio: Fix dma_unmap_sg() nents value Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 41/70] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 42/70] net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 43/70] rds: ib: Increment i_fastreg_wrs before bailing out Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 44/70] selftests: mptcp: connect: catch IO errors on listen side Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 45/70] selftests: mptcp: avoid spurious errors on TCP disconnect Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 46/70] ALSA: hda/realtek: Fix mute led for HP Laptop 15-dw4xx Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 47/70] io_uring: backport io_should_terminate_tw() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 48/70] io_uring: include dying ring in task_work "should cancel" state Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 49/70] ASoC: wm8940: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 50/70] ASoC: wm8940: Correct typo in control name Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 51/70] ASoC: wm8974: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 52/70] ASoC: SOF: Intel: hda-stream: Fix incorrect variable used in error message Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 53/70] drm: bridge: anx7625: Fix NULL pointer dereference with early IRQ Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 54/70] drm: bridge: cdns-mhdp8546: Fix missing mutex unlock on error path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 55/70] crypto: af_alg - Set merge to zero early in af_alg_sendmsg Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 56/70] smb: client: fix smbdirect_recv_io leak in smbd_negotiate() error path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 57/70] vmxnet3: unregister xdp rxq info in the reset path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 58/70] mptcp: pm: nl: announce deny-join-id0 flag Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.6 59/70] selftests: mptcp: userspace pm: validate " Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 60/70] phy: Use device_get_match_data() Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 61/70] phy: ti: omap-usb2: fix device leak at unbind Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 62/70] xhci: dbc: decouple endpoint allocation from initialization Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 63/70] xhci: dbc: Fix full DbC transfer ring after several reconnects Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 64/70] iommu/amd/pgtbl: Fix possible race while increase page table level Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 65/70] rtc: pcf2127: fix SPI command byte for PCF2131 backport Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 66/70] mptcp: propagate shutdown to subflows when possible Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 67/70] minmax: avoid overly complicated constant expressions in VM code Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 68/70] minmax: simplify and clarify min_t()/max_t() implementation Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 69/70] minmax: add a few more MIN_T/MAX_T users Greg Kroah-Hartman
2025-09-22 19:30 ` [PATCH 6.6 70/70] Revert "loop: Avoid updating block size under exclusive owner" Greg Kroah-Hartman
2025-09-22 22:48 ` [PATCH 6.6 00/70] 6.6.108-rc1 review Florian Fainelli
2025-09-23  6:53 ` Naresh Kamboju
2025-09-23  7:26 ` Brett A C Sheffield
2025-09-23 10:16 ` [PATCH 6.6 00/70] " Peter Schneider
2025-09-23 13:10 ` Jon Hunter
2025-09-23 20:39 ` Miguel Ojeda
2025-09-24  0:29 ` Shuah Khan
2025-09-24  6:57 ` Hardik Garg
2025-09-24  8:39 ` Mark Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250922192404.646095954@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=chenridong@huawei.com \
    --cc=gaoyingjie@uniontech.com \
    --cc=patches@lists.linux.dev \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox