Linux kernel -stable discussions
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Gao Yingjie <gaoyingjie@uniontech.com>,
	Chen Ridong <chenridong@huawei.com>, Teju Heo <tj@kernel.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.1 05/61] cgroup: split cgroup_destroy_wq into 3 workqueues
Date: Mon, 22 Sep 2025 21:28:58 +0200	[thread overview]
Message-ID: <20250922192403.694104160@linuxfoundation.org> (raw)
In-Reply-To: <20250922192403.524848428@linuxfoundation.org>

6.1-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Chen Ridong <chenridong@huawei.com>

[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Teju Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 7997c8021b62f..9742574ec62fd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -123,8 +123,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * of concurrent destructions.  Use a separate workqueue so that cgroup
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
+ *
+ * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
+ * 1. umount net_prio
+ * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
+ * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
+ * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
+ * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
+ *    which can never complete as it's behind in the same queue and
+ *    workqueue's max_active is 1.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5444,7 +5467,7 @@ static void css_release_work_fn(struct work_struct *work)
 	cgroup_unlock();
 
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
-	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
 }
 
 static void css_release(struct percpu_ref *ref)
@@ -5453,7 +5476,7 @@ static void css_release(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	queue_work(cgroup_release_wq, &css->destroy_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5575,7 +5598,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_free_css:
 	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
-	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
 }
 
@@ -5811,7 +5834,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
 		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		queue_work(cgroup_offline_wq, &css->destroy_work);
 	}
 }
 
@@ -6183,8 +6206,14 @@ static int __init cgroup_wq_init(void)
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
-	BUG_ON(!cgroup_destroy_wq);
+	cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+	BUG_ON(!cgroup_offline_wq);
+
+	cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+	BUG_ON(!cgroup_release_wq);
+
+	cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+	BUG_ON(!cgroup_free_wq);
 	return 0;
 }
 core_initcall(cgroup_wq_init);
-- 
2.51.0




  parent reply	other threads:[~2025-09-22 19:31 UTC|newest]

Thread overview: 78+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-22 19:28 [PATCH 6.1 00/61] 6.1.154-rc1 review Greg Kroah-Hartman
2025-09-22 19:28 ` [PATCH 6.1 01/61] ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported Greg Kroah-Hartman
2025-09-22 19:28 ` [PATCH 6.1 02/61] wifi: mac80211: increase scan_ies_len for S1G Greg Kroah-Hartman
2025-09-22 19:28 ` [PATCH 6.1 03/61] wifi: mac80211: fix incorrect type for ret Greg Kroah-Hartman
2025-09-22 19:28 ` [PATCH 6.1 04/61] pcmcia: omap_cf: Mark driver struct with __refdata to prevent section mismatch Greg Kroah-Hartman
2025-09-22 19:28 ` Greg Kroah-Hartman [this message]
2025-09-22 19:28 ` [PATCH 6.1 06/61] btrfs: fix invalid extref key setup when replaying dentry Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 07/61] um: virtio_uml: Fix use-after-free after put_device in probe Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 08/61] dpaa2-switch: fix buffer pool seeding for control traffic Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 09/61] qed: Dont collect too many protection override GRC elements Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 10/61] mptcp: set remote_deny_join_id0 on SYN recv Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 11/61] net: natsemi: fix `rx_dropped` double accounting on `netif_rx()` failure Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 12/61] i40e: remove redundant memory barrier when cleaning Tx descs Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 13/61] bonding: dont set oif to bond dev when getting NS target destination Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 14/61] tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 15/61] tls: make sure to abort the stream if headers are bogus Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 16/61] Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 17/61] net: liquidio: fix overflow in octeon_init_instr_queue() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 18/61] cnic: Fix use-after-free bugs in cnic_delete_task Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 19/61] octeontx2-pf: Fix use-after-free bugs in otx2_sync_tstamp() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 20/61] ksmbd: smbdirect: validate data_offset and data_length field of smb_direct_data_transfer Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 21/61] ksmbd: smbdirect: verify remaining_data_length respects max_fragmented_recv_size Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 22/61] nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/* Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 23/61] power: supply: bq27xxx: fix error return in case of no bq27000 hdq battery Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 24/61] power: supply: bq27xxx: restrict no-battery detection to bq27000 Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 25/61] LoongArch: Align ACPI structures if ARCH_STRICT_ALIGN enabled Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 26/61] LoongArch: Check the return value when creating kobj Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 27/61] iommu/vt-d: Fix __domain_mapping()s usage of switch_to_super_page() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 28/61] btrfs: tree-checker: fix the incorrect inode ref size check Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 29/61] ASoC: qcom: audioreach: Fix lpaif_type configuration for the I2S interface Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 30/61] ASoC: qcom: q6apm-lpass-dais: Fix missing set_fmt DAI op for I2S Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 31/61] mmc: mvsdio: Fix dma_unmap_sg() nents value Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 32/61] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 33/61] rds: ib: Increment i_fastreg_wrs before bailing out Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 34/61] selftests: mptcp: avoid spurious errors on TCP disconnect Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 35/61] ALSA: hda/realtek: Fix mute led for HP Laptop 15-dw4xx Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 36/61] io_uring: backport io_should_terminate_tw() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 37/61] io_uring: include dying ring in task_work "should cancel" state Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 38/61] ASoC: wm8940: Correct typo in control name Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 39/61] ASoC: wm8974: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 40/61] ASoC: SOF: Intel: hda-stream: Fix incorrect variable used in error message Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 41/61] drm: bridge: anx7625: Fix NULL pointer dereference with early IRQ Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 42/61] drm: bridge: cdns-mhdp8546: Fix missing mutex unlock on error path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 43/61] crypto: af_alg: Indent the loop in af_alg_sendmsg() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 44/61] crypto: af_alg - Set merge to zero early in af_alg_sendmsg Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 45/61] smb: client: fix smbdirect_recv_io leak in smbd_negotiate() error path Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 46/61] mptcp: pm: nl: announce deny-join-id0 flag Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 47/61] selftests: mptcp: userspace pm: validate " Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 48/61] phy: broadcom: ns-usb3: fix Wvoid-pointer-to-enum-cast warning Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 49/61] phy: Use device_get_match_data() Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 50/61] phy: ti: omap-usb2: fix device leak at unbind Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 51/61] xhci: dbc: decouple endpoint allocation from initialization Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 52/61] xhci: dbc: Fix full DbC transfer ring after several reconnects Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 53/61] mptcp: propagate shutdown to subflows when possible Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 54/61] selftests: mptcp: connect: catch IO errors on listen side Greg Kroah-Hartman
2025-09-30 15:30   ` Kenta Akagi
2025-10-01  7:56     ` Matthieu Baerts
2025-10-01 15:24       ` Kenta Akagi
2025-10-01 16:43       ` Kenta Akagi
2025-10-01 17:09         ` Matthieu Baerts
2025-10-02 16:06           ` Kenta Akagi
2025-09-22 19:29 ` [PATCH 6.1 55/61] net: rfkill: gpio: add DT support Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 56/61] net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 57/61] ASoC: qcom: q6apm-lpass-dai: close graphs before opening a new one Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 58/61] ASoC: q6apm-lpass-dai: close graph on prepare errors Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 59/61] ASoC: qcom: q6apm-lpass-dais: Fix NULL pointer dereference if source graph failed Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 60/61] crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES Greg Kroah-Hartman
2025-09-22 19:29 ` [PATCH 6.1 61/61] crypto: af_alg - Disallow concurrent writes in af_alg_sendmsg Greg Kroah-Hartman
2025-09-22 22:43 ` [PATCH 6.1 00/61] 6.1.154-rc1 review Florian Fainelli
2025-09-23  7:27 ` Brett A C Sheffield
2025-09-23 10:02 ` [PATCH 6.1 00/61] " Peter Schneider
2025-09-23 10:30 ` Naresh Kamboju
2025-09-23 13:06 ` Jon Hunter
2025-09-23 13:12 ` Mark Brown
2025-09-23 15:16 ` Ron Economos
2025-09-23 20:36 ` Miguel Ojeda
2025-09-24  0:33 ` Shuah Khan
2025-09-24  6:56 ` Hardik Garg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250922192403.694104160@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=chenridong@huawei.com \
    --cc=gaoyingjie@uniontech.com \
    --cc=patches@lists.linux.dev \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox