All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev, Gao Yingjie <gaoyingjie@uniontech.com>,
	Chen Ridong <chenridong@huawei.com>, Teju Heo <tj@kernel.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.4 34/81] cgroup: split cgroup_destroy_wq into 3 workqueues
Date: Tue, 30 Sep 2025 16:46:36 +0200	[thread overview]
Message-ID: <20250930143821.094175377@linuxfoundation.org> (raw)
In-Reply-To: <20250930143819.654157320@linuxfoundation.org>

5.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Chen Ridong <chenridong@huawei.com>

[ Upstream commit 79f919a89c9d06816dbdbbd168fa41d27411a7f9 ]

A hung task can occur during [1] LTP cgroup testing when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.

Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio

Call Trace:
	cgroup_lock_and_drain_offline+0x14c/0x1e8
	cgroup_destroy_root+0x3c/0x2c0
	css_free_rwork_fn+0x248/0x338
	process_one_work+0x16c/0x3b8
	worker_thread+0x22c/0x3b0
	kthread+0xec/0x100
	ret_from_fork+0x10/0x20

Root Cause:

CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
				// cgroup_destroy_wq
// kill all perf_event css
                                // one perf_event css A is dying
                                // css A offline enqueues cgroup_destroy_wq
                                // root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die

Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
   blocked behind root destruction in cgroup_destroy_wq (max_active=1)

Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation

This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.

[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Teju Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 801022a8899b5..b761d70eccbf4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -114,8 +114,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * of concurrent destructions.  Use a separate workqueue so that cgroup
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
+ *
+ * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
+ * 1. umount net_prio
+ * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
+ * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
+ * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
+ * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
+ *    which can never complete as it's behind in the same queue and
+ *    workqueue's max_active is 1.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5233,7 +5256,7 @@ static void css_release_work_fn(struct work_struct *work)
 	mutex_unlock(&cgroup_mutex);
 
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
-	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
 }
 
 static void css_release(struct percpu_ref *ref)
@@ -5242,7 +5265,7 @@ static void css_release(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	queue_work(cgroup_release_wq, &css->destroy_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5373,7 +5396,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
 err_free_css:
 	list_del_rcu(&css->rstat_css_node);
 	INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
-	queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+	queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
 	return ERR_PTR(err);
 }
 
@@ -5626,7 +5649,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
 		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		queue_work(cgroup_offline_wq, &css->destroy_work);
 	}
 }
 
@@ -6002,8 +6025,14 @@ static int __init cgroup_wq_init(void)
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
-	BUG_ON(!cgroup_destroy_wq);
+	cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+	BUG_ON(!cgroup_offline_wq);
+
+	cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+	BUG_ON(!cgroup_release_wq);
+
+	cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+	BUG_ON(!cgroup_free_wq);
 	return 0;
 }
 core_initcall(cgroup_wq_init);
-- 
2.51.0




  parent reply	other threads:[~2025-09-30 14:52 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-30 14:46 [PATCH 5.4 00/81] 5.4.300-rc1 review Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 01/81] usb: hub: Fix flushing of delayed work used for post resume purposes Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 02/81] net: Fix null-ptr-deref by sock_lock_init_class_and_name() and rmmod Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 03/81] NFSv4: Dont clear capabilities that wont be reset Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 04/81] tcp_bpf: Call sk_msg_free() when tcp_bpf_send_verdict() fails to allocate psock->cork Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 05/81] EDAC/altera: Delete an inappropriate dma_free_coherent() call Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 06/81] ocfs2: fix recursive semaphore deadlock in fiemap call Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 07/81] mtd: rawnand: stm32_fmc2: fix ECC overwrite Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 08/81] fuse: check if copy_file_range() returns larger than requested size Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 09/81] fuse: prevent overflow in copy_file_range return value Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 10/81] mm/khugepaged: fix the address passed to notifier on testing young Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 11/81] mtd: rawnand: stm32_fmc2: avoid overlapping mappings on ECC buffer Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 12/81] mtd: nand: raw: atmel: Fix comment in timings preparation Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 13/81] mtd: nand: raw: atmel: Respect tAR, tCLR in read setup timing Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 14/81] tty: hvc_console: Call hvc_kick in hvc_write unconditionally Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 15/81] USB: serial: option: add Telit Cinterion FN990A w/audio compositions Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 16/81] USB: serial: option: add Telit Cinterion LE910C4-WWX new compositions Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 17/81] net: fec: Fix possible NPD in fec_enet_phy_reset_after_clk_enable() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 18/81] igb: fix link test skipping when interface is admin down Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 19/81] genirq/affinity: Add irq_update_affinity_desc() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 20/81] genirq: Export affinity setter for modules Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 21/81] genirq: Provide new interfaces for affinity hints Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 22/81] i40e: Use irq_update_affinity_hint() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 23/81] i40e: fix IRQ freeing in i40e_vsi_request_irq_msix error path Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 24/81] can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 25/81] can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 26/81] dmaengine: ti: edma: Fix memory allocation size for queue_priority_map Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 27/81] dmaengine: qcom: bam_dma: Fix DT error handling for num-channels/ees Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 28/81] phy: ti-pipe3: fix device leak at unbind Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 29/81] soc: qcom: mdt_loader: Deal with zero e_shentsize Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 30/81] mm/memory-failure: fix VM_BUG_ON_PAGE(PagePoisoned(page)) when unpoison memory Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 31/81] ALSA: firewire-motu: drop EPOLLOUT from poll return values as write is not supported Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 32/81] wifi: mac80211: fix incorrect type for ret Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 33/81] pcmcia: omap_cf: Mark driver struct with __refdata to prevent section mismatch Greg Kroah-Hartman
2025-09-30 14:46 ` Greg Kroah-Hartman [this message]
2025-09-30 14:46 ` [PATCH 5.4 35/81] net: natsemi: fix `rx_dropped` double accounting on `netif_rx()` failure Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 36/81] i40e: remove redundant memory barrier when cleaning Tx descs Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 37/81] tcp: Clear tcp_sk(sk)->fastopen_rsk in tcp_disconnect() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 38/81] Revert "net/mlx5e: Update and set Xon/Xoff upon port speed set" Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 39/81] net: liquidio: fix overflow in octeon_init_instr_queue() Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 40/81] cnic: Fix use-after-free bugs in cnic_delete_task Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 41/81] nilfs2: fix CFI failure when accessing /sys/fs/nilfs2/features/* Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 42/81] power: supply: bq27xxx: fix error return in case of no bq27000 hdq battery Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 43/81] power: supply: bq27xxx: restrict no-battery detection to bq27000 Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 44/81] mmc: mvsdio: Fix dma_unmap_sg() nents value Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 45/81] rds: ib: Increment i_fastreg_wrs before bailing out Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 46/81] ASoC: wm8940: Correct typo in control name Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 47/81] ASoC: wm8974: Correct PLL rate rounding Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 48/81] ASoC: SOF: Intel: hda-stream: Fix incorrect variable used in error message Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 49/81] usb: gadget: dummy_hcd: remove usage of list iterator past the loop body Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 50/81] USB: gadget: dummy-hcd: Fix locking bug in RT-enabled kernels Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 51/81] serial: sc16is7xx: fix bug in flow control levels init Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 52/81] net: rfkill: gpio: add DT support Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 53/81] net: rfkill: gpio: Fix crash due to dereferencering uninitialized pointer Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 54/81] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 55/81] ALSA: usb-audio: Fix block comments in mixer_quirks Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 56/81] ALSA: usb-audio: Avoid multiple assignments " Greg Kroah-Hartman
2025-09-30 14:46 ` [PATCH 5.4 57/81] ALSA: usb-audio: Simplify NULL comparison " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 58/81] ALSA: usb-audio: Remove unneeded wmb() " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 59/81] ALSA: usb-audio: Add mixer quirk for Sony DualSense PS5 Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 60/81] ALSA: usb-audio: Convert comma to semicolon Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 61/81] ALSA: usb-audio: Fix build with CONFIG_INPUT=n Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 62/81] usb: core: Add 0x prefix to quirks debug output Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 63/81] IB/mlx5: Fix obj_type mismatch for SRQ event subscriptions Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 64/81] can: rcar_can: rcar_can_resume(): fix s2ram with PSCI Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 65/81] can: hi311x: populate ndo_change_mtu() to prevent buffer overflow Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 66/81] can: sun4i_can: " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 67/81] can: mcba_usb: " Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 68/81] can: peak_usb: fix shift-out-of-bounds issue Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 69/81] drm/gma500: Fix null dereference in hdmi teardown Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 70/81] i40e: fix idx validation in i40e_validate_queue_map Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 71/81] i40e: fix input validation logic for action_meta Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 72/81] i40e: add max boundary check for VF filters Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 73/81] fbcon: fix integer overflow in fbcon_do_set_font Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 74/81] fbcon: Fix OOB access in font allocation Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 75/81] mm/migrate_device: dont add folio to be freed to LRU in migrate_device_finalize() Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 76/81] i40e: increase max descriptors for XL710 Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 77/81] i40e: add validation for ring_len param Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 78/81] i40e: fix idx validation in config queues msg Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 79/81] i40e: fix validation of VF state in get resources Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 80/81] i40e: add mask to apply valid bits for itr_idx Greg Kroah-Hartman
2025-09-30 14:47 ` [PATCH 5.4 81/81] mm/hugetlb: fix folio is still mapped when deleted Greg Kroah-Hartman
2025-09-30 17:06 ` [PATCH 5.4 00/81] 5.4.300-rc1 review Florian Fainelli
2025-09-30 18:52 ` Brett A C Sheffield
2025-10-01  9:11 ` [PATCH 5.4 00/81] " Jon Hunter
2025-10-01 12:07 ` Naresh Kamboju
2025-10-01 13:37 ` [External] : " ALOK TIWARI
2025-10-01 16:21 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250930143821.094175377@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=chenridong@huawei.com \
    --cc=gaoyingjie@uniontech.com \
    --cc=patches@lists.linux.dev \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.