* [PATCH v4] cgroup: split cgroup_destroy_wq into 3 workqueues
@ 2025-08-18 3:43 Chen Ridong
2025-08-18 6:14 ` [PATCH v5] " Chen Ridong
0 siblings, 1 reply; 9+ messages in thread
From: Chen Ridong @ 2025-08-18 3:43 UTC (permalink / raw)
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, hdanton, gaoyingjie
From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related cases:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
                                // cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
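To make the ordering problem concrete, here is a minimal, self-contained sketch
of the same pattern (illustration only, not part of this patch; the demo_* names
are hypothetical). A max_active=1 workqueue executes its items strictly one at a
time, so an item that waits on another item queued behind it on the same
workqueue can never make progress:

/* Illustration only: demo_* identifiers are invented, not from cgroup.c. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *demo_destroy_wq; /* stands in for cgroup_destroy_wq */
static struct work_struct root_destroy_work;     /* stands in for root destruction  */
static struct work_struct css_offline_work;      /* stands in for css A offline     */
static DECLARE_COMPLETION(css_a_offlined);

static void css_offline_fn(struct work_struct *work)
{
        /* Never runs: root_destroy_fn occupies the only active slot. */
        complete(&css_a_offlined);
}

static void root_destroy_fn(struct work_struct *work)
{
        /*
         * Mirrors cgroup_lock_and_drain_offline(): wait for the "offline"
         * item, which is queued behind us on the same max_active=1 queue.
         */
        wait_for_completion(&css_a_offlined);   /* hangs; hung task warning follows */
}

static int __init demo_init(void)
{
        demo_destroy_wq = alloc_workqueue("demo_destroy", 0, 1);  /* max_active = 1 */
        if (!demo_destroy_wq)
                return -ENOMEM;

        INIT_WORK(&root_destroy_work, root_destroy_fn);
        INIT_WORK(&css_offline_work, css_offline_fn);

        /* Same enqueue order as in the race above, from the same CPU. */
        queue_work(demo_destroy_wq, &root_destroy_work);
        queue_work(demo_destroy_wq, &css_offline_work);
        return 0;
}
module_init(demo_init);
/* No module_exit(): the module intentionally wedges a kworker and stays loaded. */
MODULE_LICENSE("GPL");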
Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d8b82afed181..a3ce220b837f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -126,8 +126,14 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
*/
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
/* generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5559,7 +5565,7 @@ static void css_release_work_fn(struct work_struct *work)
cgroup_unlock();
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
}
static void css_release(struct percpu_ref *ref)
@@ -5568,7 +5574,7 @@ static void css_release(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_release_wq, &css->destroy_work);
}
static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5702,7 +5708,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
list_del_rcu(&css->sibling);
err_free_css:
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
return ERR_PTR(err);
}
@@ -5940,7 +5946,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
if (atomic_dec_and_test(&css->online_cnt)) {
INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_offline_wq, &css->destroy_work);
}
}
@@ -6326,8 +6332,14 @@ static int __init cgroup_wq_init(void)
* We would prefer to do this in cgroup_init() above, but that
* is called before init_workqueues(): so leave this until after.
*/
- cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
- BUG_ON(!cgroup_destroy_wq);
+ cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+ BUG_ON(!cgroup_offline_wq);
+
+ cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+ BUG_ON(!cgroup_release_wq);
+
+ cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+ BUG_ON(!cgroup_free_wq);
return 0;
}
core_initcall(cgroup_wq_init);
--
2.34.1
* [PATCH v5] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-18 3:43 [PATCH v4] cgroup: split cgroup_destroy_wq into 3 workqueues Chen Ridong
@ 2025-08-18 6:14 ` Chen Ridong
2025-08-18 6:42 ` Chen Ridong
2025-08-18 17:08 ` Tejun Heo
0 siblings, 2 replies; 9+ messages in thread
From: Chen Ridong @ 2025-08-18 6:14 UTC (permalink / raw)
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, hdanton, gaoyingjie
From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related cases:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
                                // cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
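As a counterpart to the hang above, the following minimal sketch (illustration
only, not the patch itself; the demo_* names are hypothetical) shows why
dedicated queues break the dependency: the waiting item ties up one
single-threaded workqueue while the item it waits for executes on another, so
the wait completes:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/printk.h>

static struct workqueue_struct *demo_free_wq;    /* plays cgroup_free_wq    */
static struct workqueue_struct *demo_offline_wq; /* plays cgroup_offline_wq */
static struct work_struct free_work, offline_work;
static DECLARE_COMPLETION(css_a_offlined);

static void offline_fn(struct work_struct *work)
{
        complete(&css_a_offlined);              /* "css A" goes offline */
}

static void free_fn(struct work_struct *work)
{
        /*
         * Like cgroup_destroy_root() draining offline css: the wait is now
         * safe because offline_fn runs on a different workqueue and is not
         * stuck behind this item.
         */
        wait_for_completion(&css_a_offlined);
        pr_info("demo: drain finished, no hang\n");
}

static int __init demo_init(void)
{
        demo_free_wq = alloc_workqueue("demo_free", 0, 1);
        demo_offline_wq = alloc_workqueue("demo_offline", 0, 1);
        if (!demo_free_wq || !demo_offline_wq)
                goto err;

        INIT_WORK(&free_work, free_fn);
        INIT_WORK(&offline_work, offline_fn);

        queue_work(demo_free_wq, &free_work);        /* "root destruction" */
        queue_work(demo_offline_wq, &offline_work);  /* "css A offline"    */
        return 0;
err:
        if (demo_free_wq)
                destroy_workqueue(demo_free_wq);
        if (demo_offline_wq)
                destroy_workqueue(demo_offline_wq);
        return -ENOMEM;
}

static void __exit demo_exit(void)
{
        destroy_workqueue(demo_free_wq);
        destroy_workqueue(demo_offline_wq);
}
module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");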
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 34 +++++++++++++++++++++++++++-------
1 file changed, 27 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..679dc216e3ed 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -126,8 +126,22 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
*/
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
/* generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5558,7 +5572,7 @@ static void css_release_work_fn(struct work_struct *work)
cgroup_unlock();
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
}
static void css_release(struct percpu_ref *ref)
@@ -5567,7 +5581,7 @@ static void css_release(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_release_wq, &css->destroy_work);
}
static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5701,7 +5715,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
list_del_rcu(&css->sibling);
err_free_css:
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
return ERR_PTR(err);
}
@@ -5939,7 +5953,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
if (atomic_dec_and_test(&css->online_cnt)) {
INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_offline_wq, &css->destroy_work);
}
}
@@ -6325,8 +6339,14 @@ static int __init cgroup_wq_init(void)
* We would prefer to do this in cgroup_init() above, but that
* is called before init_workqueues(): so leave this until after.
*/
- cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
- BUG_ON(!cgroup_destroy_wq);
+ cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+ BUG_ON(!cgroup_offline_wq);
+
+ cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+ BUG_ON(!cgroup_release_wq);
+
+ cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+ BUG_ON(!cgroup_free_wq);
return 0;
}
core_initcall(cgroup_wq_init);
--
2.34.1
* Re: [PATCH v5] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-18 6:14 ` [PATCH v5] " Chen Ridong
@ 2025-08-18 6:42 ` Chen Ridong
2025-08-18 17:08 ` Tejun Heo
1 sibling, 0 replies; 9+ messages in thread
From: Chen Ridong @ 2025-08-18 6:42 UTC (permalink / raw)
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, hdanton, gaoyingjie
On 2025/8/18 14:14, Chen Ridong wrote:
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..679dc216e3ed 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -126,8 +126,22 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
> * of concurrent destructions. Use a separate workqueue so that cgroup
> * destruction work items don't end up filling up max_active of system_wq
> * which may lead to deadlock.
> + *
> + * A cgroup destruction should enqueue work sequentially to:
> + * cgroup_offline_wq: use for css offline work
> + * cgroup_release_wq: use for css release work
> + * cgroup_free_wq: use for free work
> + *
> + * Rationale for using separate workqueues:
> + * The cgroup root free work may depend on completion of other css offline
> + * operations. If all tasks were enqueued to a single workqueue, this could
> + * create a deadlock scenario where:
> + * - Free work waits for other css offline work to complete.
> + * - But other css offline work is queued after free work in the same queue.
> */
More comments have been added in v5 to clarify why the destroy work is split into 3 workqueues.
--
Best regards,
Ridong
* Re: [PATCH v5] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-18 6:14 ` [PATCH v5] " Chen Ridong
2025-08-18 6:42 ` Chen Ridong
@ 2025-08-18 17:08 ` Tejun Heo
2025-08-19 1:07 ` [PATCH v6] " Chen Ridong
2025-08-19 1:56 ` [PATCH v5] " Chen Ridong
1 sibling, 2 replies; 9+ messages in thread
From: Tejun Heo @ 2025-08-18 17:08 UTC (permalink / raw)
To: Chen Ridong
Cc: hannes, mkoutny, lizefan, cgroups, linux-kernel, lujialin4,
chenridong, hdanton, gaoyingjie
Hello,
On Mon, Aug 18, 2025 at 06:14:35AM +0000, Chen Ridong wrote:
...
> + * Rationale for using separate workqueues:
> + * The cgroup root free work may depend on completion of other css offline
> + * operations. If all tasks were enqueued to a single workqueue, this could
> + * create a deadlock scenario where:
> + * - Free work waits for other css offline work to complete.
> + * - But other css offline work is queued after free work in the same queue.
Can you please refer to the concrete example too?
Thanks.
--
tejun
* [PATCH v6] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-18 17:08 ` Tejun Heo
@ 2025-08-19 1:07 ` Chen Ridong
2025-08-22 0:44 ` Chen Ridong
2025-08-22 17:45 ` Tejun Heo
2025-08-19 1:56 ` [PATCH v5] " Chen Ridong
1 sibling, 2 replies; 9+ messages in thread
From: Chen Ridong @ 2025-08-19 1:07 UTC (permalink / raw)
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, hdanton, gaoyingjie
From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related cases:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0                            CPU1
mount perf_event                umount net_prio
cgroup1_get_tree                cgroup_kill_sb
rebind_subsystems               // root destruction enqueues
                                // cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
                                css_free_rwork_fn
                                cgroup_destroy_root
                                cgroup_lock_and_drain_offline
                                // some perf descendants are dying
                                // cgroup_destroy_wq max_active = 1
                                // waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
Solution:
Split cgroup_destroy_wq into three dedicated workqueues:
cgroup_offline_wq – Handles CSS offline operations
cgroup_release_wq – Manages resource release
cgroup_free_wq – Performs final memory deallocation
This separation eliminates blocking in the CSS free path while waiting for
offline operations to complete.
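For completeness, the sketch below (hypothetical stage_* names; not taken from
cgroup.c) illustrates how the per-css offline -> release -> free ordering
survives the split. In the real code the offline-to-release hand-off goes
through the css refcount rather than a direct queue_work(), but the principle
is the same: each stage hands off to the next stage's dedicated queue, so
sequencing comes from the chaining, not from sharing a single workqueue:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/printk.h>

static struct workqueue_struct *stage_offline_wq; /* plays cgroup_offline_wq */
static struct workqueue_struct *stage_release_wq; /* plays cgroup_release_wq */
static struct workqueue_struct *stage_free_wq;    /* plays cgroup_free_wq    */
static struct work_struct stage_work;             /* reused like css->destroy_work */

static void free_stage(struct work_struct *work)
{
        pr_info("demo: free stage (last)\n");
}

static void release_stage(struct work_struct *work)
{
        pr_info("demo: release stage\n");
        INIT_WORK(work, free_stage);             /* safe: item is no longer pending */
        queue_work(stage_free_wq, work);
}

static void offline_stage(struct work_struct *work)
{
        pr_info("demo: offline stage\n");
        INIT_WORK(work, release_stage);
        queue_work(stage_release_wq, work);
}

static int __init demo_init(void)
{
        stage_offline_wq = alloc_workqueue("demo_offline", 0, 1);
        stage_release_wq = alloc_workqueue("demo_release", 0, 1);
        stage_free_wq = alloc_workqueue("demo_free", 0, 1);
        if (!stage_offline_wq || !stage_release_wq || !stage_free_wq)
                goto err;

        INIT_WORK(&stage_work, offline_stage);
        queue_work(stage_offline_wq, &stage_work); /* stages then run in order */
        return 0;
err:
        if (stage_offline_wq)
                destroy_workqueue(stage_offline_wq);
        if (stage_release_wq)
                destroy_workqueue(stage_release_wq);
        if (stage_free_wq)
                destroy_workqueue(stage_free_wq);
        return -ENOMEM;
}

static void __exit demo_exit(void)
{
        destroy_workqueue(stage_offline_wq);
        destroy_workqueue(stage_release_wq);
        destroy_workqueue(stage_free_wq);
}
module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");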
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Suggested-by: Tejun Heo <tj@kernel.org>
---
kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
1 file changed, 36 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..79b1d79f86a3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -126,8 +126,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
+ *
+ * A cgroup destruction should enqueue work sequentially to:
+ * cgroup_offline_wq: use for css offline work
+ * cgroup_release_wq: use for css release work
+ * cgroup_free_wq: use for free work
+ *
+ * Rationale for using separate workqueues:
+ * The cgroup root free work may depend on completion of other css offline
+ * operations. If all tasks were enqueued to a single workqueue, this could
+ * create a deadlock scenario where:
+ * - Free work waits for other css offline work to complete.
+ * - But other css offline work is queued after free work in the same queue.
+ *
+ * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
+ * 1. umount net_prio
+ * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
+ * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
+ * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
+ * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
+ * which can never complete as it's behind in the same queue and
+ * workqueue's max_active is 1.
*/
-static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_offline_wq;
+static struct workqueue_struct *cgroup_release_wq;
+static struct workqueue_struct *cgroup_free_wq;
/* generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
@@ -5558,7 +5581,7 @@ static void css_release_work_fn(struct work_struct *work)
cgroup_unlock();
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
}
static void css_release(struct percpu_ref *ref)
@@ -5567,7 +5590,7 @@ static void css_release(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_release_wq, &css->destroy_work);
}
static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5701,7 +5724,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
list_del_rcu(&css->sibling);
err_free_css:
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
- queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
+ queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
return ERR_PTR(err);
}
@@ -5939,7 +5962,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
if (atomic_dec_and_test(&css->online_cnt)) {
INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ queue_work(cgroup_offline_wq, &css->destroy_work);
}
}
@@ -6325,8 +6348,14 @@ static int __init cgroup_wq_init(void)
* We would prefer to do this in cgroup_init() above, but that
* is called before init_workqueues(): so leave this until after.
*/
- cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
- BUG_ON(!cgroup_destroy_wq);
+ cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
+ BUG_ON(!cgroup_offline_wq);
+
+ cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
+ BUG_ON(!cgroup_release_wq);
+
+ cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
+ BUG_ON(!cgroup_free_wq);
return 0;
}
core_initcall(cgroup_wq_init);
--
2.34.1
* Re: [PATCH v5] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-18 17:08 ` Tejun Heo
2025-08-19 1:07 ` [PATCH v6] " Chen Ridong
@ 2025-08-19 1:56 ` Chen Ridong
1 sibling, 0 replies; 9+ messages in thread
From: Chen Ridong @ 2025-08-19 1:56 UTC (permalink / raw)
To: Tejun Heo
Cc: hannes, mkoutny, lizefan, cgroups, linux-kernel, lujialin4,
chenridong, hdanton, gaoyingjie
On 2025/8/19 1:08, Tejun Heo wrote:
> Hello,
>
> On Mon, Aug 18, 2025 at 06:14:35AM +0000, Chen Ridong wrote:
> ...
>> + * Rationale for using separate workqueues:
>> + * The cgroup root free work may depend on completion of other css offline
>> + * operations. If all tasks were enqueued to a single workqueue, this could
>> + * create a deadlock scenario where:
>> + * - Free work waits for other css offline work to complete.
>> + * - But other css offline work is queued after free work in the same queue.
>
> Can you please refer to the concrete example too?
>
> Thanks.
>
Sure, the comment has been updated in v6. Please let me know if you have any feedback.
--
Best regards,
Ridong
* Re: [PATCH v6] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-19 1:07 ` [PATCH v6] " Chen Ridong
@ 2025-08-22 0:44 ` Chen Ridong
2025-08-22 17:45 ` Tejun Heo
1 sibling, 0 replies; 9+ messages in thread
From: Chen Ridong @ 2025-08-22 0:44 UTC (permalink / raw)
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, hdanton, gaoyingjie
On 2025/8/19 9:07, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A hung task can occur during [1] LTP cgroup testing when repeatedly
> mounting/unmounting perf_event and net_prio controllers with
> systemd.unified_cgroup_hierarchy=1. The hang manifests in
> cgroup_lock_and_drain_offline() during root destruction.
>
> Related case:
> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
>
> Call Trace:
> cgroup_lock_and_drain_offline+0x14c/0x1e8
> cgroup_destroy_root+0x3c/0x2c0
> css_free_rwork_fn+0x248/0x338
> process_one_work+0x16c/0x3b8
> worker_thread+0x22c/0x3b0
> kthread+0xec/0x100
> ret_from_fork+0x10/0x20
>
> Root Cause:
>
> CPU0                            CPU1
> mount perf_event                umount net_prio
> cgroup1_get_tree                cgroup_kill_sb
> rebind_subsystems               // root destruction enqueues
>                                 // cgroup_destroy_wq
> // kill all perf_event css
> // one perf_event css A is dying
> // css A offline enqueues cgroup_destroy_wq
> // root destruction will be executed first
>                                 css_free_rwork_fn
>                                 cgroup_destroy_root
>                                 cgroup_lock_and_drain_offline
>                                 // some perf descendants are dying
>                                 // cgroup_destroy_wq max_active = 1
>                                 // waiting for css A to die
>
> Problem scenario:
> 1. CPU0 mounts perf_event (rebind_subsystems)
> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> 3. A dying perf_event CSS gets queued for offline after root destruction
> 4. Root destruction waits for offline completion, but offline work is
> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>
> Solution:
> Split cgroup_destroy_wq into three dedicated workqueues:
> cgroup_offline_wq – Handles CSS offline operations
> cgroup_release_wq – Manages resource release
> cgroup_free_wq – Performs final memory deallocation
>
> This separation eliminates blocking in the CSS free path while waiting for
> offline operations to complete.
>
> [1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
> Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
> Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> Suggested-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/cgroup/cgroup.c | 43 +++++++++++++++++++++++++++++++++++-------
> 1 file changed, 36 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 312c6a8b55bb..79b1d79f86a3 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -126,8 +126,31 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
> * of concurrent destructions. Use a separate workqueue so that cgroup
> * destruction work items don't end up filling up max_active of system_wq
> * which may lead to deadlock.
> + *
> + * A cgroup destruction should enqueue work sequentially to:
> + * cgroup_offline_wq: use for css offline work
> + * cgroup_release_wq: use for css release work
> + * cgroup_free_wq: use for free work
> + *
> + * Rationale for using separate workqueues:
> + * The cgroup root free work may depend on completion of other css offline
> + * operations. If all tasks were enqueued to a single workqueue, this could
> + * create a deadlock scenario where:
> + * - Free work waits for other css offline work to complete.
> + * - But other css offline work is queued after free work in the same queue.
> + *
> + * Example deadlock scenario with single workqueue (cgroup_destroy_wq):
> + * 1. umount net_prio
> + * 2. net_prio root destruction enqueues work to cgroup_destroy_wq (CPUx)
> + * 3. perf_event CSS A offline enqueues work to same cgroup_destroy_wq (CPUx)
> + * 4. net_prio cgroup_destroy_root->cgroup_lock_and_drain_offline.
> + * 5. net_prio root destruction blocks waiting for perf_event CSS A offline,
> + * which can never complete as it's behind in the same queue and
> + * workqueue's max_active is 1.
> */
> -static struct workqueue_struct *cgroup_destroy_wq;
> +static struct workqueue_struct *cgroup_offline_wq;
> +static struct workqueue_struct *cgroup_release_wq;
> +static struct workqueue_struct *cgroup_free_wq;
>
> /* generate an array of cgroup subsystem pointers */
> #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
> @@ -5558,7 +5581,7 @@ static void css_release_work_fn(struct work_struct *work)
> cgroup_unlock();
>
> INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
> - queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> + queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
> }
>
> static void css_release(struct percpu_ref *ref)
> @@ -5567,7 +5590,7 @@ static void css_release(struct percpu_ref *ref)
> container_of(ref, struct cgroup_subsys_state, refcnt);
>
> INIT_WORK(&css->destroy_work, css_release_work_fn);
> - queue_work(cgroup_destroy_wq, &css->destroy_work);
> + queue_work(cgroup_release_wq, &css->destroy_work);
> }
>
> static void init_and_link_css(struct cgroup_subsys_state *css,
> @@ -5701,7 +5724,7 @@ static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
> list_del_rcu(&css->sibling);
> err_free_css:
> INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
> - queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
> + queue_rcu_work(cgroup_free_wq, &css->destroy_rwork);
> return ERR_PTR(err);
> }
>
> @@ -5939,7 +5962,7 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
>
> if (atomic_dec_and_test(&css->online_cnt)) {
> INIT_WORK(&css->destroy_work, css_killed_work_fn);
> - queue_work(cgroup_destroy_wq, &css->destroy_work);
> + queue_work(cgroup_offline_wq, &css->destroy_work);
> }
> }
>
> @@ -6325,8 +6348,14 @@ static int __init cgroup_wq_init(void)
> * We would prefer to do this in cgroup_init() above, but that
> * is called before init_workqueues(): so leave this until after.
> */
> - cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
> - BUG_ON(!cgroup_destroy_wq);
> + cgroup_offline_wq = alloc_workqueue("cgroup_offline", 0, 1);
> + BUG_ON(!cgroup_offline_wq);
> +
> + cgroup_release_wq = alloc_workqueue("cgroup_release", 0, 1);
> + BUG_ON(!cgroup_release_wq);
> +
> + cgroup_free_wq = alloc_workqueue("cgroup_free", 0, 1);
> + BUG_ON(!cgroup_free_wq);
> return 0;
> }
> core_initcall(cgroup_wq_init);
Hi, Tj,
Just checking in on the v6 I sent here. I am not sure whether you missed it.
--
Best regards,
Ridong
* Re: [PATCH v6] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-19 1:07 ` [PATCH v6] " Chen Ridong
2025-08-22 0:44 ` Chen Ridong
@ 2025-08-22 17:45 ` Tejun Heo
2025-08-23 0:23 ` Chen Ridong
1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2025-08-22 17:45 UTC (permalink / raw)
To: Chen Ridong
Cc: hannes, mkoutny, lizefan, cgroups, linux-kernel, lujialin4,
chenridong, hdanton, gaoyingjie
On Tue, Aug 19, 2025 at 01:07:24AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A hung task can occur during [1] LTP cgroup testing when repeatedly
> mounting/unmounting perf_event and net_prio controllers with
> systemd.unified_cgroup_hierarchy=1. The hang manifests in
> cgroup_lock_and_drain_offline() during root destruction.
>
> Related case:
> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
>
> Call Trace:
> cgroup_lock_and_drain_offline+0x14c/0x1e8
> cgroup_destroy_root+0x3c/0x2c0
> css_free_rwork_fn+0x248/0x338
> process_one_work+0x16c/0x3b8
> worker_thread+0x22c/0x3b0
> kthread+0xec/0x100
> ret_from_fork+0x10/0x20
>
> Root Cause:
>
> CPU0                            CPU1
> mount perf_event                umount net_prio
> cgroup1_get_tree                cgroup_kill_sb
> rebind_subsystems               // root destruction enqueues
>                                 // cgroup_destroy_wq
> // kill all perf_event css
> // one perf_event css A is dying
> // css A offline enqueues cgroup_destroy_wq
> // root destruction will be executed first
>                                 css_free_rwork_fn
>                                 cgroup_destroy_root
>                                 cgroup_lock_and_drain_offline
>                                 // some perf descendants are dying
>                                 // cgroup_destroy_wq max_active = 1
>                                 // waiting for css A to die
>
> Problem scenario:
> 1. CPU0 mounts perf_event (rebind_subsystems)
> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> 3. A dying perf_event CSS gets queued for offline after root destruction
> 4. Root destruction waits for offline completion, but offline work is
> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>
> Solution:
> Split cgroup_destroy_wq into three dedicated workqueues:
> cgroup_offline_wq – Handles CSS offline operations
> cgroup_release_wq – Manages resource release
> cgroup_free_wq – Performs final memory deallocation
>
> This separation eliminates blocking in the CSS free path while waiting for
> offline operations to complete.
>
> [1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
> Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
> Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> Suggested-by: Tejun Heo <tj@kernel.org>
Applied to cgroup/for-6.17-fixes. Sorry about the delay. I missed the
thread.
Thanks.
--
tejun
* Re: [PATCH v6] cgroup: split cgroup_destroy_wq into 3 workqueues
2025-08-22 17:45 ` Tejun Heo
@ 2025-08-23 0:23 ` Chen Ridong
0 siblings, 0 replies; 9+ messages in thread
From: Chen Ridong @ 2025-08-23 0:23 UTC (permalink / raw)
To: Tejun Heo
Cc: hannes, mkoutny, lizefan, cgroups, linux-kernel, lujialin4,
chenridong, hdanton, gaoyingjie
On 2025/8/23 1:45, Tejun Heo wrote:
> On Tue, Aug 19, 2025 at 01:07:24AM +0000, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> A hung task can occur during [1] LTP cgroup testing when repeatedly
>> mounting/unmounting perf_event and net_prio controllers with
>> systemd.unified_cgroup_hierarchy=1. The hang manifests in
>> cgroup_lock_and_drain_offline() during root destruction.
>>
>> Related case:
>> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
>> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
>>
>> Call Trace:
>> cgroup_lock_and_drain_offline+0x14c/0x1e8
>> cgroup_destroy_root+0x3c/0x2c0
>> css_free_rwork_fn+0x248/0x338
>> process_one_work+0x16c/0x3b8
>> worker_thread+0x22c/0x3b0
>> kthread+0xec/0x100
>> ret_from_fork+0x10/0x20
>>
>> Root Cause:
>>
>> CPU0                            CPU1
>> mount perf_event                umount net_prio
>> cgroup1_get_tree                cgroup_kill_sb
>> rebind_subsystems               // root destruction enqueues
>>                                 // cgroup_destroy_wq
>> // kill all perf_event css
>> // one perf_event css A is dying
>> // css A offline enqueues cgroup_destroy_wq
>> // root destruction will be executed first
>>                                 css_free_rwork_fn
>>                                 cgroup_destroy_root
>>                                 cgroup_lock_and_drain_offline
>>                                 // some perf descendants are dying
>>                                 // cgroup_destroy_wq max_active = 1
>>                                 // waiting for css A to die
>>
>> Problem scenario:
>> 1. CPU0 mounts perf_event (rebind_subsystems)
>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>> 3. A dying perf_event CSS gets queued for offline after root destruction
>> 4. Root destruction waits for offline completion, but offline work is
>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>
>> Solution:
>> Split cgroup_destroy_wq into three dedicated workqueues:
>> cgroup_offline_wq – Handles CSS offline operations
>> cgroup_release_wq – Manages resource release
>> cgroup_free_wq – Performs final memory deallocation
>>
>> This separation eliminates blocking in the CSS free path while waiting for
>> offline operations to complete.
>>
>> [1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
>> Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
>> Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>> Suggested-by: Tejun Heo <tj@kernel.org>
>
> Applied to cgroup/for-6.17-fixes. Sorry about the delay. I missed the
> thread.
>
> Thanks.
>
Thanks
--
Best regards,
Ridong