* [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-07-22 11:27 UTC
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, gaoyingjie
From: Chen Ridong <chenridong@huawei.com>
A hung task can occur during LTP cgroup testing [1] when repeatedly
mounting/unmounting the perf_event and net_prio controllers with
systemd.unified_cgroup_hierarchy=1. The hang manifests in
cgroup_lock_and_drain_offline() during root destruction.
Related case:
cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
Call Trace:
cgroup_lock_and_drain_offline+0x14c/0x1e8
cgroup_destroy_root+0x3c/0x2c0
css_free_rwork_fn+0x248/0x338
process_one_work+0x16c/0x3b8
worker_thread+0x22c/0x3b0
kthread+0xec/0x100
ret_from_fork+0x10/0x20
Root Cause:
CPU0 CPU1
mount perf_event umount net_prio
cgroup1_get_tree cgroup_kill_sb
rebind_subsystems // root destruction enqueues
// cgroup_destroy_wq
// kill all perf_event css
// one perf_event css A is dying
// css A offline enqueues cgroup_destroy_wq
// root destruction will be executed first
css_free_rwork_fn
cgroup_destroy_root
cgroup_lock_and_drain_offline
// some perf descendants are dying
// cgroup_destroy_wq max_active = 1
// waiting for css A to die
Problem scenario:
1. CPU0 mounts perf_event (rebind_subsystems)
2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
3. A dying perf_event CSS gets queued for offline after root destruction
4. Root destruction waits for offline completion, but offline work is
blocked behind root destruction in cgroup_destroy_wq (max_active=1)
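For illustration only (not part of this patch): the hang reduces to an ordered
workqueue whose in-flight work item sleeps waiting for an event that only a
later item on the same queue can deliver. Below is a minimal, self-contained
sketch of that pattern as a hypothetical module; all names are made up and
nothing here is cgroup code.

/* Illustrative only: hypothetical module, all names made up. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *demo_wq;
static DECLARE_COMPLETION(demo_done);

/* Plays the role of the css A offline work: it would deliver the wake-up. */
static void signaller_fn(struct work_struct *work)
{
        complete(&demo_done);                   /* never runs: stuck behind waiter_work */
}
static DECLARE_WORK(signaller_work, signaller_fn);

/* Plays the role of root destruction: it waits while on the workqueue. */
static void waiter_fn(struct work_struct *work)
{
        wait_for_completion(&demo_done);        /* hangs forever */
}
static DECLARE_WORK(waiter_work, waiter_fn);

static int __init demo_init(void)
{
        /*
         * Same shape as cgroup_destroy_wq: max_active == 1, so items queued
         * from the same CPU run strictly one at a time.
         */
        demo_wq = alloc_workqueue("demo_destroy", 0, 1);
        if (!demo_wq)
                return -ENOMEM;
        queue_work(demo_wq, &waiter_work);      /* "root destruction" first */
        queue_work(demo_wq, &signaller_work);   /* "css A offline" second */
        return 0;
}
module_init(demo_init);

static void __exit demo_exit(void)
{
        destroy_workqueue(demo_wq);
}
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Moving the wait out of the workqueue context, as this patch does with the
drain, removes exactly this self-dependency.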
Solution:
Move cgroup_lock_and_drain_offline() to the start of unmount operations.
This ensures:
1. cgroup_lock_and_drain_offline() will not be called within
cgroup_destroy_wq context.
2. No new dying csses for the subsystem being unmounted can appear in
cgrp_dfl_root between unmount start and subsystem rebinding.
[1] https://github.com/linux-test-project/ltp/blob/master/runtest/controllers
Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends")
Reported-by: Gao Yingjie <gaoyingjie@uniontech.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
kernel/cgroup/cgroup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 312c6a8b55bb..af81a90f8c92 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,8 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
trace_cgroup_destroy_root(root);
- cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
-
+ cgroup_lock();
BUG_ON(atomic_read(&root->nr_cgrps));
BUG_ON(!list_empty(&cgrp->self.children));
@@ -2336,6 +2335,8 @@ static void cgroup_kill_sb(struct super_block *sb)
*
* And don't kill the default root.
*/
+ cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
+ cgroup_unlock();
if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
!percpu_ref_is_dying(&root->cgrp.self.refcnt))
percpu_ref_kill(&root->cgrp.self.refcnt);
--
2.34.1
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Michal Koutný @ 2025-07-24 13:35 UTC
To: Chen Ridong
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
Hi Ridong.
On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> CPU0 CPU1
> mount perf_event umount net_prio
> cgroup1_get_tree cgroup_kill_sb
> rebind_subsystems // root destruction enqueues
> // cgroup_destroy_wq
> // kill all perf_event css
> // one perf_event css A is dying
> // css A offline enqueues cgroup_destroy_wq
> // root destruction will be executed first
> css_free_rwork_fn
> cgroup_destroy_root
> cgroup_lock_and_drain_offline
> // some perf descendants are dying
> // cgroup_destroy_wq max_active = 1
> // waiting for css A to die
>
> Problem scenario:
> 1. CPU0 mounts perf_event (rebind_subsystems)
> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> 3. A dying perf_event CSS gets queued for offline after root destruction
> 4. Root destruction waits for offline completion, but offline work is
> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
What's concerning me is why umount of the net_prio hierarchy waits for
draining of the default hierarchy? (Where you then run into conflict with
perf_event, which is implicit_on_dfl.)
IOW why not this:
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
trace_cgroup_destroy_root(root);
- cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
+ cgroup_lock_and_drain_offline(cgrp);
BUG_ON(atomic_read(&root->nr_cgrps));
BUG_ON(!list_empty(&cgrp->self.children));
Does this correct the LTP scenario?
Thanks,
Michal
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-07-25 1:42 UTC
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/7/24 21:35, Michal Koutný wrote:
> Hi Ridong.
>
> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> CPU0 CPU1
>> mount perf_event umount net_prio
>> cgroup1_get_tree cgroup_kill_sb
>> rebind_subsystems // root destruction enqueues
>> // cgroup_destroy_wq
>> // kill all perf_event css
>> // one perf_event css A is dying
>> // css A offline enqueues cgroup_destroy_wq
>> // root destruction will be executed first
>> css_free_rwork_fn
>> cgroup_destroy_root
>> cgroup_lock_and_drain_offline
>> // some perf descendants are dying
>> // cgroup_destroy_wq max_active = 1
>> // waiting for css A to die
>>
>> Problem scenario:
>> 1. CPU0 mounts perf_event (rebind_subsystems)
>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>> 3. A dying perf_event CSS gets queued for offline after root destruction
>> 4. Root destruction waits for offline completion, but offline work is
>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>
> What's concerning me is why umount of net_prio hierarhy waits for
> draining of the default hierachy? (Where you then run into conflict with
> perf_event that's implicit_on_dfl.)
>
This was also my first reaction.
> IOW why not this:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>
> trace_cgroup_destroy_root(root);
>
> - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
> + cgroup_lock_and_drain_offline(cgrp);
>
> BUG_ON(atomic_read(&root->nr_cgrps));
> BUG_ON(!list_empty(&cgrp->self.children));
>
> Does this correct the LTP scenario?
>
> Thanks,
> Michal
I've tested this approach and discovered it can lead to another issue that required significant
investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
draining of the default hierarchy.
Consider this sequence:
mount net_prio umount perf_event
cgroup1_get_tree
// &cgrp_dfl_root.cgrp
cgroup_lock_and_drain_offline
// wait for all perf_event csses dead
prepare_to_wait(&dsct->offline_waitq)
schedule();
cgroup_destroy_root
// &root->cgrp, not cgrp_dfl_root
cgroup_lock_and_drain_offline
rebind_subsystems
rcu_assign_pointer(dcgrp->subsys[ssid], css);
dst_root->subsys_mask |= 1 << ssid;
cgroup_propagate_control
// enable cgrp_dfl_root perf_event css
cgroup_apply_control_enable
css = cgroup_css(dsct, ss);
// since we drain root->cgrp not cgrp_dfl_root
// css(dying) is not null on the cgrp_dfl_root
// we won't create css, but the css is dying
// got the offline_waitq wake up
goto restart;
// some perf_event dying csses are online now
prepare_to_wait(&dsct->offline_waitq)
schedule();
// never get the offline_waitq wake up
I encountered two main issues:
1. Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root.
2. Potential hangs during cgrp_dfl_root draining in the mounting process.
I have tried calling cgroup_lock_and_drain_offline() with the subtree_ss_mask, and it seems that can
fix the issue I encountered. But I am not sure whether there are scenarios that [u]mount multiple
legacy subsystems at the same time.
I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that the
offline css work (which cgroup_destroy_root needs to drain) is never enqueued after
cgroup_destroy_root begins. How can we guarantee this ordering? Therefore, I propose moving the
draining operation outside of cgroup_destroy_wq as a more robust solution that completely
eliminates this potential race condition. This patch implements that approach.
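For context, cgroup_kill_sb() with this patch applied would look roughly as
below. This is a sketch assembled around the diff hunk in the patch; the
surrounding lines are paraphrased from memory and may differ slightly from
the actual tree:

static void cgroup_kill_sb(struct super_block *sb)
{
        struct kernfs_root *kf_root = kernfs_root_from_sb(sb);
        struct cgroup_root *root = cgroup_root_from_kf(kf_root);

        /*
         * Drain dying csses here, in the umount caller's context, so that
         * cgroup_destroy_root() running later on cgroup_destroy_wq never has
         * to wait for offline work queued behind it on the same workqueue.
         */
        cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
        cgroup_unlock();

        /*
         * If @root doesn't have any children, start killing it.
         * This prevents new mounts by disabling percpu_ref_tryget_live().
         *
         * And don't kill the default root.
         */
        if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
            !percpu_ref_is_dying(&root->cgrp.self.refcnt))
                percpu_ref_kill(&root->cgrp.self.refcnt);
        cgroup_put(&root->cgrp);
        kernfs_kill_sb(sb);
}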
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-07-25 1:48 UTC
To: tj, hannes, mkoutny, lizefan
Cc: cgroups, linux-kernel, lujialin4, chenridong, gaoyingjie
On 2025/7/22 19:27, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A hung task can occur during [1] LTP cgroup testing when repeatedly
> mounting/unmounting perf_event and net_prio controllers with
> systemd.unified_cgroup_hierarchy=1. The hang manifests in
> cgroup_lock_and_drain_offline() during root destruction.
>
> Related case:
> cgroup_fj_function_perf_event cgroup_fj_function.sh perf_event
> cgroup_fj_function_net_prio cgroup_fj_function.sh net_prio
This can be easily reproduced with the program below (offered by Yingjie):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

#define LOOPS 10000
#define TMPDIR "/tmp"
#define CGROUP_BASE "ltp_cgtest"
#define CONTROLLERS_COUNT 2

const char *controllers[CONTROLLERS_COUNT] = {"perf_event", "net_prio"};

void safe_mkdir(const char *path) {
    if (mkdir(path, 0777) == -1 && errno != EEXIST) {
        fprintf(stderr, "mkdir(%s) failed: %s\n", path, strerror(errno));
        exit(1);
    }
}

void safe_rmdir(const char *path) {
    if (rmdir(path) == -1 && errno != ENOENT) {
        fprintf(stderr, "rmdir(%s) failed: %s\n", path, strerror(errno));
    }
}

int is_mounted(const char *mnt) {
    FILE *fp = fopen("/proc/mounts", "r");
    if (!fp) return 0;
    char line[512];
    int found = 0;
    while (fgets(line, sizeof(line), fp)) {
        if (strstr(line, mnt)) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}

int main(void) {
    if (getuid() != 0) {
        fprintf(stderr, "This program must be run as root\n");
        return 1;
    }
    FILE *fcg = fopen("/proc/cgroups", "r");
    if (!fcg) {
        fprintf(stderr, "Kernel does not support cgroups\n");
        return 1;
    }
    fclose(fcg);

    char mnt[256];
    for (int i = 1; i <= LOOPS; ++i) {
        for (int c = 0; c < CONTROLLERS_COUNT; ++c) {
            snprintf(mnt, sizeof(mnt), "%s/cgroup_%s", TMPDIR, controllers[c]);
            printf("=== Loop %d: %s ===\n", i, controllers[c]);
            fflush(stdout);
            safe_mkdir(mnt);
            if (!is_mounted(mnt)) {
                if (mount(controllers[c], mnt, "cgroup", 0, controllers[c]) == -1) {
                    fprintf(stderr, "Failed to mount cgroup v1 for %s at %s: %s\n",
                            controllers[c], mnt, strerror(errno));
                    safe_rmdir(mnt);
                    continue;
                }
            }
            if (umount2(mnt, 0) == -1) {
                fprintf(stderr, "umount(%s) failed: %s\n", mnt, strerror(errno));
            }
            safe_rmdir(mnt);
        }
    }
    return 0;
}
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Michal Koutný @ 2025-07-25 17:17 UTC
To: Chen Ridong
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> > On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> >> CPU0 CPU1
> >> mount perf_event umount net_prio
> >> cgroup1_get_tree cgroup_kill_sb
> >> rebind_subsystems // root destruction enqueues
> >> // cgroup_destroy_wq
> >> // kill all perf_event css
> >> // one perf_event css A is dying
> >> // css A offline enqueues cgroup_destroy_wq
> >> // root destruction will be executed first
> >> css_free_rwork_fn
> >> cgroup_destroy_root
> >> cgroup_lock_and_drain_offline
> >> // some perf descendants are dying
> >> // cgroup_destroy_wq max_active = 1
> >> // waiting for css A to die
> >>
> >> Problem scenario:
> >> 1. CPU0 mounts perf_event (rebind_subsystems)
> >> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> >> 3. A dying perf_event CSS gets queued for offline after root destruction
> >> 4. Root destruction waits for offline completion, but offline work is
> >> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
> >
> > What's concerning me is why umount of net_prio hierarhy waits for
> > draining of the default hierachy? (Where you then run into conflict with
> > perf_event that's implicit_on_dfl.)
> >
>
> This was also first respond.
>
> > IOW why not this:
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
> >
> > trace_cgroup_destroy_root(root);
> >
> > - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
> > + cgroup_lock_and_drain_offline(cgrp);
> >
> > BUG_ON(atomic_read(&root->nr_cgrps));
> > BUG_ON(!list_empty(&cgrp->self.children));
> >
> > Does this correct the LTP scenario?
> >
> > Thanks,
> > Michal
>
> I've tested this approach and discovered it can lead to another issue that required significant
> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
> draining of the default hierarchy.
>
> Consider this sequence:
>
> mount net_prio umount perf_event
> cgroup1_get_tree
> // &cgrp_dfl_root.cgrp
> cgroup_lock_and_drain_offline
> // wait for all perf_event csses dead
> prepare_to_wait(&dsct->offline_waitq)
> schedule();
> cgroup_destroy_root
> // &root->cgrp, not cgrp_dfl_root
> cgroup_lock_and_drain_offline
perf_event's css (offline but dying)
> rebind_subsystems
> rcu_assign_pointer(dcgrp->subsys[ssid], css);
> dst_root->subsys_mask |= 1 << ssid;
> cgroup_propagate_control
> // enable cgrp_dfl_root perf_event css
> cgroup_apply_control_enable
> css = cgroup_css(dsct, ss);
> // since we drain root->cgrp not cgrp_dfl_root
> // css(dying) is not null on the cgrp_dfl_root
> // we won't create css, but the css is dying
What would prevent seeing a dying css when
cgrp_dfl_root is drained?
(Or nothing drained as in the patch?)
I assume you've seen this warning from
cgroup_apply_control_enable
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
>
> // got the offline_waitq wake up
> goto restart;
> // some perf_event dying csses are online now
> prepare_to_wait(&dsct->offline_waitq)
> schedule();
> // never get the offline_waitq wake up
>
> I encountered two main issues:
> 1.Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
Is this really resolved by the patch? (The questions above.)
> 2.Potential hangs during cgrp_dfl_root draining in the mounting process
Fortunately, the typical use case (mounting at boot) wouldn't suffer
from this.
> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that
> offline css work(the cgroup_destroy_root need to drain) cannot be enqueued after cgroup_destroy_root
> begins.
This is a valid point.
> How can we guarantee this ordering? Therefore, I propose moving the draining operation
> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
> potential race condition. This patch implements that approach.
I acknowledge the issue (although rare in the real world). Some entity will
always have to wait for the offlining. It may be OK in cgroup_kill_sb
(ideally, if this was bound to the process context of the umount caller, not
sure if that's how kill_sb works).
I slightly dislike the form of an empty lock/unlock -- which makes me
wonder if this is the best solution.
Let me think more about this...
Thanks,
Michal
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-07-26 0:52 UTC
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/7/26 1:17, Michal Koutný wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0 CPU1
>>>> mount perf_event umount net_prio
>>>> cgroup1_get_tree cgroup_kill_sb
>>>> rebind_subsystems // root destruction enqueues
>>>> // cgroup_destroy_wq
>>>> // kill all perf_event css
>>>> // one perf_event css A is dying
>>>> // css A offline enqueues cgroup_destroy_wq
>>>> // root destruction will be executed first
>>>> css_free_rwork_fn
>>>> cgroup_destroy_root
>>>> cgroup_lock_and_drain_offline
>>>> // some perf descendants are dying
>>>> // cgroup_destroy_wq max_active = 1
>>>> // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of net_prio hierarhy waits for
>>> draining of the default hierachy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
>>
>> This was also first respond.
>>
>>> IOW why not this:
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>>>
>>> trace_cgroup_destroy_root(root);
>>>
>>> - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>>> + cgroup_lock_and_drain_offline(cgrp);
>>>
>>> BUG_ON(atomic_read(&root->nr_cgrps));
>>> BUG_ON(!list_empty(&cgrp->self.children));
>>>
>>> Does this correct the LTP scenario?
>>>
>>> Thanks,
>>> Michal
>>
>> I've tested this approach and discovered it can lead to another issue that required significant
>> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
>> draining of the default hierarchy.
>>
>> Consider this sequence:
>>
>> mount net_prio umount perf_event
>> cgroup1_get_tree
>> // &cgrp_dfl_root.cgrp
>> cgroup_lock_and_drain_offline
>> // wait for all perf_event csses dead
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> cgroup_destroy_root
>> // &root->cgrp, not cgrp_dfl_root
>> cgroup_lock_and_drain_offline
> perf_event's css (offline but dying)
>
>> rebind_subsystems
>> rcu_assign_pointer(dcgrp->subsys[ssid], css);
>> dst_root->subsys_mask |= 1 << ssid;
>> cgroup_propagate_control
>> // enable cgrp_dfl_root perf_event css
>> cgroup_apply_control_enable
>> css = cgroup_css(dsct, ss);
>> // since we drain root->cgrp not cgrp_dfl_root
>> // css(dying) is not null on the cgrp_dfl_root
>> // we won't create css, but the css is dying
>
> What would prevent seeing a dying css when
> cgrp_dfl_root is drained?
> (Or nothing drained as in the patch?)
> I assume you've seen this warning from
> cgroup_apply_control_enable
> WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
>
>
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
-- Yes, I have seen that warning.
Draining the cgrp_dfl_root can prevent seeing the dying css.
Q: When can a task waiting on offline_waitq be woken up?
A: When offline_css() is invoked, which also does:
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
If we drain cgrp_dfl_root, the drain traverses all the csses on the default hierarchy.
That means cgroup_lock_and_drain_offline() can only return once all the dying csses have
disappeared, which prevents us from seeing a dying css (see the sketch below).
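For reference, the drain loop looks roughly like the sketch below. It is a
paraphrase reconstructed from the traces in this thread plus memory, not a
verbatim copy of kernel/cgroup/cgroup.c, and refcounting on dsct is elided:

/* Paraphrased sketch of cgroup_lock_and_drain_offline(), not verbatim. */
void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
{
        struct cgroup *dsct;
        struct cgroup_subsys_state *d_css;
        struct cgroup_subsys *ss;
        int ssid;

restart:
        cgroup_lock();

        cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
                for_each_subsys(ss, ssid) {
                        struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
                        DEFINE_WAIT(wait);

                        if (!css || !percpu_ref_is_dying(&css->refcnt))
                                continue;

                        /* sleep until offline_css() wakes offline_waitq ... */
                        prepare_to_wait(&dsct->offline_waitq, &wait,
                                        TASK_UNINTERRUPTIBLE);
                        cgroup_unlock();
                        schedule();
                        finish_wait(&dsct->offline_waitq, &wait);

                        /* ... then rescan everything from the top */
                        goto restart;
                }
        }
}

Because the post-order walk covers every descendant of the cgroup passed in,
draining cgrp_dfl_root cannot return while any css on the default hierarchy is
still dying; draining only root->cgrp gives no such guarantee for the dfl
csses that rebind_subsystems() touches afterwards.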
>>
>> // got the offline_waitq wake up
>> goto restart;
>> // some perf_event dying csses are online now
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> // never get the offline_waitq wake up
>>
>> I encountered two main issues:
>> 1.Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
>
> Is this really resolved by the patch? (The questions above.)
>
>> 2.Potential hangs during cgrp_dfl_root draining in the mounting process
>
> Fortunately, the typical use case (mounting at boot) wouldn't suffer
> from this.
>
>> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that
>> offline css work(the cgroup_destroy_root need to drain) cannot be enqueued after cgroup_destroy_root
>> begins.
>
> This is a valid point.
>
>> How can we guarantee this ordering? Therefore, I propose moving the draining operation
>> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
>> potential race condition. This patch implements that approach.
>
> I acknowledge the issue (although rare in real world). Some entity will
> always have to wait of the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.
Thank you, I’d appreciate it if you could suggest a better solution.
Thanks,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-07-31 11:53 UTC
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
> I acknowledge the issue (although rare in real world). Some entity will
> always have to wait of the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.
>
> Let me think more about this...
>
> Thanks,
> Michal
Hi Michal,
Have you come up with a better solution for this?
Would appreciate your thoughts when you have time.
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Michal Koutný @ 2025-08-14 15:17 UTC
To: Chen Ridong
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
Hi Ridong.
On Thu, Jul 31, 2025 at 07:53:02PM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> Have you come up with a better solution for this?
> Would appreciate your thoughts when you have time.
Sorry for taking so long. (Also expect my next response here may be
slow.)
I tried reproducing it with the described LTP tests [1] (to get a better
idea about what needs to be offlined and why) but I cannot bring it to a
hang nor get a lockdep report. How do you launch the particular LTP tests
to trigger this?
Thanks,
Michal
[1]
while true ; do
/opt/ltp/testcases/bin/cgroup_fj_function.sh net_cls $pp
/opt/ltp/testcases/bin/cgroup_fj_function.sh perf_event
done
(with $pp being either `;` or `&`, the latter for concurrent runs; two vCPUs)
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-08-15 0:30 UTC
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/8/14 23:17, Michal Koutný wrote:
> Hi Ridong.
>
> On Thu, Jul 31, 2025 at 07:53:02PM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> Have you come up with a better solution for this?
>> Would appreciate your thoughts when you have time.
>
> Sorry for taking so long. (Also expect my next response here may be
> slow.)
> I tried reproducing it with the described LTP tests [1] (to get a better
> idea about what and why needs to be offlined) but I cannot bring it to a
> hang nor lockdep report. How do you launch the particular LTP tests to
> trigger this?
>
> Thanks,
> Michal
>
> [1]
> while true ; do
> /opt/ltp/testcases/bin/cgroup_fj_function.sh net_cls $pp
> /opt/ltp/testcases/bin/cgroup_fj_function.sh perf_event
> done
> (with pp both `;` or `&` for concurrent runs, two vCPUs)
Hi Michal,
I’ve provided details on how to reproduce the issue—it’s quite straightforward. For your reference,
here’s the link to the discussion:
https://lore.kernel.org/cgroups/btaaerpdl3bolxbysbqcig6kiccdgsoo32td64sk6yo4m5l5zy@nds6s35p6e6w/T/#m01f1229f2e84e1186abaf04378b4c0f47151f7e4
Let me know if you need further clarification.
--
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Hillf Danton @ 2025-08-15 2:40 UTC
To: Michal Koutný, Chen Ridong
Cc: tj, cgroups, linux-kernel, lujialin4, chenridong, gaoyingjie
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> > On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> >> CPU0 CPU1
> >> mount perf_event umount net_prio
> >> cgroup1_get_tree cgroup_kill_sb
> >> rebind_subsystems // root destruction enqueues
> >> // cgroup_destroy_wq
> >> // kill all perf_event css
> >> // one perf_event css A is dying
> >> // css A offline enqueues cgroup_destroy_wq
> >> // root destruction will be executed first
> >> css_free_rwork_fn
> >> cgroup_destroy_root
> >> cgroup_lock_and_drain_offline
> >> // some perf descendants are dying
> >> // cgroup_destroy_wq max_active = 1
> >> // waiting for css A to die
> >>
> >> Problem scenario:
> >> 1. CPU0 mounts perf_event (rebind_subsystems)
> >> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> >> 3. A dying perf_event CSS gets queued for offline after root destruction
> >> 4. Root destruction waits for offline completion, but offline work is
> >> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
> >
> > What's concerning me is why umount of net_prio hierarhy waits for
> > draining of the default hierachy? (Where you then run into conflict with
> > perf_event that's implicit_on_dfl.)
> >
/*
* cgroup destruction makes heavy use of work items and there can be a lot
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
*/
If the task hang can be reliably reproduced, it is the right time to cut
max_active off for cgroup_destroy_wq, according to its comment.
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-08-15 7:24 UTC
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/7/26 1:17, Michal Koutný wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0 CPU1
>>>> mount perf_event umount net_prio
>>>> cgroup1_get_tree cgroup_kill_sb
>>>> rebind_subsystems // root destruction enqueues
>>>> // cgroup_destroy_wq
>>>> // kill all perf_event css
>>>> // one perf_event css A is dying
>>>> // css A offline enqueues cgroup_destroy_wq
>>>> // root destruction will be executed first
>>>> css_free_rwork_fn
>>>> cgroup_destroy_root
>>>> cgroup_lock_and_drain_offline
>>>> // some perf descendants are dying
>>>> // cgroup_destroy_wq max_active = 1
>>>> // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of net_prio hierarhy waits for
>>> draining of the default hierachy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
>>
>> This was also first respond.
>>
>>> IOW why not this:
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>>>
>>> trace_cgroup_destroy_root(root);
>>>
>>> - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>>> + cgroup_lock_and_drain_offline(cgrp);
>>>
>>> BUG_ON(atomic_read(&root->nr_cgrps));
>>> BUG_ON(!list_empty(&cgrp->self.children));
>>>
>>> Does this correct the LTP scenario?
>>>
>>> Thanks,
>>> Michal
>>
>> I've tested this approach and discovered it can lead to another issue that required significant
>> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
>> draining of the default hierarchy.
>>
>> Consider this sequence:
>>
>> mount net_prio umount perf_event
>> cgroup1_get_tree
>> // &cgrp_dfl_root.cgrp
>> cgroup_lock_and_drain_offline
>> // wait for all perf_event csses dead
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> cgroup_destroy_root
>> // &root->cgrp, not cgrp_dfl_root
>> cgroup_lock_and_drain_offline
> perf_event's css (offline but dying)
>
>> rebind_subsystems
>> rcu_assign_pointer(dcgrp->subsys[ssid], css);
>> dst_root->subsys_mask |= 1 << ssid;
>> cgroup_propagate_control
>> // enable cgrp_dfl_root perf_event css
>> cgroup_apply_control_enable
>> css = cgroup_css(dsct, ss);
>> // since we drain root->cgrp not cgrp_dfl_root
>> // css(dying) is not null on the cgrp_dfl_root
>> // we won't create css, but the css is dying
>
> What would prevent seeing a dying css when
> cgrp_dfl_root is drained?
> (Or nothing drained as in the patch?)
>
> I assume you've seen this warning from
> cgroup_apply_control_enable
> WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
>
>
>>
>> // got the offline_waitq wake up
>> goto restart;
>> // some perf_event dying csses are online now
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> // never get the offline_waitq wake up
>>
>> I encountered two main issues:
>> 1.Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
>
> Is this really resolved by the patch? (The questions above.)
>
>> 2.Potential hangs during cgrp_dfl_root draining in the mounting process
>
> Fortunately, the typical use case (mounting at boot) wouldn't suffer
> from this.
>
>> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that
>> offline css work(the cgroup_destroy_root need to drain) cannot be enqueued after cgroup_destroy_root
>> begins.
>
> This is a valid point.
>
>> How can we guarantee this ordering? Therefore, I propose moving the draining operation
>> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
>> potential race condition. This patch implements that approach.
>
> I acknowledge the issue (although rare in real world). Some entity will
> always have to wait of the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.
>
> Let me think more about this...
>
> Thanks,
> Michal
I've just submitted v3, which I believe provides an improved solution to the issue we've been
discussing.
The v3 version is available here:
https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u
I'd be truly grateful if you could spare some time to review it.
--
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-08-15 7:29 UTC
To: Hillf Danton, Michal Koutný
Cc: tj, cgroups, linux-kernel, lujialin4, chenridong, gaoyingjie
On 2025/8/15 10:40, Hillf Danton wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0 CPU1
>>>> mount perf_event umount net_prio
>>>> cgroup1_get_tree cgroup_kill_sb
>>>> rebind_subsystems // root destruction enqueues
>>>> // cgroup_destroy_wq
>>>> // kill all perf_event css
>>>> // one perf_event css A is dying
>>>> // css A offline enqueues cgroup_destroy_wq
>>>> // root destruction will be executed first
>>>> css_free_rwork_fn
>>>> cgroup_destroy_root
>>>> cgroup_lock_and_drain_offline
>>>> // some perf descendants are dying
>>>> // cgroup_destroy_wq max_active = 1
>>>> // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of net_prio hierarhy waits for
>>> draining of the default hierachy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
> /*
> * cgroup destruction makes heavy use of work items and there can be a lot
> * of concurrent destructions. Use a separate workqueue so that cgroup
> * destruction work items don't end up filling up max_active of system_wq
> * which may lead to deadlock.
> */
>
> If task hung could be reliably reproduced, it is right time to cut
> max_active off for cgroup_destroy_wq according to its comment.
Hi Danton,
Thank you for your feedback.
While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
addresses the issue more comprehensively.
I’d be very grateful if you could take a look and share your thoughts. Your review would be greatly
appreciated!
v3: https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u
--
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Hillf Danton @ 2025-08-15 10:02 UTC
To: Chen Ridong
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
>On 2025/8/15 10:40, Hillf Danton wrote:
>> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>> CPU0 CPU1
>>>>> mount perf_event umount net_prio
>>>>> cgroup1_get_tree cgroup_kill_sb
>>>>> rebind_subsystems // root destruction enqueues
>>>>> // cgroup_destroy_wq
>>>>> // kill all perf_event css
>>>>> // one perf_event css A is dying
>>>>> // css A offline enqueues cgroup_destroy_wq
>>>>> // root destruction will be executed first
>>>>> css_free_rwork_fn
>>>>> cgroup_destroy_root
>>>>> cgroup_lock_and_drain_offline
>>>>> // some perf descendants are dying
>>>>> // cgroup_destroy_wq max_active = 1
>>>>> // waiting for css A to die
>>>>>
>>>>> Problem scenario:
>>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>>
>>>> What's concerning me is why umount of net_prio hierarhy waits for
>>>> draining of the default hierachy? (Where you then run into conflict with
>>>> perf_event that's implicit_on_dfl.)
>>>>
>> /*
>> * cgroup destruction makes heavy use of work items and there can be a lot
>> * of concurrent destructions. Use a separate workqueue so that cgroup
>> * destruction work items don't end up filling up max_active of system_wq
>> * which may lead to deadlock.
>> */
>>
>> If task hung could be reliably reproduced, it is right time to cut
>> max_active off for cgroup_destroy_wq according to its comment.
>
>Hi Danton,
>
>Thank you for your feedback.
>
>While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
>side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
>addresses the issue more comprehensively.
>
Given your reproducer [1], it is simple to test with max_active cut off.
Frankly, I do not think v3 is a correct fix because it leaves the root cause
intact. Nor is it cgroup specific, even given the high concurrency in destruction.
[1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-08-15 10:28 UTC
To: Hillf Danton
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/8/15 18:02, Hillf Danton wrote:
> On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
>> On 2025/8/15 10:40, Hillf Danton wrote:
>>> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>>> CPU0 CPU1
>>>>>> mount perf_event umount net_prio
>>>>>> cgroup1_get_tree cgroup_kill_sb
>>>>>> rebind_subsystems // root destruction enqueues
>>>>>> // cgroup_destroy_wq
>>>>>> // kill all perf_event css
>>>>>> // one perf_event css A is dying
>>>>>> // css A offline enqueues cgroup_destroy_wq
>>>>>> // root destruction will be executed first
>>>>>> css_free_rwork_fn
>>>>>> cgroup_destroy_root
>>>>>> cgroup_lock_and_drain_offline
>>>>>> // some perf descendants are dying
>>>>>> // cgroup_destroy_wq max_active = 1
>>>>>> // waiting for css A to die
>>>>>>
>>>>>> Problem scenario:
>>>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>>>
>>>>> What's concerning me is why umount of net_prio hierarhy waits for
>>>>> draining of the default hierachy? (Where you then run into conflict with
>>>>> perf_event that's implicit_on_dfl.)
>>>>>
>>> /*
>>> * cgroup destruction makes heavy use of work items and there can be a lot
>>> * of concurrent destructions. Use a separate workqueue so that cgroup
>>> * destruction work items don't end up filling up max_active of system_wq
>>> * which may lead to deadlock.
>>> */
>>>
>>> If task hung could be reliably reproduced, it is right time to cut
>>> max_active off for cgroup_destroy_wq according to its comment.
>>
>> Hi Danton,
>>
>> Thank you for your feedback.
>>
>> While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
>> side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
>> addresses the issue more comprehensively.
>>
> Given your reproducer [1], it is simple to test with max_active cut.
>
> I do not think v3 is a correct fix frankly because it leaves the root cause
> intact. Nor is it cgroup specific even given high concurrency in destruction.
>
> [1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
Hi Danton,
Thank you for your reply.
To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
cgroup_destroy_wq to 1?
Note that cgroup_destroy_wq is already created with max_active=1:
```cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);```
The v3 changes prevent subsystem root destruction from being blocked by unrelated subsystem offline
events. Since root destruction should only proceed after all descendants are destroyed, it shouldn't
be blocked by children offline events. My testing with the reproducer confirms this fixes the issue
I encountered.
--
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Hillf Danton @ 2025-08-15 11:54 UTC
To: Chen Ridong
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
>
> To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
> cgroup_destroy_wq to 1?
>
cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Chen Ridong @ 2025-08-16 0:33 UTC
To: Hillf Danton
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/8/15 19:54, Hillf Danton wrote:
> On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
>>
>> To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
>> cgroup_destroy_wq to 1?
>>
> cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
Thank you for the additional clarification.
While this modification is functional, I’m concerned it might spawn a significant number of cgroup
destruction tasks—most of which would contend for cgroup_mutex. Could this waste system resources?
--
Best regards,
Ridong
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
From: Hillf Danton @ 2025-08-16 0:57 UTC
To: Chen Ridong
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Sat, 16 Aug 2025 08:33:55 +0800 Chen Ridong wrote:
> On 2025/8/15 19:54, Hillf Danton wrote:
> > On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
> >>
> >> To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
> >> cgroup_destroy_wq to 1?
> >>
> > cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
>
> Thank you for the additional clarification.
>
> While this modification is functional, I’m concerned it might spawn a significant number of cgroup
> destruction tasks—most of which would contend for cgroup_mutex. Could this waste system resources?
>
No win comes from nothing, so you need to pay $.02 for example, at least
for curing the task hang by erasing the root cause.