* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-07-25 17:17 ` Michal Koutný
@ 2025-07-26 0:52 ` Chen Ridong
2025-07-31 11:53 ` Chen Ridong
` (2 subsequent siblings)
3 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-07-26 0:52 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/7/26 1:17, Michal Koutný wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0                                          CPU1
>>>> mount perf_event                              umount net_prio
>>>> cgroup1_get_tree                              cgroup_kill_sb
>>>> rebind_subsystems                             // root destruction enqueues
>>>>                                               // cgroup_destroy_wq
>>>> // kill all perf_event css
>>>> // one perf_event css A is dying
>>>> // css A offline enqueues cgroup_destroy_wq
>>>>                                               // root destruction will be executed first
>>>>                                               css_free_rwork_fn
>>>>                                               cgroup_destroy_root
>>>>                                               cgroup_lock_and_drain_offline
>>>>                                               // some perf descendants are dying
>>>>                                               // cgroup_destroy_wq max_active = 1
>>>>                                               // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of net_prio hierarchy waits for
>>> draining of the default hierarchy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
>>
>> This was also my first response.
>>
>>> IOW why not this:
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>>>
>>> trace_cgroup_destroy_root(root);
>>>
>>> - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>>> + cgroup_lock_and_drain_offline(cgrp);
>>>
>>> BUG_ON(atomic_read(&root->nr_cgrps));
>>> BUG_ON(!list_empty(&cgrp->self.children));
>>>
>>> Does this correct the LTP scenario?
>>>
>>> Thanks,
>>> Michal
>>
>> I've tested this approach and discovered it can lead to another issue that required significant
>> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
>> draining of the default hierarchy.
>>
>> Consider this sequence:
>>
>> mount net_prio                           umount perf_event
>> cgroup1_get_tree
>> // &cgrp_dfl_root.cgrp
>> cgroup_lock_and_drain_offline
>> // wait for all perf_event csses dead
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>>                                          cgroup_destroy_root
>>                                          // &root->cgrp, not cgrp_dfl_root
>>                                          cgroup_lock_and_drain_offline
> perf_event's css (offline but dying)
>
>>                                          rebind_subsystems
>>                                          rcu_assign_pointer(dcgrp->subsys[ssid], css);
>>                                          dst_root->subsys_mask |= 1 << ssid;
>>                                          cgroup_propagate_control
>>                                          // enable cgrp_dfl_root perf_event css
>>                                          cgroup_apply_control_enable
>>                                          css = cgroup_css(dsct, ss);
>>                                          // since we drain root->cgrp not cgrp_dfl_root
>>                                          // css(dying) is not null on the cgrp_dfl_root
>>                                          // we won't create css, but the css is dying
>
> What would prevent seeing a dying css when
> cgrp_dfl_root is drained?
> (Or nothing drained as in the patch?)
> I assume you've seen this warning from
> cgroup_apply_control_enable
> WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
>
>
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
-- Yes, that is the warning I saw.

Draining cgrp_dfl_root does prevent seeing the dying css.

Q: When can a task waiting on offline_waitq be woken up?
A: When offline_css() is invoked, which does:
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
and then wakes up offline_waitq.

If we drain cgrp_dfl_root, the drain traverses all of its csses. That means
cgroup_lock_and_drain_offline() can only return once every dying css has
disappeared, which prevents seeing a dying css.
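For reference, a paraphrased/abridged sketch of that drain loop (based on my
reading of kernel/cgroup/cgroup.c; exact details differ between kernel versions):

void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
        __acquires(&cgroup_mutex)
{
        struct cgroup *dsct;
        struct cgroup_subsys_state *d_css;
        struct cgroup_subsys *ss;
        int ssid;

restart:
        cgroup_lock();

        /* walk every descendant of @cgrp and every subsystem */
        cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
                for_each_subsys(ss, ssid) {
                        struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
                        DEFINE_WAIT(wait);

                        /* only dying csses are waited for */
                        if (!css || !percpu_ref_is_dying(&css->refcnt))
                                continue;

                        cgroup_get_live(dsct);
                        prepare_to_wait(&dsct->offline_waitq, &wait,
                                        TASK_UNINTERRUPTIBLE);

                        /* drop cgroup_mutex, sleep until offline_css() wakes us */
                        cgroup_unlock();
                        schedule();
                        finish_wait(&dsct->offline_waitq, &wait);
                        cgroup_put(dsct);

                        /* rescan from scratch; return only when nothing is dying */
                        goto restart;
                }
        }
}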
>>
>> // got the offline_waitq wake up
>> goto restart;
>> // some perf_event dying csses are online now
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> // never get the offline_waitq wake up
>>
>> I encountered two main issues:
>> 1.Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
>
> Is this really resolved by the patch? (The questions above.)
>
>> 2.Potential hangs during cgrp_dfl_root draining in the mounting process
>
> Fortunately, the typical use case (mounting at boot) wouldn't suffer
> from this.
>
>> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that
>> offline css work (which cgroup_destroy_root needs to drain) cannot be enqueued after cgroup_destroy_root
>> begins.
>
> This is a valid point.
>
>> How can we guarantee this ordering? Therefore, I propose moving the draining operation
>> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
>> potential race condition. This patch implements that approach.
>
> I acknowledge the issue (although rare in the real world). Some entity will
> always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.
Thank you, I’d appreciate it if you could suggest a better solution.
Thanks,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-07-25 17:17 ` Michal Koutný
2025-07-26 0:52 ` Chen Ridong
@ 2025-07-31 11:53 ` Chen Ridong
2025-08-14 15:17 ` Michal Koutný
2025-08-15 2:40 ` Hillf Danton
2025-08-15 7:24 ` Chen Ridong
3 siblings, 1 reply; 17+ messages in thread
From: Chen Ridong @ 2025-07-31 11:53 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
> I acknowledge the issue (although rare in the real world). Some entity will
> always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.
>
> Let me think more about this...
>
> Thanks,
> Michal
Hi Michal,
Have you come up with a better solution for this?
Would appreciate your thoughts when you have time.
Best regards,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-07-31 11:53 ` Chen Ridong
@ 2025-08-14 15:17 ` Michal Koutný
2025-08-15 0:30 ` Chen Ridong
0 siblings, 1 reply; 17+ messages in thread
From: Michal Koutný @ 2025-08-14 15:17 UTC (permalink / raw)
To: Chen Ridong
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
Hi Ridong.
On Thu, Jul 31, 2025 at 07:53:02PM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> Have you come up with a better solution for this?
> Would appreciate your thoughts when you have time.
Sorry for taking so long. (Also expect my next response here may be
slow.)
I tried reproducing it with the described LTP tests [1] (to get a better
idea about what and why needs to be offlined) but I cannot bring it to a
hang nor lockdep report. How do you launch the particular LTP tests to
trigger this?
Thanks,
Michal
[1]
while true ; do
/opt/ltp/testcases/bin/cgroup_fj_function.sh net_cls $pp
/opt/ltp/testcases/bin/cgroup_fj_function.sh perf_event
done
(with pp set to either `;` or `&`, the latter for concurrent runs; two vCPUs)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-14 15:17 ` Michal Koutný
@ 2025-08-15 0:30 ` Chen Ridong
0 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-08-15 0:30 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/8/14 23:17, Michal Koutný wrote:
> Hi Ridong.
>
> On Thu, Jul 31, 2025 at 07:53:02PM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>> Have you come up with a better solution for this?
>> Would appreciate your thoughts when you have time.
>
> Sorry for taking so long. (Also expect my next response here may be
> slow.)
> I tried reproducing it with the described LTP tests [1] (to get a better
> idea about what and why needs to be offlined) but I cannot bring it to a
> hang nor lockdep report. How do you launch the particular LTP tests to
> trigger this?
>
> Thanks,
> Michal
>
> [1]
> while true ; do
> /opt/ltp/testcases/bin/cgroup_fj_function.sh net_cls $pp
> /opt/ltp/testcases/bin/cgroup_fj_function.sh perf_event
> done
> (with pp set to either `;` or `&`, the latter for concurrent runs; two vCPUs)
Hi Michal,
I’ve provided details on how to reproduce the issue—it’s quite straightforward. For your reference,
here’s the link to the discussion:
https://lore.kernel.org/cgroups/btaaerpdl3bolxbysbqcig6kiccdgsoo32td64sk6yo4m5l5zy@nds6s35p6e6w/T/#m01f1229f2e84e1186abaf04378b4c0f47151f7e4
Let me know if you need further clarification.
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-07-25 17:17 ` Michal Koutný
2025-07-26 0:52 ` Chen Ridong
2025-07-31 11:53 ` Chen Ridong
@ 2025-08-15 2:40 ` Hillf Danton
2025-08-15 7:29 ` Chen Ridong
2025-08-15 7:24 ` Chen Ridong
3 siblings, 1 reply; 17+ messages in thread
From: Hillf Danton @ 2025-08-15 2:40 UTC (permalink / raw)
To: Michal Koutný, Chen Ridong
Cc: tj, cgroups, linux-kernel, lujialin4, chenridong, gaoyingjie
On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
> > On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
> >> CPU0                                          CPU1
> >> mount perf_event                              umount net_prio
> >> cgroup1_get_tree                              cgroup_kill_sb
> >> rebind_subsystems                             // root destruction enqueues
> >>                                               // cgroup_destroy_wq
> >> // kill all perf_event css
> >> // one perf_event css A is dying
> >> // css A offline enqueues cgroup_destroy_wq
> >>                                               // root destruction will be executed first
> >>                                               css_free_rwork_fn
> >>                                               cgroup_destroy_root
> >>                                               cgroup_lock_and_drain_offline
> >>                                               // some perf descendants are dying
> >>                                               // cgroup_destroy_wq max_active = 1
> >>                                               // waiting for css A to die
> >>
> >> Problem scenario:
> >> 1. CPU0 mounts perf_event (rebind_subsystems)
> >> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
> >> 3. A dying perf_event CSS gets queued for offline after root destruction
> >> 4. Root destruction waits for offline completion, but offline work is
> >> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
> >
> > What's concerning me is why umount of net_prio hierarchy waits for
> > draining of the default hierarchy? (Where you then run into conflict with
> > perf_event that's implicit_on_dfl.)
> >
/*
* cgroup destruction makes heavy use of work items and there can be a lot
* of concurrent destructions. Use a separate workqueue so that cgroup
* destruction work items don't end up filling up max_active of system_wq
* which may lead to deadlock.
*/
If the task hang can be reliably reproduced, it is the right time to cut
max_active off for cgroup_destroy_wq, according to its comment.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-15 2:40 ` Hillf Danton
@ 2025-08-15 7:29 ` Chen Ridong
2025-08-15 10:02 ` Hillf Danton
0 siblings, 1 reply; 17+ messages in thread
From: Chen Ridong @ 2025-08-15 7:29 UTC (permalink / raw)
To: Hillf Danton, Michal Koutný
Cc: tj, cgroups, linux-kernel, lujialin4, chenridong, gaoyingjie
On 2025/8/15 10:40, Hillf Danton wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0                                          CPU1
>>>> mount perf_event                              umount net_prio
>>>> cgroup1_get_tree                              cgroup_kill_sb
>>>> rebind_subsystems                             // root destruction enqueues
>>>>                                               // cgroup_destroy_wq
>>>> // kill all perf_event css
>>>> // one perf_event css A is dying
>>>> // css A offline enqueues cgroup_destroy_wq
>>>>                                               // root destruction will be executed first
>>>>                                               css_free_rwork_fn
>>>>                                               cgroup_destroy_root
>>>>                                               cgroup_lock_and_drain_offline
>>>>                                               // some perf descendants are dying
>>>>                                               // cgroup_destroy_wq max_active = 1
>>>>                                               // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of net_prio hierarchy waits for
>>> draining of the default hierarchy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
> /*
> * cgroup destruction makes heavy use of work items and there can be a lot
> * of concurrent destructions. Use a separate workqueue so that cgroup
> * destruction work items don't end up filling up max_active of system_wq
> * which may lead to deadlock.
> */
>
> If the task hang can be reliably reproduced, it is the right time to cut
> max_active off for cgroup_destroy_wq, according to its comment.
Hi Danton,
Thank you for your feedback.
While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
addresses the issue more comprehensively.
I’d be very grateful if you could take a look and share your thoughts. Your review would be greatly
appreciated!
v3: https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-15 7:29 ` Chen Ridong
@ 2025-08-15 10:02 ` Hillf Danton
2025-08-15 10:28 ` Chen Ridong
0 siblings, 1 reply; 17+ messages in thread
From: Hillf Danton @ 2025-08-15 10:02 UTC (permalink / raw)
To: Chen Ridong
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
>On 2025/8/15 10:40, Hillf Danton wrote:
>> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>> CPU0                                          CPU1
>>>>> mount perf_event                              umount net_prio
>>>>> cgroup1_get_tree                              cgroup_kill_sb
>>>>> rebind_subsystems                             // root destruction enqueues
>>>>>                                               // cgroup_destroy_wq
>>>>> // kill all perf_event css
>>>>> // one perf_event css A is dying
>>>>> // css A offline enqueues cgroup_destroy_wq
>>>>>                                               // root destruction will be executed first
>>>>>                                               css_free_rwork_fn
>>>>>                                               cgroup_destroy_root
>>>>>                                               cgroup_lock_and_drain_offline
>>>>>                                               // some perf descendants are dying
>>>>>                                               // cgroup_destroy_wq max_active = 1
>>>>>                                               // waiting for css A to die
>>>>>
>>>>> Problem scenario:
>>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>>
>>>> What's concerning me is why umount of net_prio hierarchy waits for
>>>> draining of the default hierarchy? (Where you then run into conflict with
>>>> perf_event that's implicit_on_dfl.)
>>>>
>> /*
>> * cgroup destruction makes heavy use of work items and there can be a lot
>> * of concurrent destructions. Use a separate workqueue so that cgroup
>> * destruction work items don't end up filling up max_active of system_wq
>> * which may lead to deadlock.
>> */
>>
>> If the task hang can be reliably reproduced, it is the right time to cut
>> max_active off for cgroup_destroy_wq, according to its comment.
>
>Hi Danton,
>
>Thank you for your feedback.
>
>While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
>side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
>addresses the issue more comprehensively.
>
Given your reproducer [1], it is simple to test with max_active cut.
I do not think v3 is a correct fix frankly because it leaves the root cause
intact. Nor is it cgroup specific even given high concurrency in destruction.
[1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-15 10:02 ` Hillf Danton
@ 2025-08-15 10:28 ` Chen Ridong
2025-08-15 11:54 ` Hillf Danton
0 siblings, 1 reply; 17+ messages in thread
From: Chen Ridong @ 2025-08-15 10:28 UTC (permalink / raw)
To: Hillf Danton
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/8/15 18:02, Hillf Danton wrote:
> On Fri, 15 Aug 2025 15:29:56 +0800 Chen Ridong wrote:
>> On 2025/8/15 10:40, Hillf Danton wrote:
>>> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>>>> CPU0                                          CPU1
>>>>>> mount perf_event                              umount net_prio
>>>>>> cgroup1_get_tree                              cgroup_kill_sb
>>>>>> rebind_subsystems                             // root destruction enqueues
>>>>>>                                               // cgroup_destroy_wq
>>>>>> // kill all perf_event css
>>>>>> // one perf_event css A is dying
>>>>>> // css A offline enqueues cgroup_destroy_wq
>>>>>>                                               // root destruction will be executed first
>>>>>>                                               css_free_rwork_fn
>>>>>>                                               cgroup_destroy_root
>>>>>>                                               cgroup_lock_and_drain_offline
>>>>>>                                               // some perf descendants are dying
>>>>>>                                               // cgroup_destroy_wq max_active = 1
>>>>>>                                               // waiting for css A to die
>>>>>>
>>>>>> Problem scenario:
>>>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>>>
>>>>> What's concerning me is why umount of net_prio hierarchy waits for
>>>>> draining of the default hierarchy? (Where you then run into conflict with
>>>>> perf_event that's implicit_on_dfl.)
>>>>>
>>> /*
>>> * cgroup destruction makes heavy use of work items and there can be a lot
>>> * of concurrent destructions. Use a separate workqueue so that cgroup
>>> * destruction work items don't end up filling up max_active of system_wq
>>> * which may lead to deadlock.
>>> */
>>>
>>> If the task hang can be reliably reproduced, it is the right time to cut
>>> max_active off for cgroup_destroy_wq, according to its comment.
>>
>> Hi Danton,
>>
>> Thank you for your feedback.
>>
>> While modifying max_active could be a viable solution, I’m unsure whether it might introduce other
>> side effects. Instead, I’ve proposed an alternative approach in v3 of the patch, which I believe
>> addresses the issue more comprehensively.
>>
> Given your reproducer [1], it is simple to test with max_active cut.
>
> I do not think v3 is a correct fix frankly because it leaves the root cause
> intact. Nor is it cgroup specific even given high concurrency in destruction.
>
> [1] https://lore.kernel.org/lkml/39e05402-40c7-4631-a87b-8e3747ceddc6@huaweicloud.com/
Hi Danton,
Thank you for your reply.
To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
cgroup_destroy_wq to 1?
Note that cgroup_destroy_wq is already created with max_active=1:
```cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);```
The v3 changes prevent subsystem root destruction from being blocked by unrelated subsystems'
offline events. Since root destruction only proceeds once all of its descendants have been
destroyed, it should not be blocked by child offline events at all. My testing with the reproducer
confirms that this fixes the issue I encountered.
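For context, the offline work that root destruction ends up waiting for is queued on the very same
cgroup_destroy_wq. A paraphrased sketch of that path (from my reading of kernel/cgroup/cgroup.c;
details may differ between kernel versions):

/* confirm-kill callback for the css percpu refcount, set by percpu_ref_kill_and_confirm() */
static void css_killed_ref_fn(struct percpu_ref *ref)
{
        struct cgroup_subsys_state *css =
                container_of(ref, struct cgroup_subsys_state, refcnt);

        if (atomic_dec_and_test(&css->online_cnt)) {
                /*
                 * css_killed_work_fn() calls offline_css(), which wakes
                 * offline_waitq; but it runs on cgroup_destroy_wq, behind
                 * any work item queued earlier (such as root destruction).
                 */
                INIT_WORK(&css->destroy_work, css_killed_work_fn);
                queue_work(cgroup_destroy_wq, &css->destroy_work);
        }
}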
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-15 10:28 ` Chen Ridong
@ 2025-08-15 11:54 ` Hillf Danton
2025-08-16 0:33 ` Chen Ridong
0 siblings, 1 reply; 17+ messages in thread
From: Hillf Danton @ 2025-08-15 11:54 UTC (permalink / raw)
To: Chen Ridong
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
>
> To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
> cgroup_destroy_wq to 1?
>
cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-15 11:54 ` Hillf Danton
@ 2025-08-16 0:33 ` Chen Ridong
2025-08-16 0:57 ` Hillf Danton
0 siblings, 1 reply; 17+ messages in thread
From: Chen Ridong @ 2025-08-16 0:33 UTC (permalink / raw)
To: Hillf Danton
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/8/15 19:54, Hillf Danton wrote:
> On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
>>
>> To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
>> cgroup_destroy_wq to 1?
>>
> cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
Thank you for the additional clarification.
While this modification is functional, I’m concerned it might spawn a significant number of cgroup
destruction tasks—most of which would contend for cgroup_mutex. Could this waste system resources?
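(For reference, if I read the workqueue code correctly, a max_active of 0 makes alloc_workqueue()
fall back to the default limit, so quite a few destroy work items could run concurrently:)

/*
 * Paraphrased from include/linux/workqueue.h; values may differ between
 * kernel versions. alloc_workqueue() treats max_active == 0 as WQ_DFL_ACTIVE.
 */
enum {
        WQ_MAX_ACTIVE   = 512,
        WQ_DFL_ACTIVE   = WQ_MAX_ACTIVE / 2,    /* i.e. up to 256 concurrent work items */
};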
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-08-16 0:33 ` Chen Ridong
@ 2025-08-16 0:57 ` Hillf Danton
0 siblings, 0 replies; 17+ messages in thread
From: Hillf Danton @ 2025-08-16 0:57 UTC (permalink / raw)
To: Chen Ridong
Cc: Michal Koutny, tj, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On Sat, 16 Aug 2025 08:33:55 +0800 Chen Ridong wrote:
> On 2025/8/15 19:54, Hillf Danton wrote:
> > On Fri, 15 Aug 2025 18:28:53 +0800 Chen Ridong wrote:
> >>
> >> To clarify, when you mentioned "cut max_active off", did you mean setting max_active of
> >> cgroup_destroy_wq to 1?
> >>
> > cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 0);
>
> Thank you for the additional clarification.
>
> While this modification is functional, I’m concerned it might spawn a significant number of cgroup
> destruction tasks—most of which would contend for cgroup_mutex. Could this waste system resources?
>
No win comes from nothing, so you need to pay $.02 for example at least
for curing the task hang by erasing the root cause.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
2025-07-25 17:17 ` Michal Koutný
` (2 preceding siblings ...)
2025-08-15 2:40 ` Hillf Danton
@ 2025-08-15 7:24 ` Chen Ridong
3 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-08-15 7:24 UTC (permalink / raw)
To: Michal Koutný
Cc: tj, hannes, lizefan, cgroups, linux-kernel, lujialin4, chenridong,
gaoyingjie
On 2025/7/26 1:17, Michal Koutný wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0                                          CPU1
>>>> mount perf_event                              umount net_prio
>>>> cgroup1_get_tree                              cgroup_kill_sb
>>>> rebind_subsystems                             // root destruction enqueues
>>>>                                               // cgroup_destroy_wq
>>>> // kill all perf_event css
>>>> // one perf_event css A is dying
>>>> // css A offline enqueues cgroup_destroy_wq
>>>>                                               // root destruction will be executed first
>>>>                                               css_free_rwork_fn
>>>>                                               cgroup_destroy_root
>>>>                                               cgroup_lock_and_drain_offline
>>>>                                               // some perf descendants are dying
>>>>                                               // cgroup_destroy_wq max_active = 1
>>>>                                               // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>> blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of net_prio hierarchy waits for
>>> draining of the default hierarchy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
>>
>> This was also my first response.
>>
>>> IOW why not this:
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>>>
>>> trace_cgroup_destroy_root(root);
>>>
>>> - cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>>> + cgroup_lock_and_drain_offline(cgrp);
>>>
>>> BUG_ON(atomic_read(&root->nr_cgrps));
>>> BUG_ON(!list_empty(&cgrp->self.children));
>>>
>>> Does this correct the LTP scenario?
>>>
>>> Thanks,
>>> Michal
>>
>> I've tested this approach and discovered it can lead to another issue that required significant
>> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
>> draining of the default hierarchy.
>>
>> Consider this sequence:
>>
>> mount net_prio                           umount perf_event
>> cgroup1_get_tree
>> // &cgrp_dfl_root.cgrp
>> cgroup_lock_and_drain_offline
>> // wait for all perf_event csses dead
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>>                                          cgroup_destroy_root
>>                                          // &root->cgrp, not cgrp_dfl_root
>>                                          cgroup_lock_and_drain_offline
> perf_event's css (offline but dying)
>
>>                                          rebind_subsystems
>>                                          rcu_assign_pointer(dcgrp->subsys[ssid], css);
>>                                          dst_root->subsys_mask |= 1 << ssid;
>>                                          cgroup_propagate_control
>>                                          // enable cgrp_dfl_root perf_event css
>>                                          cgroup_apply_control_enable
>>                                          css = cgroup_css(dsct, ss);
>>                                          // since we drain root->cgrp not cgrp_dfl_root
>>                                          // css(dying) is not null on the cgrp_dfl_root
>>                                          // we won't create css, but the css is dying
>
> What would prevent seeing a dying css when
> cgrp_dfl_root is drained?
> (Or nothing drained as in the patch?)
>
> I assume you've seen this warning from
> cgroup_apply_control_enable
> WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
>
>
>>
>> // got the offline_waitq wake up
>> goto restart;
>> // some perf_event dying csses are online now
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> // never get the offline_waitq wake up
>>
>> I encountered two main issues:
>> 1.Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
>
> Is this really resolved by the patch? (The questions above.)
>
>> 2.Potential hangs during cgrp_dfl_root draining in the mounting process
>
> Fortunately, the typical use case (mounting at boot) wouldn't suffer
> from this.
>
>> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that
>> offline css work (which cgroup_destroy_root needs to drain) cannot be enqueued after cgroup_destroy_root
>> begins.
>
> This is a valid point.
>
>> How can we guarantee this ordering? Therefore, I propose moving the draining operation
>> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
>> potential race condition. This patch implements that approach.
>
> I acknowledge the issue (although rare in the real world). Some entity will
> always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.
>
> Let me think more about this...
>
> Thanks,
> Michal
I've just submitted v3, which I believe provides an improved solution to the issue we've been
discussing.
The v3 version is available here:
https://lore.kernel.org/cgroups/20250815070518.1255842-1-chenridong@huaweicloud.com/T/#u
I'd be truly grateful if you could spare some time to review it.
--
Best regards,
Ridong
^ permalink raw reply [flat|nested] 17+ messages in thread