cgroups.vger.kernel.org archive mirror
From: Chen Ridong <chenridong@huaweicloud.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: tj@kernel.org, hannes@cmpxchg.org, lizefan@huawei.com,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	lujialin4@huawei.com, chenridong@huawei.com,
	gaoyingjie@uniontech.com
Subject: Re: [PATCH v2 -next] cgroup: remove offline draining in root destruction to avoid hung_tasks
Date: Sat, 26 Jul 2025 08:52:44 +0800
Message-ID: <179f706c-b04d-4fd5-b896-0abfc546528f@huaweicloud.com>
In-Reply-To: <htzudoa4cgius7ncus67axelhv3qh6fgjgnvju27fuyw7gimla@uzrta5sfbh2w>



On 2025/7/26 1:17, Michal Koutný wrote:
> On Fri, Jul 25, 2025 at 09:42:05AM +0800, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On Tue, Jul 22, 2025 at 11:27:33AM +0000, Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> CPU0                            CPU1
>>>> mount perf_event                umount net_prio
>>>> cgroup1_get_tree                cgroup_kill_sb
>>>> rebind_subsystems               // root destruction enqueues
>>>> 				// cgroup_destroy_wq
>>>> // kill all perf_event css
>>>>                                 // one perf_event css A is dying
>>>>                                 // css A offline enqueues cgroup_destroy_wq
>>>>                                 // root destruction will be executed first
>>>>                                 css_free_rwork_fn
>>>>                                 cgroup_destroy_root
>>>>                                 cgroup_lock_and_drain_offline
>>>>                                 // some perf descendants are dying
>>>>                                 // cgroup_destroy_wq max_active = 1
>>>>                                 // waiting for css A to die
>>>>
>>>> Problem scenario:
>>>> 1. CPU0 mounts perf_event (rebind_subsystems)
>>>> 2. CPU1 unmounts net_prio (cgroup_kill_sb), queuing root destruction work
>>>> 3. A dying perf_event CSS gets queued for offline after root destruction
>>>> 4. Root destruction waits for offline completion, but offline work is
>>>>    blocked behind root destruction in cgroup_destroy_wq (max_active=1)
>>>
>>> What's concerning me is why umount of the net_prio hierarchy waits for
>>> draining of the default hierarchy? (Where you then run into conflict with
>>> perf_event that's implicit_on_dfl.)
>>>
>>
>> This was also my first response.
>>
>>> IOW why not this:
>>> --- a/kernel/cgroup/cgroup.c
>>> +++ b/kernel/cgroup/cgroup.c
>>> @@ -1346,7 +1346,7 @@ static void cgroup_destroy_root(struct cgroup_root *root)
>>>
>>>         trace_cgroup_destroy_root(root);
>>>
>>> -       cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>>> +       cgroup_lock_and_drain_offline(cgrp);
>>>
>>>         BUG_ON(atomic_read(&root->nr_cgrps));
>>>         BUG_ON(!list_empty(&cgrp->self.children));
>>>
>>> Does this correct the LTP scenario?
>>>
>>> Thanks,
>>> Michal
>>
>> I've tested this approach and discovered it can lead to another issue that required significant
>> investigation. This helped me understand why unmounting the net_prio hierarchy needs to wait for
>> draining of the default hierarchy.
>>
>> Consider this sequence:
>>
>> mount net_prio			umount perf_event
>> cgroup1_get_tree
>> // &cgrp_dfl_root.cgrp
>> cgroup_lock_and_drain_offline
>> // wait for all perf_event csses dead
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> 				cgroup_destroy_root
>> 				// &root->cgrp, not cgrp_dfl_root
>> 				cgroup_lock_and_drain_offline
> 								perf_event's css (offline but dying)
> 
>> 				rebind_subsystems
>> 				rcu_assign_pointer(dcgrp->subsys[ssid], css);
>> 				dst_root->subsys_mask |= 1 << ssid;
>> 				cgroup_propagate_control
>> 				// enable cgrp_dfl_root perf_event css
>> 				cgroup_apply_control_enable
>> 				css = cgroup_css(dsct, ss);
>> 				// since we drain root->cgrp not cgrp_dfl_root
>> 				// css(dying) is not null on the cgrp_dfl_root
>> 				// we won't create css, but the css is dying
> 
> 				What would prevent seeing a dying css when
> 				cgrp_dfl_root is drained?
> 				(Or nothing drained as in the patch?)

> 				I assume you've seen this warning from
> 				cgroup_apply_control_enable
> 				WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
>
> 
				WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt)); ?
				-- Yes.
				Draining cgrp_dfl_root prevents seeing the dying css.
				Q: When is a task that waits on offline_waitq woken up?
				A: When offline_css() is invoked, which also does:
				RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);

				If we drain cgrp_dfl_root, the drain traverses all the csses.
				That means cgroup_lock_and_drain_offline can only return once all
				the dying csses have disappeared, so a dying css is never seen.
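
To make the argument above concrete, here is a minimal userspace sketch of
the drain loop. It is a model only, not the kernel code: every model_* name
is hypothetical, and the sleep on offline_waitq is replaced by running the
offline step inline, which clears the subsys slot just as the
RCU_INIT_POINTER() above does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SUBSYS 2	/* hypothetical; stands in for the real subsystem count */

struct model_css {
	bool dying;		/* models percpu_ref_is_dying(&css->refcnt) */
};

struct model_cgroup {
	struct model_css *subsys[MODEL_NR_SUBSYS];
};

/* Models the slot clearing done at offline time. */
void model_offline_css(struct model_cgroup *cgrp, int ssid)
{
	cgrp->subsys[ssid] = NULL;
}

/* Returns true if any cgroup in the set still exposes a dying css. */
bool model_sees_dying(const struct model_cgroup *cgrps, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (int ssid = 0; ssid < MODEL_NR_SUBSYS; ssid++)
			if (cgrps[i].subsys[ssid] && cgrps[i].subsys[ssid]->dying)
				return true;
	return false;
}

/*
 * Models the drain: rescan from the start after every offline, and return
 * only when a full pass finds no dying css. Returns the number of restarts.
 */
int model_drain_offline(struct model_cgroup *cgrps, size_t n)
{
	int restarts = 0;

	assert(cgrps != NULL);
restart:
	for (size_t i = 0; i < n; i++) {
		for (int ssid = 0; ssid < MODEL_NR_SUBSYS; ssid++) {
			struct model_css *css = cgrps[i].subsys[ssid];

			if (!css || !css->dying)
				continue;
			/* The kernel would sleep here until offline completes. */
			model_offline_css(&cgrps[i], ssid);
			restarts++;
			goto restart;
		}
	}
	return restarts;
}
```

In this model, draining only the cgroups of some other root leaves a dying
css visible in the default root's set, while draining the default root's set
returns only after every dying slot has been cleared.
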
>> 								
>> // got the offline_waitq wake up
>> goto restart;
>> // some perf_event dying csses are online now
>> prepare_to_wait(&dsct->offline_waitq)
>> schedule();
>> // never get the offline_waitq wake up
>>
>> I encountered two main issues:
>> 1. Dying csses on cgrp_dfl_root may be brought back online when rebinding the subsystem to cgrp_dfl_root
> 
> Is this really resolved by the patch? (The questions above.)
> 
>> 2. Potential hangs during cgrp_dfl_root draining in the mounting process
> 
> Fortunately, the typical use case (mounting at boot) wouldn't suffer
> from this.
> 
>> I believe waiting for a wake-up in cgroup_destroy_wq is inherently risky, as it requires that the
>> offline css work (which cgroup_destroy_root needs to drain) is never enqueued after
>> cgroup_destroy_root begins.
> 
> This is a valid point.
> 
>> How can we guarantee this ordering? Therefore, I propose moving the draining operation
>> outside of cgroup_destroy_wq as a more robust solution that would completely eliminate this
>> potential race condition. This patch implements that approach.
> 
> I acknowledge the issue (although rare in the real world). Some entity will
> always have to wait for the offlining. It may be OK in cgroup_kill_sb
> (ideally, if this was bound to process context of umount caller, not
> sure if that's how kill_sb works).
> I slightly dislike the form of an empty lock/unlock -- which makes me
> wonder if this is the best solution.

Thank you, I’d appreciate it if you could suggest a better solution.
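
For reference, the ordering hazard discussed above can be sketched with a
small userspace model of a single-worker, in-order queue (max_active = 1).
This is only an illustration under stated assumptions: the model_* names and
the drain_in_work flag are hypothetical and do not exist in the kernel, and
"draining" is reduced to checking a counter of dying csses.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_work {
	MODEL_CSS_OFFLINE,	/* offlines one dying css */
	MODEL_ROOT_DESTROY,	/* root destruction work item */
};

/*
 * Run the queue strictly in order with one worker. If root destruction
 * drains inside the work item (drain_in_work) while dying csses remain,
 * the only worker sleeps and the offline work queued behind it never runs.
 *
 * Returns the index of the item where the worker stalls, or -1 if every
 * item completes.
 */
int model_run_ordered(const enum model_work *queue, size_t n,
		      int dying_csses, bool drain_in_work)
{
	for (size_t i = 0; i < n; i++) {
		switch (queue[i]) {
		case MODEL_CSS_OFFLINE:
			if (dying_csses > 0)
				dying_csses--;
			break;
		case MODEL_ROOT_DESTROY:
			if (drain_in_work && dying_csses > 0)
				return (int)i;	/* waits for work that is behind it */
			break;
		}
	}
	return -1;
}
```

With the queue { MODEL_ROOT_DESTROY, MODEL_CSS_OFFLINE } and one dying css,
the model stalls at index 0 when drain_in_work is true. It completes when the
two items are swapped, or when drain_in_work is false, which models moving
the drain out of the work item.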

Thanks,
Ridong

