From: Bui Quang Minh
Subject: Re: [PATCH v2] cgroup: Kill the parent controller when its last child is killed
Date: Tue, 5 Apr 2022 21:58:01 +0700
In-Reply-To: <20220405091158.GA13806@blackbody.suse.cz>
References: <20220404142535.145975-1-minhquangbui99@gmail.com> <20220405091158.GA13806@blackbody.suse.cz>
To: Michal Koutný, Tejun Heo
Cc: cgroups@vger.kernel.org, kernel test robot, Zefan Li, Johannes Weiner, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend, KP Singh, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org

On 4/5/22 16:11, Michal Koutný wrote:
> On Mon, Apr 04, 2022 at 07:37:24AM -1000, Tejun Heo wrote:
>> And the suggested behavior doesn't make much sense to me. It doesn't
>> actually solve the underlying problem but instead always makes css
>> destructions recursive, which can lead to surprises for normal use cases.
>
> I also don't like the nested special-case use of percpu_ref_kill().

After thinking about it more carefully, I agree with your points. Not
only does the recursive css destruction fail to fix up the previous
parents' metadata correctly, it is also not a desirable behavior.

> I looked at this and my supposed solution turned out to be a revert of
> commit 3c606d35fe97 ("cgroup: prevent mount hang due to memory
> controller lifetime"). So at unmount time it's necessary to distinguish
> children that are in the process of removal from children that are
> online or pinned indefinitely.
>
> What about:
>
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2205,11 +2205,14 @@ static void cgroup_kill_sb(struct super_block *sb)
>  	struct cgroup_root *root = cgroup_root_from_kf(kf_root);
>
>  	/*
> -	 * If @root doesn't have any children, start killing it.
> +	 * If @root doesn't have any children held by residual state (e.g.
> +	 * memory controller), start killing it, flush workqueue to filter out
> +	 * transiently offlined children.
>  	 * This prevents new mounts by disabling percpu_ref_tryget_live().
>  	 *
>  	 * And don't kill the default root.
>  	 */
> +	flush_workqueue(cgroup_destroy_wq);
>  	if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
>  	    !percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
>  		cgroup_bpf_offline(&root->cgrp);
>
> (I suspect there's technically still a possible race between concurrent
> unmount and the last rmdir, but the flush on the kill_sb path should be
> affordable and it prevents unnecessarily conserved cgroup roots.)

Your proposed solution looks good to me.
As in my example, the flush will guarantee that the rmdir and its
deferred work have been executed before the umount path starts cleaning
up. But what do you think about

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index f01ff231a484..5578ee76e789 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2215,6 +2215,7 @@ static void cgroup_kill_sb(struct super_block *sb)
 		cgroup_bpf_offline(&root->cgrp);
 		percpu_ref_kill(&root->cgrp.self.refcnt);
 	}
+	root->cgrp.flags |= CGRP_UMOUNT;
 	cgroup_put(&root->cgrp);
 	kernfs_kill_sb(sb);
 }
@@ -5152,12 +5153,28 @@ static void css_release_work_fn(struct work_struct *work)
 		container_of(work, struct cgroup_subsys_state, destroy_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
+	struct cgroup *parent = cgroup_parent(cgrp);
 
 	mutex_lock(&cgroup_mutex);
 
 	css->flags |= CSS_RELEASED;
 	list_del_rcu(&css->sibling);
 
+	/*
+	 * If parent doesn't have any children, start killing it.
+	 * And don't kill the default root.
+	 */
+	if (parent && list_empty(&parent->self.children) &&
+	    parent->flags & CGRP_UMOUNT &&
+	    parent != &cgrp_dfl_root.cgrp &&
+	    !percpu_ref_is_dying(&parent->self.refcnt)) {
+#ifdef CONFIG_CGROUP_BPF
+		if (!percpu_ref_is_dying(&cgrp->bpf.refcnt))
+			cgroup_bpf_offline(parent);
+#endif
+		percpu_ref_kill(&parent->self.refcnt);
+	}
+
 	if (ss) {
 		/* css release path */
 		if (!list_empty(&css->rstat_css_node)) {

The idea is to set a flag in the umount path; on rmdir, the css release
then kills its direct parent if that parent has been umounted, with no
recursion involved. This is just an incomplete example; we may also need
to reset that flag when remounting (a rough sketch of that is below).

Thanks,
Quang Minh.
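
P.S. To make the remount part concrete, the reset could look roughly
like the helper below. This is only an untested sketch: the helper name
is made up, CGRP_UMOUNT is the flag proposed above, and calling it from
the (re)mount path, e.g. around cgroup_do_get_tree(), is an assumption
I have not verified.

/*
 * Untested sketch: clear the proposed CGRP_UMOUNT flag once the root is
 * reused by a new mount, so that a later rmdir no longer kills it.
 * Intended to be called with cgroup_mutex held from the (re)mount path.
 */
static void cgroup_clear_umount_flag(struct cgroup_root *root)
{
	lockdep_assert_held(&cgroup_mutex);
	root->cgrp.flags &= ~CGRP_UMOUNT;
}

The exact hook point matters less than the rule: the flag has to be
cleared wherever an existing root gets picked up again for a new mount,
otherwise a later rmdir could kill a root that is back in use.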