From: Roman Gushchin <guro@fb.com>
To: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christian Brauner <brauner@kernel.org>, Tejun Heo <tj@kernel.org>,
Shakeel Butt <shakeelb@google.com>,
Zefan Li <lizefan.x@bytedance.com>,
Johannes Weiner <hannes@cmpxchg.org>, <cgroups@vger.kernel.org>,
<containers@lists.linux.dev>
Subject: Re: [PATCH v2 1/5] cgroup: introduce cgroup.kill
Date: Wed, 5 May 2021 12:13:48 -0700 [thread overview]
Message-ID: <YJLubKN2JmffZjLd@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <20210505184632.jvg54r75d5lkdhuf@wittgenstein>
On Wed, May 05, 2021 at 08:46:32PM +0200, Christian Brauner wrote:
> On Wed, May 05, 2021 at 10:57:00AM -0700, Roman Gushchin wrote:
> > On Mon, May 03, 2021 at 04:39:19PM +0200, Christian Brauner wrote:
> > > From: Christian Brauner <christian.brauner@ubuntu.com>
> > >
> > > Introduce the cgroup.kill file. It does what it says on the tin and
> > > allows a caller to kill a cgroup by writing "1" into cgroup.kill.
> > > The file is available in non-root cgroups.
> > >
> > > Killing cgroups is a process directed operation, i.e. the whole
> > > thread-group is affected. Consequently trying to write to cgroup.kill in
> > > threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
> > > aligns with cgroup.procs where reads in threaded-cgroups are rejected
> > > with EOPNOTSUPP.
> > >
> > > The cgroup.kill file is write-only since killing a cgroup is an event,
> > > not a state, which makes it different from e.g. freezer where a cgroup
> > > transitions between the two states.
> > >
> > > As with all new cgroup features cgroup.kill is recursive by default.
> > >
> > > Killing a cgroup is protected against concurrent migrations through the
> > > cgroup mutex. To protect against forkbombs and to mitigate the effect of
> > > racing forks, a new CGRP_KILL flag, protected by the css set lock, is
> > > introduced; it is set prior to killing a cgroup and cleared after the cgroup has
> > > been killed. We can then check in cgroup_post_fork() where we hold the
> > > css set lock already whether the cgroup is currently being killed. If so
> > > we send the child a SIGKILL signal immediately taking it down as soon as
> > > it returns to userspace. To make the killing of the child semantically
> > > clean it is killed after all cgroup attachment operations have been
> > > finalized.
> > >
> > > There are various use-cases of this interface:
> > > - Containers usually have a conservative layout where each container
> > > has its own delegated cgroup. For such layouts there is a 1:1
> > > mapping between container and cgroup. If the container in addition
> > > uses a separate pid namespace then killing a container usually becomes
> > > a simple kill -9 <container-init-pid> from an ancestor pid namespace.
> > > However, there are quite a few scenarios where that isn't true. For
> > > example, a container may deliberately share its cgroup with other
> > > processes that are supposed to be bound to the container's lifetime
> > > but live in a different pid namespace, or it may sit in a delegated
> > > cgroup while sharing the pid namespace with the host or other
> > > containers.
> > > - Service managers such as systemd use cgroups to group and organize
> > > processes belonging to a service. They currently rely on a recursive
> > > algorithm to kill a service. With cgroup.kill this becomes a
> > > simple write to cgroup.kill.
> > > - Userspace OOM implementations can make good use of this feature to
> > > efficiently take down whole cgroups quickly.
> > > - The kill program can gain a new
> > > kill --cgroup /sys/fs/cgroup/delegated
> > > flag to take down cgroups.
> > >
> > > A few observations about the semantics:
> > > - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
> > > not specified, we do not take the cgroup mutex, meaning the cgroup can
> > > be killed while a process in that cgroup is forking.
> > > If the kill request happens right before cgroup_can_fork() and before
> > > the parent grabs its siglock the parent is guaranteed to see the
> > > pending SIGKILL. In addition we perform another check in
> > > cgroup_post_fork() whether the cgroup is being killed and if so take
> > > down the child (see above). This is robust enough and protects against
> > > forkbombs. If userspace really wants stricter
> > > protection, the simple solution would be to grab the write side of the
> > > cgroup threadgroup rwsem which will force all ongoing forks to
> > > complete before killing starts. We concluded that this is not
> > > necessary as the semantics for concurrent forking should simply align
> > > with freezer, where a similar check is performed in cgroup_post_fork().
> > >
> > > For all other cases CLONE_INTO_CGROUP is required. In this case we
> > > will grab the cgroup mutex so the cgroup can't be killed while we
> > > fork. Once we're done with the fork and have dropped cgroup mutex we
> > > are visible and will be found by any subsequent kill request.
> > > - We obviously don't kill kthreads. This means a cgroup that has a
> > > kthread will not become empty after killing and consequently no
> > > unpopulated event will be generated. The assumption is that kthreads
> > > should be in the root cgroup only anyway so this is not an issue.
> > > - We skip killing tasks that already have pending fatal signals.
> > > - Freezer doesn't care about tasks in different pid namespaces, i.e. if
> > > you have two tasks in different pid namespaces the cgroup would still
> > > be frozen. The cgroup.kill mechanism consequently behaves the same
> > > way, i.e. we kill all processes and ignore in which pid namespace they
> > > exist.
> > > - If the caller is located in a cgroup that is killed the caller will
> > > obviously be killed as well.
> > >
> > > Cc: Shakeel Butt <shakeelb@google.com>
> > > Cc: Roman Gushchin <guro@fb.com>
> > > Cc: Tejun Heo <tj@kernel.org>
> > > Cc: cgroups@vger.kernel.org
> > > Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> > > ---
> > >
> > > The series can be pulled from
> > >
> > > git@gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux tags/cgroup.kill.v5.14
> > >
> > > /* v2 */
> > > - Roman Gushchin <guro@fb.com>:
> > > - Retrieve cgrp->flags only once and check CGRP_* bits on it.
> > > ---
> > > include/linux/cgroup-defs.h | 3 +
> > > kernel/cgroup/cgroup.c | 127 ++++++++++++++++++++++++++++++++----
> > > 2 files changed, 116 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> > > index 559ee05f86b2..43fef771009a 100644
> > > --- a/include/linux/cgroup-defs.h
> > > +++ b/include/linux/cgroup-defs.h
> > > @@ -71,6 +71,9 @@ enum {
> > >
> > > /* Cgroup is frozen. */
> > > CGRP_FROZEN,
> > > +
> > > + /* Control group has to be killed. */
> > > + CGRP_KILL,
> > > };
> > >
> > > /* cgroup_root->flags */
> > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > > index 9153b20e5cc6..aee84b99534a 100644
> > > --- a/kernel/cgroup/cgroup.c
> > > +++ b/kernel/cgroup/cgroup.c
> > > @@ -3654,6 +3654,80 @@ static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
> > > return nbytes;
> > > }
> > >
> > > +static void __cgroup_kill(struct cgroup *cgrp)
> > > +{
> > > + struct css_task_iter it;
> > > + struct task_struct *task;
> > > +
> > > + lockdep_assert_held(&cgroup_mutex);
> > > +
> > > + spin_lock_irq(&css_set_lock);
> > > + set_bit(CGRP_KILL, &cgrp->flags);
> > > + spin_unlock_irq(&css_set_lock);
> > > +
> > > + css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
> > > + while ((task = css_task_iter_next(&it))) {
> > > + /* Ignore kernel threads here. */
> > > + if (task->flags & PF_KTHREAD)
> > > + continue;
> > > +
> > > + /* Skip tasks that are already dying. */
> > > + if (__fatal_signal_pending(task))
> > > + continue;
> > > +
> > > + send_sig(SIGKILL, task, 0);
> > > + }
> > > + css_task_iter_end(&it);
> > > +
> > > + spin_lock_irq(&css_set_lock);
> > > + clear_bit(CGRP_KILL, &cgrp->flags);
> > > + spin_unlock_irq(&css_set_lock);
> > > +}
> > > +
> > > +static void cgroup_kill(struct cgroup *cgrp)
> > > +{
> > > + struct cgroup_subsys_state *css;
> > > + struct cgroup *dsct;
> > > +
> > > + lockdep_assert_held(&cgroup_mutex);
> > > +
> > > + cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
> > > + __cgroup_kill(dsct);
> > > +}
> > > +
> > > +static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
> > > + size_t nbytes, loff_t off)
> > > +{
> > > + ssize_t ret = 0;
> > > + int kill;
> > > + struct cgroup *cgrp;
> > > +
> > > + ret = kstrtoint(strstrip(buf), 0, &kill);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (kill != 1)
> > > + return -ERANGE;
> > > +
> > > + cgrp = cgroup_kn_lock_live(of->kn, false);
> > > + if (!cgrp)
> > > + return -ENOENT;
> > > +
> > > + /*
> > > + * Killing is a process directed operation, i.e. the whole thread-group
> > > + * is taken down so act like we do for cgroup.procs and only make this
> > > + * writable in non-threaded cgroups.
> > > + */
> > > + if (cgroup_is_threaded(cgrp))
> > > + ret = -EOPNOTSUPP;
> > > + else
> > > + cgroup_kill(cgrp);
> > > +
> > > + cgroup_kn_unlock(of->kn);
> > > +
> > > + return ret ?: nbytes;
> > > +}
> > > +
> > > static int cgroup_file_open(struct kernfs_open_file *of)
> > > {
> > > struct cftype *cft = of_cft(of);
> > > @@ -4846,6 +4920,11 @@ static struct cftype cgroup_base_files[] = {
> > > .seq_show = cgroup_freeze_show,
> > > .write = cgroup_freeze_write,
> > > },
> > > + {
> > > + .name = "cgroup.kill",
> > > + .flags = CFTYPE_NOT_ON_ROOT,
> > > + .write = cgroup_kill_write,
> > > + },
> > > {
> > > .name = "cpu.stat",
> > > .seq_show = cpu_stat_show,
> > > @@ -6077,6 +6156,8 @@ void cgroup_post_fork(struct task_struct *child,
> > > struct kernel_clone_args *kargs)
> > > __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
> > > {
> > > + unsigned long cgrp_flags = 0;
> > > + bool kill = false;
> > > struct cgroup_subsys *ss;
> > > struct css_set *cset;
> > > int i;
> > > @@ -6088,6 +6169,11 @@ void cgroup_post_fork(struct task_struct *child,
> > >
> > > /* init tasks are special, only link regular threads */
> > > if (likely(child->pid)) {
> > > + if (kargs->cgrp)
> > > + cgrp_flags = kargs->cgrp->flags;
> > > + else
> > > + cgrp_flags = cset->dfl_cgrp->flags;
> > > +
> > > WARN_ON_ONCE(!list_empty(&child->cg_list));
> > > cset->nr_tasks++;
> > > css_set_move_task(child, NULL, cset, false);
> > > @@ -6096,23 +6182,32 @@ void cgroup_post_fork(struct task_struct *child,
> > > cset = NULL;
> > > }
> > >
> > > - /*
> > > - * If the cgroup has to be frozen, the new task has too. Let's set
> > > - * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
> > > - * frozen state.
> > > - */
> > > - if (unlikely(cgroup_task_freeze(child))) {
> > > - spin_lock(&child->sighand->siglock);
> > > - WARN_ON_ONCE(child->frozen);
> > > - child->jobctl |= JOBCTL_TRAP_FREEZE;
> > > - spin_unlock(&child->sighand->siglock);
> > > + if (!(child->flags & PF_KTHREAD)) {
> > > + if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> > > + /*
> > > + * If the cgroup has to be frozen, the new task has
> > > + * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> > > + * get the task into the frozen state.
> > > + */
> > > + spin_lock(&child->sighand->siglock);
> > > + WARN_ON_ONCE(child->frozen);
> > > + child->jobctl |= JOBCTL_TRAP_FREEZE;
> > > + spin_unlock(&child->sighand->siglock);
> > > +
> > > + /*
> > > + * Calling cgroup_update_frozen() isn't required here,
> > > + * because it will be called anyway a bit later from
> > > + * do_freezer_trap(). So we avoid cgroup's transient
> > > + * switch from the frozen state and back.
> > > + */
> > > + }
> >
> > I think this part can be optimized a bit further:
> > 1) we don't need atomic test_bit() here
> > 2) all PF_KTHREAD, CGRP_FREEZE and CGRP_KILL cases are very unlikely
> >
> > So something like this could work (completely untested):
> >
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 0965b44ff425..f567ca69119d 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -6190,13 +6190,15 @@ void cgroup_post_fork(struct task_struct *child,
> > cset = NULL;
> > }
> >
> > - if (!(child->flags & PF_KTHREAD)) {
> > - if (test_bit(CGRP_FREEZE, &cgrp_flags)) {
> > - /*
> > - * If the cgroup has to be frozen, the new task has
> > - * too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
> > - * get the task into the frozen state.
> > - */
> > +
> > + if (unlikely(!(child->flags & PF_KTHREAD) &&
> > + cgrp_flags & (CGRP_FREEZE | CGRP_KILL))) {
>
> The unlikely might make sense.
>
> But hm, I'm not a fan of the CGRP_FREEZE and CGRP_KILL check without
> test_bit(). That seems a bit ugly. Especially since nowhere in
> kernel/cgroup.c are these bits checked without test_bit().
>
> Also, this wouldn't work afaict, at least not for all values, since
> CGRP_FREEZE and CGRP_KILL aren't flags, they're bits defined in an enum.
> (In contrast to cgroup_root->flags which are defined as flags _in an
> enum_.) So it seems they should really be used with test_bit(). Otherwise
> this would probably have to be something like
>
> if (unlikely(!(child->flags & PF_KTHREAD) &&
> (cgrp_flags & (BIT_ULL(CGRP_FREEZE) | BIT_ULL(CGRP_KILL))) {
> .
> .
> .
>
> which seems unreadable and makes the handling of these values
> inconsistent with the rest of cgroup.c.
> Note that the check for CGRP_FREEZE was the same before; it was just
> hidden in the helper.
> I really think we should just leave the test_bit() checks.
Ok, you're right that it was (hidden) test_bit() here previously, so I
can't blame your patch and should blame my original code instead :)
Idk how badly we need to optimize this place; maybe having two checks
is a fair price for simpler code. After all, we can optimize it later.
With this said:
Acked-by: Roman Gushchin <guro@fb.com>
Thank you!
Thread overview: 18+ messages
2021-05-03 14:39 [PATCH v2 1/5] cgroup: introduce cgroup.kill Christian Brauner
2021-05-03 14:39 ` [PATCH v2 2/5] docs/cgroup: add entry for cgroup.kill Christian Brauner
2021-05-05 16:29 ` Shakeel Butt
2021-05-03 14:39 ` [PATCH v2 3/5] tests/cgroup: use cgroup.kill in cg_killall() Christian Brauner
2021-05-05 16:42 ` Shakeel Butt
2021-05-03 14:39 ` [PATCH v2 4/5] tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() Christian Brauner
2021-05-05 16:43 ` Shakeel Butt
2021-05-03 14:39 ` [PATCH v2 5/5] tests/cgroup: test cgroup.kill Christian Brauner
2021-05-05 18:34 ` Shakeel Butt
2021-05-05 18:52 ` Christian Brauner
2021-05-03 17:18 ` [PATCH v2 1/5] cgroup: introduce cgroup.kill Shakeel Butt
2021-05-04 1:47 ` Serge E. Hallyn
2021-05-04 18:29 ` Shakeel Butt
2021-05-05 17:57 ` Roman Gushchin
2021-05-05 18:46 ` Christian Brauner
2021-05-05 19:13 ` Roman Gushchin [this message]
2021-05-05 18:31 ` Eric W. Biederman
2021-05-05 18:49 ` Christian Brauner