[RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Frederic Weisbecker <fweisbec@gmail.com>
To: Kay Sievers <kay.sievers@vrfy.org>,
	Paul Menage <paul@paulmenage.org>, Li Zefan <lizf@cn.fujitsu.com>
Cc: Tim Hockin <thockin@hockin.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Aditya Kali <adityakali@google.com>,
	Oleg Nesterov <oleg@redhat.com>
Subject: [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem)
Date: Thu, 18 Aug 2011 16:33:23 +0200	[thread overview]
Message-ID: <20110818143319.GC10441@somewhere> (raw)
In-Reply-To: <CAPXgP11aUfdkq-SdaHc8=tfjA4W7DYXjSXG-asoZMi6hK4iOPQ@mail.gmail.com>

On Tue, Aug 16, 2011 at 06:01:48PM +0200, Kay Sievers wrote:
> On Fri, Aug 12, 2011 at 23:11, Tim Hockin <thockin@hockin.org> wrote:
> > On Mon, Aug 1, 2011 at 4:19 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> >> On Fri, 29 Jul 2011 18:13:22 +0200
> >> Frederic Weisbecker <fweisbec@gmail.com> wrote:
> >>
> >>> Reminder:
> >>>
> >>> This patchset is aimed at reducing the impact of a forkbomb to a
> >>> cgroup boundaries, thus minimizing the consequences of such an attack
> >>> against the rest of the system.
> >>>
> >>> This can be useful when cgroups are used to stage some processes or run
> >>> untrustees.
> >>
> >> Really?  How useful?  Why is it useful enough to justify adding code
> >> such as this to the kernel?
> >>
> >> Is forkbomb-prevention the only use?  Others have proposed different
> >> ways of preventing forkbombs which were independent of cgroups - is
> >> this way better and if so, why?
> >
> > I certainly want this for exactly the proposed use - putting a bounds
> > on threads/tasks per container.  It's rlimits but more useful.
> >
> > IMHO, most every limit that can be set at a system level should be
> > settable at a cgroup level.  This is just one more isolation leak.
> 
> Such functionality in general sounds useful. System management tools
> want to be able to race-free stop a service. A 'service' in the sense
> of 'a group of processes and all the future processes it creates'.

Some background here: we got an offlist discussion where we debated
about how to safely kill all tasks in a cgroup in a race-free way.
This is also needed for containers. So that's how we found a secondary
purpose of this task counter subsystem. Setting the value 0 to tasks.limit
file would reject any future fork on the cgroup, making the whole group
of task killable without worrying against concurrent fork, which otherwise
might induce an unbounded number of iterations.

So there are now two possible uses of that task counter subsystem:

- protection against fork bombs in a container
- allow race free killing of a cgroup

And this secondary purpose is also potentially useful for systemd:

> A common problem here are user sessions that a logins creates. For
> some systems it is required, that after logout of the user, all
> processes the user has started are properly cleaned up. Common example
> for such enforcements are servers at schools universities that do not
> want to allow users to leave things like file sharing programs running
> in the background after they log out.
> 
> We currently do that in systemd by tracking these session in a cgroup
> and kill all pids in that group. This currently requires some
> cooperation of the services to be successful. If they would fork
> faster than we kill them, we would never be able to finish the task.
> 
> Such user sessions are generally untrusted code and processes, and the
> system management that cleans up after the end of the session runs
> privileged. It would be nice, to be allow trusted code to race-free
> kill all remaining processes of such an untrusted session. This is not
> so much about fork-bombs, things might not even have bad things in
> mind, this would be more like a rlimit for a 'group of pids', that
> allows race-free resource management of the services.
> 
> For the actual implementation, I think it would be nicer to use to
> have such functionality at the core of cgroups, and not require a
> specific controller to be set up. We already track every single
> service in its own cgroup in a custom hierarchy. These groups just act
> as the container for all the pids belonging to the service, so we can
> track the service properly.
> 
> Naively looking at it as a user of it, we would like to be able to
> apply these limits for every cgroup right away, not needing to create
> another controller/subsystem/hierarchy.

So the problem with the task counter as a subsystem is that you could
mount it in your systemd cgroups hierarchy but then it's not anymore
available for those who want to use containers.

It would be indeed handy to have that task counter as a cgroup core
feature so that it's usable on any hierarchy. Also it allows to
safely kill all tasks in a cgroup, and that sounds like something
that should be a cgroup core feature.

Now as a counter argument, bringing this at the cgroup core level would
bring some more overhead and complication. It implies to iterate,
on fork and exit, though all cgroups the task belongs to in every
hierachies and then charge/uncharge through all ancestors of these
cgroups.
With the subsystem, we only iterate through one cgroup and its
ancestor.

Now there are alternate ways to solve your issue. One could be
to mount a /sys/kernel/cgroups/task_counter point where anybody
interested in task counter features can use that. And systemd
could move all its task gathering there (without maintaining
a secondary mountpoint).

The other way is to use the cgroup freezer to kill your tasks.
Now I'm not aware of the overhead it implies.

next prev parent reply	other threads:[~2011-08-18 14:33 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-29 16:13 [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 1/8] cgroups: Add res_counter_write_u64() API Frederic Weisbecker
2011-08-09 15:17   ` Oleg Nesterov
2011-08-09 17:31     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 2/8] cgroups: New resource counter inheritance API Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 3/8] cgroups: Add previous cgroup in can_attach_task/attach_task callbacks Frederic Weisbecker
2011-08-17  2:40   ` Li Zefan
2011-08-27 13:58     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 4/8] cgroups: New cancel_attach_task subsystem callback Frederic Weisbecker
2011-08-17  2:40   ` Li Zefan
2011-08-27 13:58     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 5/8] cgroups: Ability to stop res charge propagation on bounded ancestor Frederic Weisbecker
2011-08-17  2:41   ` Li Zefan
2011-08-27 13:59     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 6/8] cgroups: Add res counter common ancestor searching Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 7/8] cgroups: Add a task counter subsystem Frederic Weisbecker
2011-08-01 23:13   ` Andrew Morton
2011-08-04 14:05     ` Frederic Weisbecker
2011-08-09 15:11   ` Oleg Nesterov
2011-08-09 17:27     ` Frederic Weisbecker
2011-08-09 17:57       ` Oleg Nesterov
2011-08-09 18:09         ` Frederic Weisbecker
2011-08-09 18:19           ` Oleg Nesterov
2011-08-09 18:34             ` Frederic Weisbecker
2011-08-09 18:39               ` Oleg Nesterov
2011-08-17  3:18   ` Li Zefan
2011-08-27 14:16     ` Frederic Weisbecker
2011-07-29 16:13 ` [PATCH 8/8] res_counter: Allow charge failure pointer to be null Frederic Weisbecker
2011-08-17  2:44   ` Li Zefan
2011-08-27 14:05     ` Frederic Weisbecker
2011-08-01 23:19 ` [PATCH 0/8 v3] cgroups: Task counter subsystem (was: New max number of tasks subsystem) Andrew Morton
2011-08-03 14:29   ` Frederic Weisbecker
2011-08-12 21:11   ` Tim Hockin
2011-08-16 16:01     ` Kay Sievers
2011-08-18 14:33       ` Frederic Weisbecker [this message]
2011-08-23 16:07         ` [RFD] Task counter: cgroup core feature or cgroup subsystem? (was Re: [PATCH 0/8 v3] cgroups: Task counter subsystem) Paul Menage
2011-08-24 17:54           ` Frederic Weisbecker
2011-08-26  7:28             ` Li Zefan
2011-08-26 14:58               ` Paul Menage
2011-09-06  9:06                 ` Li Zefan
2011-08-26 15:16             ` Paul Menage
2011-08-27 13:40               ` Frederic Weisbecker
2011-08-31 22:36                 ` Paul Menage
2011-08-31 21:54               ` Frederic Weisbecker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110818143319.GC10441@somewhere \
    --to=fweisbec@gmail.com \
    --cc=adityakali@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=kay.sievers@vrfy.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizf@cn.fujitsu.com \
    --cc=oleg@redhat.com \
    --cc=paul@paulmenage.org \
    --cc=thockin@hockin.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox