[PATCH v6 0/3] cgroups: add pids subsystem

public inbox for cgroups@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v6 0/3] cgroups: add pids subsystem
@ 2015-03-14  4:37 Aleksa Sarai
  2015-03-14  4:37 ` [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys Aleksa Sarai
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-14  4:37 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA, peterz-wEGCiKHe2LqWVfeAwA7xHQ
  Cc: richard-/L3Ra7n9ekc, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Aleksa Sarai

This is a revised version of the pids v5 patchset, modified to deal with
several stylistic and correctness issues put forward by Tejun Heo. The
main changes include:

* Add bitmask filtering of for_each_subsys() in the form of
  for_each_subsys_which().

* Revert all succeeded can_fork() callbacks if any one of them fails
  inside cgroup_can_fork().

* Move revert/reapply code to a reapply_fork() callback provided by each
  cgroup controller. It is only run if the association for the specific
  subsystem changed during the interlude between cgroup_can_fork() and
  cgroup_post_fork().

* Pass an opaque pointer between cgroup_*_fork(), which holds the
  "current" task css_set. The css_set is pinned by bumping the refcount
  in cgroup_can_fork() and later unpinned in
  cgroup_{cancel,post}_fork().

* A whole bunch of userland API and stylistic fixes.

Aleksa Sarai (3):
  cgroups: use bitmask to filter for_each_subsys
  cgroups: allow a cgroup subsystem to reject a fork
  cgroups: add a pids subsystem

 include/linux/cgroup.h        |  13 +-
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  11 ++
 kernel/Makefile               |   1 +
 kernel/cgroup.c               | 132 +++++++++++++++---
 kernel/cgroup_pids.c          | 301 ++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c                 |  18 ++-
 7 files changed, 459 insertions(+), 21 deletions(-)
 create mode 100644 kernel/cgroup_pids.c

-- 
2.3.2

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys
  2015-03-14  4:37 [PATCH v6 0/3] cgroups: add pids subsystem Aleksa Sarai
@ 2015-03-14  4:37 ` Aleksa Sarai
  2015-03-16 16:42   ` Tejun Heo
  2015-03-14  4:37 ` [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
  2015-03-14  4:37 ` [PATCH v6 3/3] cgroups: add a pids subsystem Aleksa Sarai
  2 siblings, 1 reply; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-14  4:37 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Add a new macro for_each_subsys_which that allows all enabled cgroup
subsystems to be filtered by a bitmask, such that mask & (1 << ssid)
determines if the subsystem is to be processed in the loop body (where
ssid is the unique id of the subsystem).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 kernel/cgroup.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 29a7b2c..d60107e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -175,7 +175,8 @@ static DEFINE_IDR(cgroup_hierarchy_idr);
  */
 static u64 css_serial_nr_next = 1;
 
-/* This flag indicates whether tasks in the fork and exit paths should
+/*
+ * This bitmask flag indicates whether tasks in the fork and exit paths should
  * check for fork/exit handlers to call. This avoids us having to do
  * extra work in the fork/exit path if none of the subsystems need to
  * be called.
@@ -409,6 +410,19 @@ static int notify_on_release(const struct cgroup *cgrp)
 	for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT &&		\
 	     (((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
 
+/**
+ * for_each_subsys_which - filter for_each_subsys with a bitmask
+ * @ss_mask: the bitmask
+ * @ss: the iteration cursor
+ * @ssid: the index of @ss, CGROUP_SUBSYS_COUNT after reaching the end
+ *
+ * The block will only run for cases where the ssid-th bit (1 << ssid) of
+ * mask is set to 1.
+ */
+#define for_each_subsys_which(ss_mask, ss, ssid) \
+	for_each_subsys((ss), (ssid)) \
+		if ((ss_mask) & (1 << (ssid)))
+
 /* iterate across the hierarchies */
 #define for_each_root(root)						\
 	list_for_each_entry((root), &cgroup_roots, root_list)
@@ -4932,7 +4946,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	 * init_css_set is in the subsystem's root cgroup. */
 	init_css_set.subsys[ss->id] = css;
 
-	need_forkexit_callback |= ss->fork || ss->exit;
+	need_forkexit_callback |= !!(ss->fork || ss->exit) << ss->id;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5239,11 +5253,9 @@ void cgroup_post_fork(struct task_struct *child)
 	 * css_set; otherwise, @child might change state between ->fork()
 	 * and addition to css_set.
 	 */
-	if (need_forkexit_callback) {
-		for_each_subsys(ss, i)
-			if (ss->fork)
-				ss->fork(child);
-	}
+	for_each_subsys_which(need_forkexit_callback, ss, i)
+		if (ss->fork)
+			ss->fork(child);
 }
 
 /**
@@ -5287,15 +5299,13 @@ void cgroup_exit(struct task_struct *tsk)
 	cset = task_css_set(tsk);
 	RCU_INIT_POINTER(tsk->cgroups, &init_css_set);
 
-	if (need_forkexit_callback) {
-		/* see cgroup_post_fork() for details */
-		for_each_subsys(ss, i) {
-			if (ss->exit) {
-				struct cgroup_subsys_state *old_css = cset->subsys[i];
-				struct cgroup_subsys_state *css = task_css(tsk, i);
+	/* see cgroup_post_fork() for details */
+	for_each_subsys_which(need_forkexit_callback, ss, i) {
+		if (ss->exit) {
+			struct cgroup_subsys_state *old_css = cset->subsys[i];
+			struct cgroup_subsys_state *css = task_css(tsk, i);
 
-				ss->exit(css, old_css, tsk);
-			}
+			ss->exit(css, old_css, tsk);
 		}
 	}
 
-- 
2.3.2

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys
  2015-03-14  4:37 ` [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys Aleksa Sarai
@ 2015-03-16 16:42   ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2015-03-16 16:42 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan, mingo, peterz, richard, fweisbec, linux-kernel, cgroups

Hello,

On Sat, Mar 14, 2015 at 03:37:13PM +1100, Aleksa Sarai wrote:
> -/* This flag indicates whether tasks in the fork and exit paths should
> +/*
> + * This bitmask flag indicates whether tasks in the fork and exit paths should

Please reflow the comment.  It's going over 80col.  Also, wouldn't it
be clearer if the comment actually stated that the bitmask is
subsystem bitmask?

> @@ -4932,7 +4946,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
>  	 * init_css_set is in the subsystem's root cgroup. */
>  	init_css_set.subsys[ss->id] = css;
>  
> -	need_forkexit_callback |= ss->fork || ss->exit;
> +	need_forkexit_callback |= !!(ss->fork || ss->exit) << ss->id;

I generally prefer (bool) casts to !!'s these days but this doesn't
really matter.  More importantly, shouldn't we be splitting fork and
exit masks so that we don't have to test the callback existence later?

> @@ -5239,11 +5253,9 @@ void cgroup_post_fork(struct task_struct *child)
>  	 * css_set; otherwise, @child might change state between ->fork()
>  	 * and addition to css_set.
>  	 */
> -	if (need_forkexit_callback) {
> -		for_each_subsys(ss, i)
> -			if (ss->fork)
> -				ss->fork(child);
> -	}
> +	for_each_subsys_which(need_forkexit_callback, ss, i)
> +		if (ss->fork)
> +			ss->fork(child);
>  }

IOW, we should be able to do

	for_each_subsys_which(subsys_has_fork, ss, i)
		ss->fork(child);

And ditto for exit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
  2015-03-14  4:37 [PATCH v6 0/3] cgroups: add pids subsystem Aleksa Sarai
  2015-03-14  4:37 ` [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys Aleksa Sarai
@ 2015-03-14  4:37 ` Aleksa Sarai
       [not found]   ` <1426307835-5893-3-git-send-email-cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>
  2015-03-14  4:37 ` [PATCH v6 3/3] cgroups: add a pids subsystem Aleksa Sarai
  2 siblings, 1 reply; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-14  4:37 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Add a new cgroup subsystem callback can_fork that conditionally
states whether or not the fork is accepted or rejected by a cgroup
policy.

In addition, add a cancel_fork callback so that if an error occurs later
in the forking process, any state modified by can_fork can be reverted.

In order to ensure that the fork charged the right hierarchy, save the
"current" css_set before doing ss->can_fork and compare it with the
"current" css_set that gets committed to the task *proper* in post_fork.
If they do not match, revert the can_fork's charging of the wrong
hierarchy and forcefully reapply it to the right hierarchy using the
reapply_fork callback. Since a changing "current" css_set in
copy_process indicates an organisation operation took place, we can
break the cgroup policy in this case.

This is in preparation for implementing the pids cgroup subsystem.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup.h | 13 ++++++-
 kernel/cgroup.c        | 92 ++++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/fork.c          | 18 ++++++++--
 3 files changed, 117 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..9592f6e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -32,7 +32,9 @@ struct cgroup;
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
 extern void cgroup_fork(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern int cgroup_can_fork(struct task_struct *p, void **state);
+extern void cgroup_cancel_fork(struct task_struct *p, void **state);
+extern void cgroup_post_fork(struct task_struct *p, void **old_state);
 extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
@@ -649,6 +651,13 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
+	int (*can_fork)(struct cgroup_subsys_state *css,
+			struct task_struct *task);
+	void (*cancel_fork)(struct cgroup_subsys_state *css,
+			    struct task_struct *task);
+	void (*reapply_fork)(struct cgroup_subsys_state *css,
+			     struct cgroup_subsys_state *old_css,
+			     struct task_struct *task);
 	void (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
@@ -948,6 +957,8 @@ struct cgroup_subsys_state;
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
+static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
+static inline void cgroup_cancel_fork(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
 static inline void cgroup_exit(struct task_struct *p) {}
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d60107e..1a77790 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -183,6 +183,10 @@ static u64 css_serial_nr_next = 1;
  */
 static int need_forkexit_callback __read_mostly;
 
+/* Ditto for the can_fork/cancel_fork/reapply_fork callbacks. */
+static int need_canfork_callback __read_mostly = 0,
+	   need_reapplyfork_callback __read_mostly = 0;
+
 static struct cftype cgroup_dfl_base_files[];
 static struct cftype cgroup_legacy_base_files[];
 
@@ -4947,6 +4951,8 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
 	init_css_set.subsys[ss->id] = css;
 
 	need_forkexit_callback |= !!(ss->fork || ss->exit) << ss->id;
+	need_canfork_callback |= !!(ss->can_fork || ss->cancel_fork) << ss->id;
+	need_reapplyfork_callback |= !!(ss->reapply_fork) << ss->id;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5200,6 +5206,66 @@ void cgroup_fork(struct task_struct *child)
 }
 
 /**
+ * cgroup_can_fork - called on a new task before the process is exposed.
+ * @child: the task in question.
+ *
+ * This calls the subsystem can_fork() callbacks. If the can_fork() callback
+ * returns an error, the fork aborts with that error code. This allows for
+ * a cgroup subsystem to conditionally allow or deny new forks.
+ */
+int cgroup_can_fork(struct task_struct *child, void **state)
+{
+	struct cgroup_subsys *ss;
+	struct css_set *cset;
+	int i, j, retval;
+
+	cset = task_css_set(current);
+	get_css_set(cset);
+	*state = cset;
+
+	for_each_subsys_which(need_canfork_callback, ss, i)
+		if (ss->can_fork) {
+			retval = ss->can_fork(cset->subsys[i], child);
+			if (retval)
+				goto out_revert;
+		}
+
+	return 0;
+
+out_revert:
+	for_each_subsys_which(need_canfork_callback, ss, j) {
+		if (j == i)
+			break;
+
+		if (ss->cancel_fork)
+			ss->cancel_fork(cset->subsys[i], child);
+	}
+
+	put_css_set(cset);
+	return retval;
+}
+
+/**
+ * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
+ * @child: the task in question
+ *
+ * This calls the cancel_fork() callbacks if a fork failed *after*
+ * cgroup_can_fork() succeded.
+ */
+void cgroup_cancel_fork(struct task_struct *child, void **state)
+{
+	struct cgroup_subsys *ss;
+	struct css_set *cset = *state;
+	int i;
+
+	for_each_subsys_which(need_canfork_callback, ss, i)
+		if (ss->cancel_fork)
+			ss->cancel_fork(cset->subsys[i], child);
+
+	put_css_set(cset);
+}
+
+/**
  * cgroup_post_fork - called on a new task after adding it to the task list
  * @child: the task in question
  *
@@ -5209,9 +5275,10 @@ void cgroup_fork(struct task_struct *child)
  * cgroup_task_iter_start() - to guarantee that the new task ends up on its
  * list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, void **old_state)
 {
 	struct cgroup_subsys *ss;
+	struct css_set *cset, *old_cset = *old_state;
 	int i;
 
 	/*
@@ -5235,9 +5302,8 @@ void cgroup_post_fork(struct task_struct *child)
 	 * in the init_css_set before cg_links is enabled and there's no
 	 * operation which transfers all tasks out of init_css_set.
 	 */
+	cset = old_cset;
 	if (use_task_css_set_links) {
-		struct css_set *cset;
-
 		down_write(&css_set_rwsem);
 		cset = task_css_set(current);
 		if (list_empty(&child->cg_list)) {
@@ -5249,6 +5315,24 @@ void cgroup_post_fork(struct task_struct *child)
 	}
 
 	/*
+	 * Deal with tasks that were migrated mid-fork. If the css_set
+	 * changed between can_fork() and post_fork() an organisation
+	 * operation has occurred, and we need to revert/reapply the
+	 * the can_fork().
+	 */
+	for_each_subsys_which(need_canfork_callback, ss, i) {
+		struct cgroup_subsys_state *css = cset->subsys[i],
+					   *old_css = old_cset->subsys[i];
+
+		/*
+		 * We only reapply for subsystems whose
+		 * association changed in the interim.
+		 */
+		if (old_css != css && ss->reapply_fork)
+			ss->reapply_fork(css, old_css, child);
+	}
+
+	/*
 	 * Call ss->fork().  This must happen after @child is linked on
 	 * css_set; otherwise, @child might change state between ->fork()
 	 * and addition to css_set.
@@ -5256,6 +5340,8 @@ void cgroup_post_fork(struct task_struct *child)
 	for_each_subsys_which(need_forkexit_callback, ss, i)
 		if (ss->fork)
 			ss->fork(child);
+
+	put_css_set(old_cset);
 }
 
 /**
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..c15ca74 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1196,6 +1196,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 {
 	int retval;
 	struct task_struct *p;
+	void *cfs;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1469,6 +1470,17 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->task_works = NULL;
 
 	/*
+	 * Ensure that the cgroup subsystem policies allow the new process to be
+	 * forked. If this fork is happening in an organization operation, then
+	 * this will not charge the correct css_set. This is fixed after
+	 * cgroup_post_fork() (when the css_set has been updated) by undoing
+	 * this operation and forcefully charging the correct css_set.
+	 */
+	retval = cgroup_can_fork(p, &cfs);
+	if (retval)
+		goto bad_fork_free_pid;
+
+	/*
 	 * Make it visible to the rest of the system, but dont wake it up yet.
 	 * Need tasklist lock for parent etc handling!
 	 */
@@ -1504,7 +1516,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		spin_unlock(&current->sighand->siglock);
 		write_unlock_irq(&tasklist_lock);
 		retval = -ERESTARTNOINTR;
-		goto bad_fork_free_pid;
+		goto bad_fork_cgroup_cancel;
 	}
 
 	if (likely(p->pid)) {
@@ -1546,7 +1558,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	write_unlock_irq(&tasklist_lock);
 
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, &cfs);
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
 	perf_event_fork(p);
@@ -1556,6 +1568,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	return p;
 
+bad_fork_cgroup_cancel:
+	cgroup_cancel_fork(p, &cfs);
 bad_fork_free_pid:
 	if (pid != &init_struct_pid)
 		free_pid(pid);
-- 
2.3.2

^ permalink raw reply related	[flat|nested] 12+ messages in thread

[parent not found: <1426307835-5893-3-git-send-email-cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>]

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
       [not found]   ` <1426307835-5893-3-git-send-email-cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>
@ 2015-03-16 16:53     ` Tejun Heo
       [not found]       ` <20150316165335.GC8353-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2015-03-16 16:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, Aleksa.

On Sat, Mar 14, 2015 at 03:37:14PM +1100, Aleksa Sarai wrote:
> +/* Ditto for the can_fork/cancel_fork/reapply_fork callbacks. */
> +static int need_canfork_callback __read_mostly = 0,
> +	   need_reapplyfork_callback __read_mostly = 0;

Please separate the two definitions.  Please consult
Documentation/CodingStyle and in general try not to be creative with
style.

> +int cgroup_can_fork(struct task_struct *child, void **state)
> +{
> +	struct cgroup_subsys *ss;
> +	struct css_set *cset;
> +	int i, j, retval;
> +
> +	cset = task_css_set(current);
> +	get_css_set(cset);

So, you stuck with css_set refcnting.  That's fine, but please do
finish discussing in the original thread instead of throwing out new
version without explaining why you reached a particular conclusion.
Did you explore the idea of passing an opaque pointer from can_fork to
post_fork?  If so, why did you choose this direction instead of that
one?

Also, what pins the cset between task_css_set() and get_css_set()?
task_css_set() is protected by RCU and the above should have triggered
RCU warning if the debug option is enabled.  Please always test with
the debug options enabled, especially the lockdep and RCU ones.

> +	*state = cset;

There's no reason for css_set pointer to be passed around as an opaque
pointer.  The type is fixed.  I suppose this is from me mentioning
void pointer in the previous thread but that was about the cpuset
can_fork() implementation returning its css pointer, not css_set.
Please conclude discussions before working on the following version.

> @@ -5249,6 +5315,24 @@ void cgroup_post_fork(struct task_struct *child)
>  	}
>  
>  	/*
> +	 * Deal with tasks that were migrated mid-fork. If the css_set
> +	 * changed between can_fork() and post_fork() an organisation
> +	 * operation has occurred, and we need to revert/reapply the
> +	 * the can_fork().
> +	 */
> +	for_each_subsys_which(need_canfork_callback, ss, i) {
> +		struct cgroup_subsys_state *css = cset->subsys[i],
> +					   *old_css = old_cset->subsys[i];
> +
> +		/*
> +		 * We only reapply for subsystems whose
> +		 * association changed in the interim.
> +		 */
> +		if (old_css != css && ss->reapply_fork)
> +			ss->reapply_fork(css, old_css, child);
> +	}

Do we really need a separate callback for this?  Can't we just add
@old_css param to ss->fork() which is NULL if it didn't change since
can_fork()?

> @@ -1196,6 +1196,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  {
>  	int retval;
>  	struct task_struct *p;
> +	void *cfs;

If we're gonna do css_set, there's no need to make this an opaque
pointer.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <20150316165335.GC8353-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>]

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
       [not found]       ` <20150316165335.GC8353-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2015-03-21 11:25         ` Aleksa Sarai
       [not found]           ` <CAOviyagGvm4z8eAWitR267U3br-60vVJ9A5MMtFENy9xt9zuLQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-03-26 14:38         ` Aleksa Sarai
  1 sibling, 1 reply; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-21 11:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	Frédéric Weisbecker,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Tejun,

>> +int cgroup_can_fork(struct task_struct *child, void **state)
>> +{
>> +     struct cgroup_subsys *ss;
>> +     struct css_set *cset;
>> +     int i, j, retval;
>> +
>> +     cset = task_css_set(current);
>> +     get_css_set(cset);
>
> So, you stuck with css_set refcnting.  That's fine, but please do
> finish discussing in the original thread instead of throwing out new
> version without explaining why you reached a particular conclusion.
> Did you explore the idea of passing an opaque pointer from can_fork to
> post_fork?  If so, why did you choose this direction instead of that
> one?

The reason for using css_set is for future-proofness. If we stick with just
passing around the css, then we'd be stuck with only working with one
particular cgroup subsystem (and it would also make the code unwieldy). I'll
post this to that thread too, but the main reason is that it would make the
generic callback pids-specific.

> Also, what pins the cset between task_css_set() and get_css_set()?
> task_css_set() is protected by RCU and the above should have triggered
> RCU warning if the debug option is enabled.  Please always test with
> the debug options enabled, especially the lockdep and RCU ones.

Debug was enabled AFAIK and I didn't get anything in `dmesg`. I've fixed it
anyway.

>> +     *state = cset;
>
> There's no reason for css_set pointer to be passed around as an opaque
> pointer.  The type is fixed.  I suppose this is from me mentioning
> void pointer in the previous thread but that was about the cpuset
> can_fork() implementation returning its css pointer, not css_set.
> Please conclude discussions before working on the following version.

Yes, I misunderstood what you wanted the opaque pointer to point to. But I had
thought of doing it like that before, and decided against it for the reasons
outlined above.

>> @@ -5249,6 +5315,24 @@ void cgroup_post_fork(struct task_struct *child)
>>       }
>>
>>       /*
>> +      * Deal with tasks that were migrated mid-fork. If the css_set
>> +      * changed between can_fork() and post_fork() an organisation
>> +      * operation has occurred, and we need to revert/reapply the
>> +      * the can_fork().
>> +      */
>> +     for_each_subsys_which(need_canfork_callback, ss, i) {
>> +             struct cgroup_subsys_state *css = cset->subsys[i],
>> +                                        *old_css = old_cset->subsys[i];
>> +
>> +             /*
>> +              * We only reapply for subsystems whose
>> +              * association changed in the interim.
>> +              */
>> +             if (old_css != css && ss->reapply_fork)
>> +                     ss->reapply_fork(css, old_css, child);
>> +     }
>
> Do we really need a separate callback for this?  Can't we just add
> @old_css param to ss->fork() which is NULL if it didn't change since
> can_fork()?

I feel that reapply_fork() being a special case of fork() (or vice versa) just
feels like a weird API. The two actions are fundamentally different -- one is
basically confirming that the fork went through and the other is alerting the
subsystem that the association changed during a fork.

--
Aleksa Sarai (cyphar)
www.cyphar.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAOviyagGvm4z8eAWitR267U3br-60vVJ9A5MMtFENy9xt9zuLQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
       [not found]           ` <CAOviyagGvm4z8eAWitR267U3br-60vVJ9A5MMtFENy9xt9zuLQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-26 15:08             ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2015-03-26 15:08 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	Frédéric Weisbecker,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Sat, Mar 21, 2015 at 10:25:56PM +1100, Aleksa Sarai wrote:
> > Also, what pins the cset between task_css_set() and get_css_set()?
> > task_css_set() is protected by RCU and the above should have triggered
> > RCU warning if the debug option is enabled.  Please always test with
> > the debug options enabled, especially the lockdep and RCU ones.
> 
> Debug was enabled AFAIK and I didn't get anything in `dmesg`. I've fixed it
> anyway.

So, this is worrying.  If you have RCU debugging enabled, it should
have triggered the warning.  If it hadn't, maybe there's something
else protecting it and we don't need the extra rcu locking around it.
Can you please investigate what's going on?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
       [not found]       ` <20150316165335.GC8353-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
  2015-03-21 11:25         ` Aleksa Sarai
@ 2015-03-26 14:38         ` Aleksa Sarai
       [not found]           ` <CAOviyahTq1K6zBRfmXfepabiQ7xU8ZoNdq0jPd3nP4Q-0H1DDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-26 14:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	Frédéric Weisbecker,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hi Tejun,

Here's a bit more of a follow-up on two of your points:

Since

> Do we really need a separate callback for this?  Can't we just add
> @old_css param to ss->fork() which is NULL if it didn't change since
> can_fork()?

and

> Did you explore the idea of passing an opaque pointer from can_fork to
> post_fork?  If so, why did you choose this direction instead of that
> one?

Are quite similar (they essentially both result in each other, since if you use
void pointers for state you can't check if the association changed without
doing completely separate accounting which wouldn't be used by the callbacks),
I'll respond to both in the same go:

The issue I can see with passing around an opaque pointer to fork() is that you
have a random private (void **) argument that is completely useless if you
don't use can_fork(). This is why I think we should call the reapply_fork()
callback if the association changes [we could call it something else if you
like, since reapply_fork() is a pids-specific name -- what about switch_fork(),
reassoc_fork(), re_fork() or something to show that it's a callback if the
association changes?] (the subsystem can decide if they want to ignore it / if
they don't want to touch it) and we deal with pinning / dropping the ref of the
css_set for the current task inside the cgroup_* callbacks. That way, we don't
start messing around with post-fork() callbacks that aren't related to the new
conditional stuff.

I mean, if you want to have a random, completely unused and essentially
vestigial argument to ss->fork() [if you don't use the new can_fork() callbacks
(and actually care about storing private data)] then I can just write that. It
just looks like a weird callback API imho.

In fact, it seems even more single purpose than just pinning the ref and
dealing with it for the general case of an association switching (because now
you have a single-purpose parameter to the general fork() call, as opposed to
doing some ref pinning and accounting of the current css_set for (currently)
only the pids subsystem in the cgroup core. Since we know  that pids will need
to pin the ref, then we're not doing any extra work in  doing it for the
general cgroup fork callback -- and we get to both separate "association
changed" logic from "post fork" logic in the callbacks, as well as not blindly
trusting the cgroup subsystem to properly deal with a changing association.

--
Aleksa Sarai (cyphar)
www.cyphar.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAOviyahTq1K6zBRfmXfepabiQ7xU8ZoNdq0jPd3nP4Q-0H1DDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
       [not found]           ` <CAOviyahTq1K6zBRfmXfepabiQ7xU8ZoNdq0jPd3nP4Q-0H1DDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-26 15:02             ` Tejun Heo
  2015-03-26 21:41               ` Aleksa Sarai
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2015-03-26 15:02 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	Frédéric Weisbecker,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hello,

On Fri, Mar 27, 2015 at 01:38:54AM +1100, Aleksa Sarai wrote:
> The issue I can see with passing around an opaque pointer to fork() is that you
> have a random private (void **) argument that is completely useless if you
> don't use can_fork(). This is why I think we should call the reapply_fork()

Just pass NULL?  I really don't like having another callback.  pre and
post do make sense because the operation is essentially a transaction.
The problem with adding additional callbacks is that they aren't
essential and as such arbitrary to a certain degree.  reapply_fork or
whatnot may fit this case but may not others, so let's please stick
with what the logic dictates to be essential.

> callback if the association changes [we could call it something else if you
> like, since reapply_fork() is a pids-specific name -- what about switch_fork(),
> reassoc_fork(), re_fork() or something to show that it's a callback if the
> association changes?] (the subsystem can decide if they want to ignore it / if
> they don't want to touch it) and we deal with pinning / dropping the ref of the
> css_set for the current task inside the cgroup_* callbacks. That way, we don't
> start messing around with post-fork() callbacks that aren't related to the new
> conditional stuff.

You can't pin css_set from inside cgroup callbacks.  It's a construct
which in general shouldn't be accessible outside cgroup core.

> I mean, if you want to have a random, completely unused and essentially
> vestigial argument to ss->fork() [if you don't use the new can_fork() callbacks
> (and actually care about storing private data)] then I can just write that. It
> just looks like a weird callback API imho.

It's an opaque token from pre.  If a subsys doesn't have pre, it's
NULL.  I don't see anything weird about that, so let's please go that
way.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
  2015-03-26 15:02             ` Tejun Heo
@ 2015-03-26 21:41               ` Aleksa Sarai
       [not found]                 ` <CAOviyah+5DFxYtUo1=LMdQLAte=YR1s4JYoH_DNZEfpoBUfSRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-26 21:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mingo, peterz, richard, Frédéric Weisbecker,
	linux-kernel, cgroups

Hi,

>> callback if the association changes [we could call it something else if you
>> like, since reapply_fork() is a pids-specific name -- what about switch_fork(),
>> reassoc_fork(), re_fork() or something to show that it's a callback if the
>> association changes?] (the subsystem can decide if they want to ignore it / if
>> they don't want to touch it) and we deal with pinning / dropping the ref of the
>> css_set for the current task inside the cgroup_* callbacks. That way, we don't
>> start messing around with post-fork() callbacks that aren't related to the new
>> conditional stuff.
>
> You can't pin css_set from inside cgroup callbacks.  It's a construct
> which in general shouldn't be accessible outside cgroup core.

Yeah, sorry I meant css (you aren't pinning, but you're still saving under RCU
and dealing with the association-related stuff inside post_fork).


>> I mean, if you want to have a random, completely unused and essentially
>> vestigial argument to ss->fork() [if you don't use the new can_fork() callbacks
>> (and actually care about storing private data)] then I can just write that. It
>> just looks like a weird callback API imho.
>
> It's an opaque token from pre.  If a subsys doesn't have pre, it's
> NULL.  I don't see anything weird about that, so let's please go that
> way.

Alright, if you think that's the best way. I still think it's weird, but I
guess that's probably just down to personal taste.


--
Aleksa Sarai (cyphar)
www.cyphar.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAOviyah+5DFxYtUo1=LMdQLAte=YR1s4JYoH_DNZEfpoBUfSRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork
       [not found]                 ` <CAOviyah+5DFxYtUo1=LMdQLAte=YR1s4JYoH_DNZEfpoBUfSRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-26 22:18                   ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2015-03-26 22:18 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, mingo-H+wXaHxf7aLQT0dZR+AlfA,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ, richard-/L3Ra7n9ekc,
	Frédéric Weisbecker,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Mar 27, 2015 at 08:41:05AM +1100, Aleksa Sarai wrote:
> > You can't pin css_set from inside cgroup callbacks.  It's a construct
> > which in general shouldn't be accessible outside cgroup core.
> 
> Yeah, sorry I meant css (you aren't pinning, but you're still saving under RCU
> and dealing with the association-related stuff inside post_fork).

How would that work without pinning?  What'd keep the css from going
away between pre and post?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 3/3] cgroups: add a pids subsystem
  2015-03-14  4:37 [PATCH v6 0/3] cgroups: add pids subsystem Aleksa Sarai
  2015-03-14  4:37 ` [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys Aleksa Sarai
  2015-03-14  4:37 ` [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
@ 2015-03-14  4:37 ` Aleksa Sarai
  2 siblings, 0 replies; 12+ messages in thread
From: Aleksa Sarai @ 2015-03-14  4:37 UTC (permalink / raw)
  To: tj, lizefan, mingo, peterz
  Cc: richard, fweisbec, linux-kernel, cgroups, Aleksa Sarai

Adds a new single-purpose pids subsystem to limit the number of
tasks that can be forked inside a cgroup. Essentially this is an
implementation of RLIMIT_NPROC that will applies to a cgroup rather than
a process tree.

However, it should be noted that organisational operations (adding and
removing tasks from a pids hierarchy) will *not* be prevented. Rather,
the number of tasks in the hierarchy cannot exceed the limit through
forking.

PIDs are fundamentally a global resource, and it is possible to reach
PID exhaustion inside a cgroup without hitting any reasonable kmemcg
policy. Once you've hit PID exhaustion, you're only in a marginally
better state than OOM. This subsystem allows PID exhaustion inside a
cgroup to be prevented.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |  11 ++
 kernel/Makefile               |   1 +
 kernel/cgroup_pids.c          | 301 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 317 insertions(+)
 create mode 100644 kernel/cgroup_pids.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..a198822 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_PIDS)
+SUBSYS(pids)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..88364c9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1054,6 +1054,17 @@ config CGROUP_HUGETLB
 	  control group is tracked in the third page lru pointer. This means
 	  that we cannot use the controller with huge page less than 3 pages.
 
+config CGROUP_PIDS
+	bool "Process number limiting on cgroups"
+	help
+	  This options enables the setting of process number limits in the scope
+	  of a cgroup. Any attempt to fork more processes than is allowed in the
+	  cgroup will fail. PIDs are fundamentally a global resource because it
+	  is fairly trivial to reach PID exhaustion before you reach even a
+	  conservative kmemcg limit. As a result, it is possible to grind a
+	  system to halt without being limited by other cgroup policies. The pids
+	  cgroup subsystem is designed to stop this from happening.
+
 config CGROUP_PERF
 	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
 	depends on PERF_EVENTS && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..e823592 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup.o
 obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
+obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
new file mode 100644
index 0000000..436e4c7
--- /dev/null
+++ b/kernel/cgroup_pids.c
@@ -0,0 +1,301 @@
+/*
+ * Process number limiting controller for cgroups.
+ *
+ * Used to allow a cgroup hierarchy to stop any new processes
+ * from fork()ing after a certain limit is reached.
+ *
+ * Since it is trivial to hit the task limit without hitting
+ * any kmemcg limits in place, PIDs are a fundamental resource.
+ * As such, PID exhaustion must be preventable in the scope of
+ * a cgroup hierarchy by allowing resource limiting of the
+ * number of tasks in a cgroup.
+ *
+ * In order to use the `pids` controller, set the maximum number
+ * of tasks in pids.max (this is not available in the root cgroup
+ * for obvious reasons). The number of processes currently
+ * in the cgroup is given by pids.current. Organisational operations
+ * are not blocked by cgroup policies, so it is possible to have
+ * pids.current > pids.max. However, fork()s will still not work.
+ *
+ * To set a cgroup to have no limit, set pids.max to "max". fork()
+ * will return -EBUSY if forking would cause a cgroup policy to be
+ * violated.
+ *
+ * pids.current tracks all child cgroup hierarchies, so
+ * parent/pids.current is a superset of parent/child/pids.current.
+ *
+ * Copyright (C) 2015 Aleksa Sarai <cyphar@cyphar.com>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+
+#define PIDS_MAX (PID_MAX_LIMIT + 1ULL)
+#define PIDS_MAX_STR "max"
+
+struct pids_cgroup {
+	struct cgroup_subsys_state	css;
+
+	/*
+	 * Use 64-bit types so that we can safely represent "max" as
+	 * (PID_MAX_LIMIT + 1).
+	 */
+	atomic64_t			counter;
+	int64_t				limit;
+};
+
+static struct pids_cgroup *css_pids(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct pids_cgroup, css);
+}
+
+static struct pids_cgroup *parent_pids(struct pids_cgroup *pids)
+{
+	return css_pids(pids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+pids_css_alloc(struct cgroup_subsys_state *parent)
+{
+	struct pids_cgroup *pids;
+
+	pids = kzalloc(sizeof(struct pids_cgroup), GFP_KERNEL);
+	if (!pids)
+		return ERR_PTR(-ENOMEM);
+
+	pids->limit = PIDS_MAX;
+	atomic64_set(&pids->counter, 0);
+	return &pids->css;
+}
+
+static void pids_css_free(struct cgroup_subsys_state *css)
+{
+	kfree(css_pids(css));
+}
+
+/**
+ * pids_cancel - uncharge the local pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to cancel
+ *
+ * This function will WARN if the pid count goes under 0,
+ * because such a case is a bug in the pids controller proper.
+ */
+static void pids_cancel(struct pids_cgroup *pids, int num)
+{
+	/*
+	 * A negative count (or overflow for that matter) is invalid,
+	 * and indicates a bug in the pids controller proper.
+	 */
+	WARN_ON_ONCE(atomic64_add_negative(-num, &pids->counter));
+}
+
+/**
+ * pids_uncharge - hierarchically uncharge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to uncharge
+ */
+static void pids_uncharge(struct pids_cgroup *pids, int num)
+{
+	struct pids_cgroup *p;
+
+	for (p = pids; p; p = parent_pids(p))
+		pids_cancel(p, num);
+}
+
+/**
+ * pids_charge - hierarchically charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function does *not* follow the pid limit set. It cannot
+ * fail and the new pid count may exceed the limit, because
+ * organisational operations cannot fail in the unified hierarchy.
+ */
+static void pids_charge(struct pids_cgroup *pids, int num)
+{
+	struct pids_cgroup *p;
+
+	for (p = pids; p; p = parent_pids(p))
+		atomic64_add(num, &p->counter);
+}
+
+/**
+ * pids_try_charge - hierarchically try to charge the pid count
+ * @pids: the pid cgroup state
+ * @num: the number of pids to charge
+ *
+ * This function follows the set limit. It will fail if the charge
+ * would cause the new value to exceed the hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN.
+ */
+static int pids_try_charge(struct pids_cgroup *pids, int num)
+{
+	struct pids_cgroup *p, *q;
+
+	for (p = pids; p; p = parent_pids(p)) {
+		int64_t new = atomic64_add_return(num, &p->counter);
+
+		if (new > p->limit)
+			goto revert;
+	}
+
+	return 0;
+
+revert:
+	for (q = pids; q != p; q = parent_pids(q))
+		pids_cancel(q, num);
+	pids_cancel(p, num);
+
+	return -EAGAIN;
+}
+
+static int pids_can_attach(struct cgroup_subsys_state *css,
+			   struct cgroup_taskset *tset)
+{
+	struct pids_cgroup *pids = css_pids(css);
+	struct task_struct *task;
+	int64_t num = 0;
+
+	cgroup_taskset_for_each(task, tset)
+		num++;
+
+	/*
+	 * Attaching to a cgroup is allowed to overcome the
+	 * the PID limit, so that organisation operations aren't
+	 * blocked by the `pids` cgroup controller.
+	 */
+	pids_charge(pids, num);
+	return 0;
+}
+
+static void pids_cancel_attach(struct cgroup_subsys_state *css,
+			       struct cgroup_taskset *tset)
+{
+	struct pids_cgroup *pids = css_pids(css);
+	struct task_struct *task;
+	int64_t num = 0;
+
+	cgroup_taskset_for_each(task, tset)
+		num++;
+
+	pids_uncharge(pids, num);
+}
+
+static int pids_can_fork(struct cgroup_subsys_state *css,
+			 struct task_struct *task)
+{
+	struct pids_cgroup *pids = css_pids(css);
+
+	return pids_try_charge(pids, 1);
+}
+
+static void pids_cancel_fork(struct cgroup_subsys_state *css,
+			     struct task_struct *task)
+{
+	struct pids_cgroup *pids = css_pids(css);
+
+	pids_uncharge(pids, 1);
+}
+
+static void pids_reapply_fork(struct cgroup_subsys_state *css,
+			      struct cgroup_subsys_state *old_css,
+			      struct task_struct *task)
+{
+	struct pids_cgroup *pids = css_pids(css),
+			   *old_pids = css_pids(old_css);
+
+	pids_uncharge(old_pids, 1);
+	pids_charge(pids, 1);
+}
+
+static void pids_exit(struct cgroup_subsys_state *css,
+		      struct cgroup_subsys_state *old_css,
+		      struct task_struct *task)
+{
+	struct pids_cgroup *pids = css_pids(old_css);
+
+	pids_uncharge(pids, 1);
+}
+
+static ssize_t pids_max_write(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct pids_cgroup *pids = css_pids(css);
+	int64_t limit;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, PIDS_MAX_STR)) {
+		pids->limit = PIDS_MAX;
+		return nbytes;
+	}
+
+	err = kstrtoll(buf, 0, &limit);
+	if (err)
+		return err;
+
+	if (limit < 0 || limit > INT_MAX)
+		return -EINVAL;
+
+	/*
+	 * Limit updates don't need to be mutex'd, since it isn't
+	 * critical that any racing fork()s follow the new limit.
+	 */
+	pids->limit = limit;
+	return nbytes;
+}
+
+static int pids_max_show(struct seq_file *sf, void *v)
+{
+	struct cgroup_subsys_state *css = seq_css(sf);
+	struct pids_cgroup *pids = css_pids(css);
+	int64_t limit = pids->limit;
+
+	if (limit == PIDS_MAX)
+		seq_printf(sf, "%s\n", PIDS_MAX_STR);
+	else
+		seq_printf(sf, "%lld\n", limit);
+
+	return 0;
+}
+
+static s64 pids_current_read(struct cgroup_subsys_state *css,
+			     struct cftype *cft)
+{
+	struct pids_cgroup *pids = css_pids(css);
+
+	return atomic64_read(&pids->counter);
+}
+
+static struct cftype files[] = {
+	{
+		.name = "max",
+		.write = pids_max_write,
+		.seq_show = pids_max_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+	{
+		.name = "current",
+		.read_s64 = pids_current_read,
+	},
+	{ }	/* terminate */
+};
+
+struct cgroup_subsys pids_cgrp_subsys = {
+	.css_alloc	= pids_css_alloc,
+	.css_free	= pids_css_free,
+	.can_attach	= pids_can_attach,
+	.cancel_attach	= pids_cancel_attach,
+	.can_fork	= pids_can_fork,
+	.cancel_fork	= pids_cancel_fork,
+	.reapply_fork	= pids_reapply_fork,
+	.exit		= pids_exit,
+	.legacy_cftypes	= files,
+	.early_init	= 0,
+};
-- 
2.3.2

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-03-26 22:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-14  4:37 [PATCH v6 0/3] cgroups: add pids subsystem Aleksa Sarai
2015-03-14  4:37 ` [PATCH v6 1/3] cgroups: use bitmask to filter for_each_subsys Aleksa Sarai
2015-03-16 16:42   ` Tejun Heo
2015-03-14  4:37 ` [PATCH v6 2/3] cgroups: allow a cgroup subsystem to reject a fork Aleksa Sarai
     [not found]   ` <1426307835-5893-3-git-send-email-cyphar-gVpy/LI/lHzQT0dZR+AlfA@public.gmane.org>
2015-03-16 16:53     ` Tejun Heo
     [not found]       ` <20150316165335.GC8353-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2015-03-21 11:25         ` Aleksa Sarai
     [not found]           ` <CAOviyagGvm4z8eAWitR267U3br-60vVJ9A5MMtFENy9xt9zuLQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-26 15:08             ` Tejun Heo
2015-03-26 14:38         ` Aleksa Sarai
     [not found]           ` <CAOviyahTq1K6zBRfmXfepabiQ7xU8ZoNdq0jPd3nP4Q-0H1DDQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-26 15:02             ` Tejun Heo
2015-03-26 21:41               ` Aleksa Sarai
     [not found]                 ` <CAOviyah+5DFxYtUo1=LMdQLAte=YR1s4JYoH_DNZEfpoBUfSRw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-26 22:18                   ` Tejun Heo
2015-03-14  4:37 ` [PATCH v6 3/3] cgroups: add a pids subsystem Aleksa Sarai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox