Linux Container Development
 help / color / mirror / Atom feed
* [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API
@ 2013-12-12 21:35 Dwight Engen
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
       [not found] ` <20131223125410.GA585@localhost.localdomain>
  0 siblings, 2 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

Hello,

I've seen that some sort of fork/task limiter has been proposed and
discussed several times before. Despite the existance of kmem in memcg, a
fork limiter is still often asked for by container users. Perhaps this is
due to current granularity in kmem (ie. stack/struct task not split out from
other slab allocations) but I believe it is just more natural for users to
express a limit in terms of tasks.

So what I've done is updated Frederic Weisbecker's task counter patchset and
tried to address the concerns that I saw people had raised. This involved
the following changes:

- merged into cpuacct controller, as it seems there is a desire not to add
  new controllers, this controller is already heirarchical, and I feel
  limiting number of tasks/forks fits best here
- included a fork_limit similar to the one Max Kellermann posted
  (https://lkml.org/lkml/2011/2/17/116) which is a use case not addressable
  with memcg
- ala memcg kmem, for performance reasons don't account unless limit is set
- ala memcg, allow usage to roll up to root (prevents warnings on
  uncharge), but still don't allow setting limits in root
- changed the interface at fork()/exit(), adding
  can_fork()/cancel_can_fork() modeled on can_attach(). cgroup_fork()
  can now return failure to fork().
- ran Frederics selftests, and added a couple more

I also wrote a small fork micro benchmark to see how this change affected
performance. I did 20 runs of 100000 fork/exit/waitpid, and took the
average. Times are in seconds, base is without the change, cpuacct1 is with
the change but no accounting be done (ie. no limit set), and cpuacct2 is
with the test being in a cgroup 1 level deep.

base  cpuacct1  cpuaac2
5.59  5.59      5.64

So I believe this change has minimal performance impact, especially when no
limit is set.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/9] cgroups: add res_counter_write_u64() API
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 2/9] cgroups: new resource counter inheritance API Dwight Engen
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Extend the resource counter API with a mirror of res_counter_read_u64() to
make it handy to update a resource counter value from a cgroup subsystem
u64 value file.

[kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org: Separate 32 bits and 64 bits versions]

Signed-off-by: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Acked-by: Paul Menage <paul-inf54ven1CmVyaH7bEyXVA@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Tim Hockin <thockin-Rl2oBbRerpQdnm+yROfE0A@public.gmane.org>
Cc: Tejun Heo <htejun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Containers <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Glauber Costa <glommer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Daniel J Walsh <dwalsh-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Max Kellermann <mk-xMchvyqCc6DQT0dZR+AlfA@public.gmane.org>
Cc: Mandeep Singh Baines <msb-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Signed-off-by: Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
---
 include/linux/res_counter.h |  2 ++
 kernel/res_counter.c        | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 201a697..eb78440 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -78,6 +78,8 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
 int res_counter_memparse_write_strategy(const char *buf,
 					unsigned long long *res);
 
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 4aa8a30..673a7a6 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -170,11 +170,25 @@ u64 res_counter_read_u64(struct res_counter *counter, int member)
 
 	return ret;
 }
+
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&counter->lock, flags);
+	*res_counter_member(counter, member) = val;
+	spin_unlock_irqrestore(&counter->lock, flags);
+}
 #else
 u64 res_counter_read_u64(struct res_counter *counter, int member)
 {
 	return *res_counter_member(counter, member);
 }
+
+void res_counter_write_u64(struct res_counter *counter, int member, u64 val)
+{
+	*res_counter_member(counter, member) = val;
+}
 #endif
 
 int res_counter_memparse_write_strategy(const char *buf,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/9] cgroups: new resource counter inheritance API
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2013-12-12 21:35   ` [PATCH 1/9] " Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 3/9] cgroups: ability to stop res charge propagation on bounded ancestor Dwight Engen
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Provide an API to inherit a counter value from a parent.  This can be
useful to implement cgroup.clone_children on a resource counter.

Still the resources of the children are limited by those of the parent, so
this is only to provide a default setting behaviour when clone_children is
set.

Signed-off-by: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Paul Menage <paul-inf54ven1CmVyaH7bEyXVA@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Tim Hockin <thockin-Rl2oBbRerpQdnm+yROfE0A@public.gmane.org>
Cc: Tejun Heo <htejun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Containers <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Glauber Costa <glommer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Daniel J Walsh <dwalsh-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Max Kellermann <mk-xMchvyqCc6DQT0dZR+AlfA@public.gmane.org>
Cc: Mandeep Singh Baines <msb-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Acked-by: Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/res_counter.h |  2 ++
 kernel/res_counter.c        | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index eb78440..92b9e18 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -80,6 +80,8 @@ int res_counter_memparse_write_strategy(const char *buf,
 
 void res_counter_write_u64(struct res_counter *counter, int member, u64 val);
 
+void res_counter_inherit(struct res_counter *counter, int member);
+
 /*
  * the field descriptors. one for each member of res_counter
  */
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 673a7a6..6174478 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -219,3 +219,21 @@ int res_counter_memparse_write_strategy(const char *buf,
 
 	return 0;
 }
+
+/*
+ * Simple inheritance implementation to get the same value
+ * than a parent. However this doesn't enforce the child value
+ * to be always below the one of the parent. But the child is
+ * subject to its parent limitation anyway.
+ */
+void res_counter_inherit(struct res_counter *counter, int member)
+{
+	struct res_counter *parent;
+	unsigned long long val;
+
+	parent = counter->parent;
+	if (parent) {
+		val = res_counter_read_u64(parent, member);
+		res_counter_write_u64(counter, member, val);
+	}
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/9] cgroups: ability to stop res charge propagation on bounded ancestor
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2013-12-12 21:35   ` [PATCH 1/9] " Dwight Engen
  2013-12-12 21:35   ` [PATCH 2/9] cgroups: new resource counter inheritance API Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 4/9] cgroups: add res counter common ancestor searching Dwight Engen
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Moving a task from a cgroup to another may require to substract its
resource charge from the old cgroup and add it to the new one.

For this to happen, the uncharge/charge propagation can just stop when we
reach the common ancestor for the two cgroups.  Further the performance
reasons, we also want to avoid to temporarily overload the common
ancestors with a non-accurate resource counter usage if we charge first
the new cgroup and uncharge the old one thereafter.  This is going to be a
requirement for the coming max number of task subsystem.

To solve this, provide a pair of new API that can charge/uncharge a
resource counter until we reach a given ancestor.

Signed-off-by: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Acked-by: Paul Menage <paul-inf54ven1CmVyaH7bEyXVA@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Tim Hockin <thockin-Rl2oBbRerpQdnm+yROfE0A@public.gmane.org>
Cc: Tejun Heo <htejun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Containers <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Glauber Costa <glommer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Daniel J Walsh <dwalsh-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Max Kellermann <mk-xMchvyqCc6DQT0dZR+AlfA@public.gmane.org>
Cc: Mandeep Singh Baines <msb-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Acked-by: Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 Documentation/cgroups/resource_counter.txt | 13 ++++++++++---
 include/linux/res_counter.h                |  4 ++++
 kernel/res_counter.c                       | 21 ++++++++++++++++-----
 3 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index c4d99ed..57f41d5 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -83,7 +83,14 @@ to work with it.
 	res_counter->lock internally (it must be called with res_counter->lock
 	held). The force parameter indicates whether we can bypass the limit.
 
- e. u64 res_counter_uncharge[_locked]
+ e. int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *limit, unsigned long val,
+			     struct res_counter **limit_fail_at)
+
+	The same as res_counter_charge(), but the charge propagation to
+	the hierarchy stops at the limit given in the "limit" parameter.
+
+ f. u64 res_counter_uncharge[_locked]
 			(struct res_counter *rc, unsigned long val)
 
 	When a resource is released (freed) it should be de-accounted
@@ -93,12 +100,12 @@ to work with it.
 
 	The _locked routines imply that the res_counter->lock is taken.
 
- f. u64 res_counter_uncharge_until
+ g. u64 res_counter_uncharge_until
 		(struct res_counter *rc, struct res_counter *top,
 		 unsinged long val)
 
 	Almost same as res_cunter_uncharge() but propagation of uncharge
-	stops when rc == top. This is useful when kill a res_coutner in
+	stops when rc == top. This is useful when kill a res_counter in
 	child cgroup.
 
  2.1 Other accounting routines
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 92b9e18..9721fde 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -119,6 +119,10 @@ int __must_check res_counter_charge_locked(struct res_counter *counter,
 					   unsigned long val, bool force);
 int __must_check res_counter_charge(struct res_counter *counter,
 		unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_until(struct res_counter *counter,
+					  struct res_counter *top,
+					  unsigned long val,
+					  struct res_counter **limit_fail_at);
 int res_counter_charge_nofail(struct res_counter *counter,
 		unsigned long val, struct res_counter **limit_fail_at);
 
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6174478..dadb16e 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -40,8 +40,9 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
 	return ret;
 }
 
-static int __res_counter_charge(struct res_counter *counter, unsigned long val,
-				struct res_counter **limit_fail_at, bool force)
+static int __res_counter_charge_until(struct res_counter *counter,
+			struct res_counter *top, unsigned long val,
+			struct res_counter **limit_fail_at, bool force)
 {
 	int ret, r;
 	unsigned long flags;
@@ -50,7 +51,7 @@ static int __res_counter_charge(struct res_counter *counter, unsigned long val,
 	r = ret = 0;
 	*limit_fail_at = NULL;
 	local_irq_save(flags);
-	for (c = counter; c != NULL; c = c->parent) {
+	for (c = counter; c != top; c = c->parent) {
 		spin_lock(&c->lock);
 		r = res_counter_charge_locked(c, val, force);
 		spin_unlock(&c->lock);
@@ -77,13 +78,23 @@ static int __res_counter_charge(struct res_counter *counter, unsigned long val,
 int res_counter_charge(struct res_counter *counter, unsigned long val,
 			struct res_counter **limit_fail_at)
 {
-	return __res_counter_charge(counter, val, limit_fail_at, false);
+	return __res_counter_charge_until(counter, NULL, val, limit_fail_at,
+					  false);
+}
+
+int res_counter_charge_until(struct res_counter *counter,
+			     struct res_counter *top, unsigned long val,
+			     struct res_counter **limit_fail_at)
+{
+	return __res_counter_charge_until(counter, top, val, limit_fail_at,
+					  false);
 }
 
 int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
 			      struct res_counter **limit_fail_at)
 {
-	return __res_counter_charge(counter, val, limit_fail_at, true);
+	return __res_counter_charge_until(counter, NULL, val, limit_fail_at,
+					  true);
 }
 
 u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/9] cgroups: add res counter common ancestor searching
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 3/9] cgroups: ability to stop res charge propagation on bounded ancestor Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 5/9] res_counter: allow charge failure pointer to be null Dwight Engen
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: "Kirill A. Shutemov" <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>

Add a new API to find the common ancestor between two resource counters.
This includes the passed resource counter themselves.

Signed-off-by: Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
Signed-off-by: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Paul Menage <paul-inf54ven1CmVyaH7bEyXVA@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Tim Hockin <thockin-Rl2oBbRerpQdnm+yROfE0A@public.gmane.org>
Cc: Tejun Heo <htejun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Containers <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Glauber Costa <glommer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Daniel J Walsh <dwalsh-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Max Kellermann <mk-xMchvyqCc6DQT0dZR+AlfA@public.gmane.org>
Cc: Mandeep Singh Baines <msb-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/res_counter.h |  4 ++++
 kernel/res_counter.c        | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 9721fde..4addc7a 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -144,6 +144,10 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
 u64 res_counter_uncharge_until(struct res_counter *counter,
 			       struct res_counter *top,
 			       unsigned long val);
+
+struct res_counter *res_counter_common_ancestor(struct res_counter *l,
+						struct res_counter *r);
+
 /**
  * res_counter_margin - calculate chargeable space of a counter
  * @cnt: the counter
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index dadb16e..8c16224 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -132,6 +132,40 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val)
 	return res_counter_uncharge_until(counter, NULL, val);
 }
 
+/*
+ * Walk through r1 and r2 parents and try to find the closest common one
+ * between both. If none is found, it returns NULL.
+ */
+struct res_counter *
+res_counter_common_ancestor(struct res_counter *r1, struct res_counter *r2)
+{
+	struct res_counter *iter;
+	int r1_depth = 0, r2_depth = 0;
+
+	for (iter = r1; iter; iter = iter->parent)
+		r1_depth++;
+
+	for (iter = r2; iter; iter = iter->parent)
+		r2_depth++;
+
+	while (r1_depth > r2_depth) {
+		r1 = r1->parent;
+		r1_depth--;
+	}
+
+	while (r2_depth > r1_depth) {
+		r2 = r2->parent;
+		r2_depth--;
+	}
+
+	while (r1 != r2) {
+		r1 = r1->parent;
+		r2 = r2->parent;
+	}
+
+	return r1;
+}
+
 static inline unsigned long long *
 res_counter_member(struct res_counter *counter, int member)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/9] res_counter: allow charge failure pointer to be null
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 4/9] cgroups: add res counter common ancestor searching Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 6/9] cgroups: pull up res counter charge failure interpretation to caller Dwight Engen
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

So that callers of res_counter_charge() don't have to create and pass this
pointer even if they aren't interested in it.

Signed-off-by: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Paul Menage <paul-inf54ven1CmVyaH7bEyXVA@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Tim Hockin <thockin-Rl2oBbRerpQdnm+yROfE0A@public.gmane.org>
Cc: Tejun Heo <htejun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Containers <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Glauber Costa <glommer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Daniel J Walsh <dwalsh-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Max Kellermann <mk-xMchvyqCc6DQT0dZR+AlfA@public.gmane.org>
Cc: Mandeep Singh Baines <msb-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Acked-by: Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 kernel/res_counter.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 8c16224..ece24f7 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -49,7 +49,8 @@ static int __res_counter_charge_until(struct res_counter *counter,
 	struct res_counter *c, *u;
 
 	r = ret = 0;
-	*limit_fail_at = NULL;
+	if (limit_fail_at)
+		*limit_fail_at = NULL;
 	local_irq_save(flags);
 	for (c = counter; c != top; c = c->parent) {
 		spin_lock(&c->lock);
@@ -57,7 +58,8 @@ static int __res_counter_charge_until(struct res_counter *counter,
 		spin_unlock(&c->lock);
 		if (r < 0 && !ret) {
 			ret = r;
-			*limit_fail_at = c;
+			if (limit_fail_at)
+				*limit_fail_at = c;
 			if (!force)
 				break;
 		}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/9] cgroups: pull up res counter charge failure interpretation to caller
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 5/9] res_counter: allow charge failure pointer to be null Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 7/9] fix compile error in cgroup_taskset_for_each() when skip_css is NULL Dwight Engen
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.

However it's up to the caller to interpret this failure and return the
appropriate error value.  The task counter subsystem will need to report
the user that a fork() has been cancelled because of some limit reached,
not because we are too short on memory.

Fix this by returning -1 when res_counter_charge() fails.

Signed-off-by: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Paul Menage <paul-inf54ven1CmVyaH7bEyXVA@public.gmane.org>
Cc: Li Zefan <lizf-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Tim Hockin <thockin-Rl2oBbRerpQdnm+yROfE0A@public.gmane.org>
Cc: Tejun Heo <htejun-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Containers <containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Cc: Glauber Costa <glommer-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Cc: Daniel J Walsh <dwalsh-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: "Daniel P. Berrange" <berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Max Kellermann <mk-xMchvyqCc6DQT0dZR+AlfA@public.gmane.org>
Cc: Mandeep Singh Baines <msb-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
Acked-by: Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>
Signed-off-by: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 Documentation/cgroups/resource_counter.txt | 2 ++
 kernel/res_counter.c                       | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 57f41d5..7ca0d11 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -76,6 +76,8 @@ to work with it.
 	limit_fail_at parameter is set to the particular res_counter element
 	where the charging failed.
 
+	It returns 0 on success and -1 on failure.
+
  d. int res_counter_charge_locked
 			(struct res_counter *rc, unsigned long val, bool force)
 
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index ece24f7..79155f9 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -29,7 +29,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
 
 	if (counter->usage + val > counter->limit) {
 		counter->failcnt++;
-		ret = -ENOMEM;
+		ret = -1;
 		if (!force)
 			return ret;
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 7/9] fix compile error in cgroup_taskset_for_each() when skip_css is NULL
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 6/9] cgroups: pull up res counter charge failure interpretation to caller Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 8/9] cgroups: Add task and fork limits to cpuacct subsystem Dwight Engen
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

The gcc error message is:
include/linux/cgroup.h:587:14: error: request for member ‘ss’ in something
not a structure or union

Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
---
 include/linux/cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3561d30..9b20ba9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -584,7 +584,7 @@ int cgroup_taskset_size(struct cgroup_taskset *tset);
 	     (task) = cgroup_taskset_next((tset)))			\
 		if (!(skip_css) ||					\
 		    cgroup_taskset_cur_css((tset),			\
-			(skip_css)->ss->subsys_id) != (skip_css))
+			((struct cgroup_subsys_state *)skip_css)->ss->subsys_id) != (skip_css))
 
 /*
  * Control Group subsystem type.
-- 
1.8.3.1

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 8/9] cgroups: Add task and fork limits to cpuacct subsystem
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 7/9] fix compile error in cgroup_taskset_for_each() when skip_css is NULL Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-12 21:35   ` [PATCH 9/9] selftests: Add a new task counter selftest Dwight Engen
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

A task limit can be set that is checked every time a task forks or
is moved into the cgroup. For performance reasons the accounting is
not performed unless a limit is set.

The primary goal is to protect against forkbombs that explode
inside a container. The traditional NR_PROC rlimit is not
efficient in that case because if we run containers in parallel
under the same user, one of these could starve all the others
by spawning a high number of tasks close to the user wide limit.

A secondary goal is to limit the total number of forks a container
can do, for example for use in a temporary cgroup created to
process a CGI request. This is implemented with a separate fork
count limit.

Original Author: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dwight Engen <dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 Documentation/cgroups/cpuacct.txt |  35 ++++-
 include/linux/cgroup.h            |   8 +-
 kernel/cgroup.c                   |  37 ++++-
 kernel/exit.c                     |   2 +-
 kernel/fork.c                     |   7 +-
 kernel/sched/cpuacct.c            | 279 +++++++++++++++++++++++++++++++++++++-
 6 files changed, 347 insertions(+), 21 deletions(-)

diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 9d73cc0..4d5a568 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -2,11 +2,13 @@ CPU Accounting Controller
 -------------------------
 
 The CPU accounting controller is used to group tasks using cgroups and
-account the CPU usage of these groups of tasks.
+account the CPU usage of these groups of tasks. It can also limit the
+number of tasks running inside the cgroup, and limit the total number of
+forks done by processes in the cgroup.
 
 The CPU accounting controller supports multi-hierarchy groups. An accounting
-group accumulates the CPU usage of all of its child groups and the tasks
-directly present in its group.
+group accumulates the CPU and task usage of all of its child groups and the
+tasks directly present in its group.
 
 Accounting groups can be created by first mounting the cgroup filesystem.
 
@@ -47,3 +49,30 @@ system times. This has two side effects:
   against concurrent writes.
 - It is possible to see slightly outdated values for user and system times
   due to the batch processing nature of percpu_counter.
+
+cpuacct.fork_usage maintains a counter which is incremented each time a new
+process/thread is created. For performance reasons, this accounting is not
+done unless cpuacct.fork_limit is set.
+
+cpuacct.fork_limit limits the number of times a new child process or thread
+can be created. If cpuacct.fork_limit is set, when cpuacct.fork_usage
+reaches the limit, no process in the cgroup is allowed to create new child
+processes/threads, even if existing ones quit. A limit other than 0 cannot
+be set if the cgroup has children or tasks already assigned. Setting the
+limit to 0 is useful for stopping an in progress fork bomb. The limit in the
+root of the cgroup heirarchy cannot be set.
+
+This has been proven useful in a shared hosting environment.  A new
+temporary cgroup is created for each CGI process, and the maximum fork
+count is configured to a sensible value.  Since CGIs are expected to
+run for only a short time with predictable resource usage, this may be
+an appropriate tool to limit the damage that a freaked CGI can do.
+
+cpuacct.task_usage maintains a counter of the number of tasks in the cgroup.
+For performance reasons, this accounting is not done unless
+cpuacct.task_limit is set.
+
+cpuacct.task_limit limits the number of tasks running inside a given cgroup.
+It behaves like the NR_PROC rlimit but in the scope of a cgroup instead of a
+user. This limit is checked when a task forks or when it is migrated to the
+cgroup. The limit in the root of the cgroup heirarchy cannot be set.
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9b20ba9..519c80e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -33,9 +33,9 @@ struct eventfd_ctx;
 
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
-extern void cgroup_fork(struct task_struct *p);
+extern int cgroup_fork(struct task_struct *p);
 extern void cgroup_post_fork(struct task_struct *p);
-extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_exit(struct task_struct *p);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -603,6 +603,8 @@ struct cgroup_subsys {
 			      struct cgroup_taskset *tset);
 	void (*attach)(struct cgroup_subsys_state *css,
 		       struct cgroup_taskset *tset);
+	int (*can_fork)(void);
+	void (*cancel_can_fork)(void);
 	void (*fork)(struct task_struct *task);
 	void (*exit)(struct cgroup_subsys_state *css,
 		     struct cgroup_subsys_state *old_css,
@@ -913,7 +915,7 @@ static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_fork(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
-static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_exit(struct task_struct *p) {}
 
 static inline int cgroupstats_build(struct cgroupstats *stats,
 					struct dentry *dentry)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5c9127d..8abacad 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4855,7 +4855,7 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
 	 * init_css_set is in the subsystem's top cgroup. */
 	init_css_set.subsys[ss->subsys_id] = css;
 
-	need_forkexit_callback |= ss->fork || ss->exit;
+	need_forkexit_callback |= ss->fork || ss->can_fork || ss->exit;
 
 	/* At system boot, before all subsystems have been
 	 * registered, no tasks have been forked, so we don't
@@ -5282,13 +5282,40 @@ static const struct file_operations proc_cgroupstats_operations = {
  * At the point that cgroup_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  */
-void cgroup_fork(struct task_struct *child)
+int cgroup_fork(struct task_struct *child)
 {
+	struct cgroup_subsys *ss;
+	struct cgroup_subsys *failed_ss;
+	int i;
+	int err = 0;
+
 	task_lock(current);
+	if (need_forkexit_callback) {
+		for_each_builtin_subsys(ss, i) {
+			if (ss->can_fork) {
+				err = ss->can_fork();
+				if (err) {
+					failed_ss = ss;
+					goto out_cancel_fork;
+				}
+			}
+		}
+	}
 	get_css_set(task_css_set(current));
 	child->cgroups = current->cgroups;
-	task_unlock(current);
 	INIT_LIST_HEAD(&child->cg_list);
+
+out_cancel_fork:
+	if (err) {
+		for_each_builtin_subsys(ss, i) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_can_fork)
+				ss->cancel_can_fork();
+		}
+	}
+	task_unlock(current);
+	return err;
 }
 
 /**
@@ -5381,7 +5408,7 @@ void cgroup_post_fork(struct task_struct *child)
  *    which wards off any cgroup_attach_task() attempts, or task is a failed
  *    fork, never visible to cgroup_attach_task.
  */
-void cgroup_exit(struct task_struct *tsk, int run_callbacks)
+void cgroup_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
 	struct css_set *cset;
@@ -5404,7 +5431,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 	cset = task_css_set(tsk);
 	RCU_INIT_POINTER(tsk->cgroups, &init_css_set);
 
-	if (run_callbacks && need_forkexit_callback) {
+	if (need_forkexit_callback) {
 		/*
 		 * fork/exit callbacks are supported only for builtin
 		 * subsystems, see cgroup_post_fork() for details.
diff --git a/kernel/exit.c b/kernel/exit.c
index a949819..74c4964 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -796,7 +796,7 @@ void do_exit(long code)
 	 */
 	perf_event_exit_task(tsk);
 
-	cgroup_exit(tsk, 1);
+	cgroup_exit(tsk);
 
 	if (group_dead)
 		disassociate_ctty(1);
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73..cff2f73 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1264,7 +1264,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->audit_context = NULL;
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_begin(current);
-	cgroup_fork(p);
+	retval = cgroup_fork(p);
+	if (retval)
+		goto bad_fork_cleanup_threadgroup;
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
 	if (IS_ERR(p->mempolicy)) {
@@ -1523,9 +1525,10 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
+	cgroup_exit(p);
+bad_fork_cleanup_threadgroup:
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
-	cgroup_exit(p, 0);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f64722f..e23e543 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -5,6 +5,7 @@
 #include <linux/cpumask.h>
 #include <linux/seq_file.h>
 #include <linux/rcupdate.h>
+#include <linux/res_counter.h>
 #include <linux/kernel_stat.h>
 #include <linux/err.h>
 
@@ -31,6 +32,11 @@ struct cpuacct {
 	/* cpuusage holds pointer to a u64-type object on every cpu */
 	u64 __percpu *cpuusage;
 	struct kernel_cpustat __percpu *cpustat;
+
+	/* counter for allowed tasks */
+	struct res_counter task_limit;
+	/* counter for allowed forks */
+	struct res_counter fork_limit;
 };
 
 static inline struct cpuacct *css_ca(struct cgroup_subsys_state *css)
@@ -49,6 +55,11 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return css_ca(css_parent(&ca->css));
 }
 
+static inline bool res_limit_enabled(struct res_counter *res)
+{
+	return res_counter_read_u64(res, RES_LIMIT) != RES_COUNTER_MAX;
+}
+
 static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
@@ -61,8 +72,11 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct cpuacct *ca;
 
-	if (!parent_css)
+	if (!parent_css) {
+		res_counter_init(&root_cpuacct.task_limit, NULL);
+		res_counter_init(&root_cpuacct.fork_limit, NULL);
 		return &root_cpuacct.css;
+	}
 
 	ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 	if (!ca)
@@ -76,6 +90,12 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!ca->cpustat)
 		goto out_free_cpuusage;
 
+	res_counter_init(&ca->task_limit, &css_ca(parent_css)->task_limit);
+	res_counter_inherit(&ca->task_limit, RES_LIMIT);
+
+	res_counter_init(&ca->fork_limit, &css_ca(parent_css)->fork_limit);
+	res_counter_inherit(&ca->fork_limit, RES_LIMIT);
+
 	return &ca->css;
 
 out_free_cpuusage:
@@ -212,6 +232,223 @@ static int cpuacct_stats_show(struct cgroup_subsys_state *css,
 	return 0;
 }
 
+static u64 cpuacct_task_limit_read_u64(struct cgroup_subsys_state *css,
+				       struct cftype *cft)
+{
+	struct cpuacct *ca = css_ca(css);
+	int type = cft->private;
+
+	return res_counter_read_u64(&ca->task_limit, type);
+}
+
+static int cpuacct_task_limit_write_u64(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct cpuacct *ca = css_ca(css);
+	struct cgroup *cgrp = ca->css.cgroup;
+	int type = cft->private;
+
+	if (ca == &root_cpuacct)
+		return -EINVAL;
+
+	if (val != RES_COUNTER_MAX) {
+		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
+			return -EBUSY;
+		res_counter_write_u64(&ca->task_limit, type, val);
+	}
+
+	return 0;
+}
+
+static u64 cpuacct_fork_limit_read_u64(struct cgroup_subsys_state *css,
+				       struct cftype *cft)
+{
+	struct cpuacct *ca = css_ca(css);
+	int type = cft->private;
+
+	return res_counter_read_u64(&ca->fork_limit, type);
+}
+
+static int cpuacct_fork_limit_write_u64(struct cgroup_subsys_state *css,
+					struct cftype *cft, u64 val)
+{
+	struct cpuacct *ca = css_ca(css);
+	struct cgroup *cgrp = ca->css.cgroup;
+	int type = cft->private;
+
+	if (ca == &root_cpuacct)
+		return -EINVAL;
+
+	if (val != RES_COUNTER_MAX) {
+		/* always allow 0 to stop an ongoing fork bomb */
+		if (val != 0 &&
+		    (cgroup_task_count(cgrp) || !list_empty(&cgrp->children)))
+			return -EBUSY;
+		res_counter_write_u64(&ca->fork_limit, type, val);
+	}
+
+	return 0;
+}
+
+static int cpuacct_can_fork(void)
+{
+	int err = 0;
+	bool fork_charged = 0;
+	struct cpuacct *ca = task_ca(current);
+
+	if (ca == &root_cpuacct)
+		return 0;
+
+	if (res_limit_enabled(&ca->fork_limit)) {
+		if (res_counter_charge(&ca->fork_limit, 1, NULL))
+			return -EPERM;
+		fork_charged = 1;
+	}
+
+	if (res_limit_enabled(&ca->task_limit)) {
+		if (res_counter_charge(&ca->task_limit, 1, NULL)) {
+			err = -EAGAIN;
+			goto err_task_limit;
+		}
+	}
+
+	return 0;
+
+err_task_limit:
+	if (fork_charged)
+		res_counter_uncharge(&ca->fork_limit, 1);
+	return err;
+}
+
+static void cpuacct_cancel_can_fork(void)
+{
+	struct cpuacct *ca = task_ca(current);
+
+	if (ca == &root_cpuacct)
+		return;
+
+	if (res_limit_enabled(&ca->fork_limit))
+		res_counter_uncharge(&ca->fork_limit, 1);
+
+	if (res_limit_enabled(&ca->task_limit))
+		res_counter_uncharge(&ca->task_limit, 1);
+}
+
+
+static void cpuacct_exit(struct cgroup_subsys_state *css,
+			 struct cgroup_subsys_state *old_css,
+			 struct task_struct *task)
+{
+	struct cpuacct *ca = css_ca(old_css);
+
+	if (ca == &root_cpuacct)
+		return;
+
+	if (res_limit_enabled(&ca->task_limit))
+		res_counter_uncharge(&ca->task_limit, 1);
+}
+
+/*
+ * Complete the attach by uncharging the old cgroups. We can do that now that
+ * we are sure the attachment can't be cancelled anymore, because this uncharge
+ * operation couldn't be reverted later: a task in the old cgroup could fork
+ * after we uncharge and reach the task counter limit, making our return there
+ * not possible.
+ */
+static void cpuacct_attach(struct cgroup_subsys_state *css,
+			   struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cpuacct *new = css_ca(css);
+	struct cpuacct *old;
+	struct res_counter *until;
+
+	cgroup_taskset_for_each(task, NULL, tset) {
+		old = css_ca(cgroup_taskset_cur_css(tset, cpuacct_subsys_id));
+		until = res_counter_common_ancestor(&new->task_limit,
+						    &old->task_limit);
+		if (until == &root_cpuacct.task_limit)
+			until = NULL;
+		if (res_limit_enabled(&old->task_limit))
+			res_counter_uncharge_until(&old->task_limit, until, 1);
+	}
+}
+
+static void cpuacct_cancel_attach_until(struct cgroup_subsys_state *css,
+					struct cgroup_taskset *tset,
+					struct task_struct *until_task)
+{
+	struct task_struct *task;
+	struct cpuacct *new = css_ca(css);
+	struct cpuacct *old;
+	struct res_counter *until;
+
+	cgroup_taskset_for_each(task, NULL, tset) {
+		if (task == until_task)
+			break;
+		old = css_ca(cgroup_taskset_cur_css(tset, cpuacct_subsys_id));
+		until = res_counter_common_ancestor(&new->task_limit,
+						    &old->task_limit);
+		if (until == &root_cpuacct.task_limit)
+			until = NULL;
+		if (res_limit_enabled(&new->task_limit))
+			res_counter_uncharge_until(&new->task_limit, until, 1);
+	}
+}
+
+/*
+ * This does more than just probing the ability to attach to the dest cgroup.
+ * We can not just _check_ if we can attach to the destination and do the real
+ * attachment later in cpuacct_attach() because a task in the dest cgroup can
+ * fork before we get there and steal the last remaining count, thus we must
+ * charge the dest cgroup right now.
+ */
+static int cpuacct_can_attach(struct cgroup_subsys_state *css,
+			      struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cpuacct *new = css_ca(css);
+	struct cpuacct *old;
+	struct res_counter *until;
+	int err;
+
+	cgroup_taskset_for_each(task, NULL, tset) {
+		old = css_ca(cgroup_taskset_cur_css(tset, cpuacct_subsys_id));
+
+		/*
+		 * When moving a task from a cgroup to another, we don't want
+		 * to charge the common ancestors, even though they would be
+		 * uncharged later in cpuacct_attach(), because during that
+		 * short window between charge and uncharge, a task could fork
+		 * in the ancestor and spuriously fail due to the temporary
+		 * charge. The exception is root_cpuacct since it is unlimited.
+		 */
+		until = res_counter_common_ancestor(&new->task_limit,
+						    &old->task_limit);
+		if (until == &root_cpuacct.task_limit)
+			until = NULL;
+
+		if (!res_limit_enabled(&new->task_limit))
+			continue;
+
+		err = res_counter_charge_until(&new->task_limit, until, 1, NULL);
+		if (err) {
+			cpuacct_cancel_attach_until(css, tset, task);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+/* Uncharge the cgroup that we charged in cpuacct_can_attach() */
+static void cpuacct_cancel_attach(struct cgroup_subsys_state *css,
+				  struct cgroup_taskset *tset)
+{
+	cpuacct_cancel_attach_until(css, tset, NULL);
+}
+
+
 static struct cftype files[] = {
 	{
 		.name = "usage",
@@ -226,6 +463,28 @@ static struct cftype files[] = {
 		.name = "stat",
 		.read_map = cpuacct_stats_show,
 	},
+	{
+		.name = "task_limit",
+		.read_u64 = cpuacct_task_limit_read_u64,
+		.write_u64 = cpuacct_task_limit_write_u64,
+		.private = RES_LIMIT,
+	},
+	{
+		.name = "task_usage",
+		.read_u64 = cpuacct_task_limit_read_u64,
+		.private = RES_USAGE,
+	},
+	{
+		.name = "fork_limit",
+		.read_u64 = cpuacct_fork_limit_read_u64,
+		.write_u64 = cpuacct_fork_limit_write_u64,
+		.private = RES_LIMIT,
+	},
+	{
+		.name = "fork_usage",
+		.read_u64 = cpuacct_fork_limit_read_u64,
+		.private = RES_USAGE,
+	},
 	{ }	/* terminate */
 };
 
@@ -278,10 +537,16 @@ void cpuacct_account_field(struct task_struct *p, int index, u64 val)
 }
 
 struct cgroup_subsys cpuacct_subsys = {
-	.name		= "cpuacct",
-	.css_alloc	= cpuacct_css_alloc,
-	.css_free	= cpuacct_css_free,
-	.subsys_id	= cpuacct_subsys_id,
-	.base_cftypes	= files,
-	.early_init	= 1,
+	.name			= "cpuacct",
+	.css_alloc		= cpuacct_css_alloc,
+	.css_free		= cpuacct_css_free,
+	.subsys_id		= cpuacct_subsys_id,
+	.base_cftypes		= files,
+	.early_init		= 1,
+	.can_fork		= cpuacct_can_fork,
+	.cancel_can_fork	= cpuacct_cancel_can_fork,
+	.exit			= cpuacct_exit,
+	.attach			= cpuacct_attach,
+	.can_attach		= cpuacct_can_attach,
+	.cancel_attach		= cpuacct_cancel_attach,
 };
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 9/9] selftests: Add a new task counter selftest
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 8/9] cgroups: Add task and fork limits to cpuacct subsystem Dwight Engen
@ 2013-12-12 21:35   ` Dwight Engen
  2013-12-20 21:13   ` [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API Serge E. Hallyn
  2013-12-23 12:54   ` Frederic Weisbecker
  10 siblings, 0 replies; 13+ messages in thread
From: Dwight Engen @ 2013-12-12 21:35 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Frederic Weisbecker, Max Kellermann

From: Frederic Weisbecker <fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

This is a batch of tests running against the cgroup task counter
subsystem in order to validate the expected behaviour from this
cgroup feature.

This checks the reliability of the value found in the tasks.usage file
after events like forks, exits or cgroup migration of whole processes
or individual threads.

This also check that forks or cgroup migration are either accepted or
rejected according to the value set in the tasks.limit file.

A forkbomb is also launched to test if the subsystem stops well its
propagation and kills it properly as expected.

Signed-off-by: Dwight Engen <dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 tools/testing/selftests/Makefile                   |   1 +
 tools/testing/selftests/cpuacct/Makefile           |  11 ++
 tools/testing/selftests/cpuacct/fork.c             |  40 ++++
 tools/testing/selftests/cpuacct/forkbomb.c         |  40 ++++
 tools/testing/selftests/cpuacct/multithread.c      |  68 +++++++
 tools/testing/selftests/cpuacct/run_test           | 214 +++++++++++++++++++++
 .../selftests/cpuacct/spread_thread_group.c        |  82 ++++++++
 7 files changed, 456 insertions(+)
 create mode 100644 tools/testing/selftests/cpuacct/Makefile
 create mode 100644 tools/testing/selftests/cpuacct/fork.c
 create mode 100644 tools/testing/selftests/cpuacct/forkbomb.c
 create mode 100644 tools/testing/selftests/cpuacct/multithread.c
 create mode 100755 tools/testing/selftests/cpuacct/run_test
 create mode 100644 tools/testing/selftests/cpuacct/spread_thread_group.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9f3eae2..61405ab 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -9,6 +9,7 @@ TARGETS += ptrace
 TARGETS += timers
 TARGETS += vm
 TARGETS += powerpc
+TARGETS += cpuacct
 
 all:
 	for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/cpuacct/Makefile b/tools/testing/selftests/cpuacct/Makefile
new file mode 100644
index 0000000..feb39e5
--- /dev/null
+++ b/tools/testing/selftests/cpuacct/Makefile
@@ -0,0 +1,11 @@
+all:
+	gcc fork.c -o fork
+	gcc forkbomb.c -o forkbomb
+	gcc multithread.c -o multithread -lpthread
+	gcc spread_thread_group.c -o spread_thread_group -lpthread
+
+run_tests: all
+	./run_test
+
+clean:
+	rm -f fork forkbomb multithread spread_thread_group *~
diff --git a/tools/testing/selftests/cpuacct/fork.c b/tools/testing/selftests/cpuacct/fork.c
new file mode 100644
index 0000000..c85a610
--- /dev/null
+++ b/tools/testing/selftests/cpuacct/fork.c
@@ -0,0 +1,40 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Move a task to a cgroup and try to fork on it.
+ */
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <string.h>
+
+int main(int argc, char **argv)
+{
+	FILE *fp;
+	int pid;
+	char cpid[50];
+
+	if (argc < 2)
+		return -1;
+
+	pid = getpid();
+	fp = fopen(argv[1], "w");
+	if (!fp)
+		return -2;
+
+	sprintf(cpid, "%d\n", pid);
+
+	if (fwrite(cpid, strlen(cpid), 1, fp) != 1) {
+		perror("can't write pid");
+		return -3;
+	}
+	fclose(fp);
+
+	if (fork() == -1)
+		return -4;
+
+	return 0;
+}
diff --git a/tools/testing/selftests/cpuacct/forkbomb.c b/tools/testing/selftests/cpuacct/forkbomb.c
new file mode 100644
index 0000000..221fefb
--- /dev/null
+++ b/tools/testing/selftests/cpuacct/forkbomb.c
@@ -0,0 +1,40 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Move a task to a cgroup and forkbomb there.
+ */
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <string.h>
+
+int main(int argc, char **argv)
+{
+	FILE *fp;
+	int pid;
+	char cpid[50];
+
+	if (argc < 2)
+		return -1;
+
+	pid = getpid();
+	fp = fopen(argv[1], "w");
+	if (!fp)
+		return -2;
+
+	sprintf(cpid, "%d\n", pid);
+
+	if (fwrite(cpid, strlen(cpid), 1, fp) != 1) {
+		perror("can't write pid\n");
+		return -3;
+	}
+	fclose(fp);
+
+	for (;;)
+		fork();
+
+	return 0;
+}
diff --git a/tools/testing/selftests/cpuacct/multithread.c b/tools/testing/selftests/cpuacct/multithread.c
new file mode 100644
index 0000000..6d485de
--- /dev/null
+++ b/tools/testing/selftests/cpuacct/multithread.c
@@ -0,0 +1,68 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Try to move a multithread proc to a cgroup.
+ */
+
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <pthread.h>
+
+static volatile int thread_end;
+pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+
+
+static void *thread_start(void *unused)
+{
+	pthread_mutex_lock(&mutex);
+
+	while (!thread_end)
+		pthread_cond_wait(&cond, &mutex);
+
+	pthread_mutex_unlock(&mutex);
+
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	int fd;
+	int pid;
+	char cpid[50];
+	pthread_t thread;
+
+	if (argc < 2)
+		return -1;
+
+	if (pthread_create(&thread, NULL, thread_start, NULL) != 0)
+		return -2;
+
+	pid = getpid();
+	fd = open(argv[1], O_WRONLY);
+	if (fd < 0)
+		return -3;
+
+	sprintf(cpid, "%d\n", pid);
+
+	if (write(fd, cpid, strlen(cpid)) < 0)
+		return -4;
+
+	close(fd);
+
+	pthread_mutex_lock(&mutex);
+	thread_end = 1;
+	pthread_cond_signal(&cond);
+	pthread_mutex_unlock(&mutex);
+
+	pthread_join(thread, NULL);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/cpuacct/run_test b/tools/testing/selftests/cpuacct/run_test
new file mode 100755
index 0000000..7532404
--- /dev/null
+++ b/tools/testing/selftests/cpuacct/run_test
@@ -0,0 +1,214 @@
+#!/bin/bash
+# Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+#
+# Licensed under the terms of the GNU GPL License version 2
+#
+# Validation tests for the cgroup task counter subsystem
+
+BASE=`grep '^cgroup.*cpuacct' /proc/mounts |awk '{print $2}'`
+
+if [ $UID != 0 ]
+then
+	echo "Must be root to run task counter selftest"
+	exit 0
+fi
+
+if [ -z $BASE ]
+then
+	echo "Can't find cpuacct mount in /proc/mounts"
+	echo "Ensure cpuacct cgroup is mounted"
+	exit 0
+fi
+
+sleep 1d &
+PID1=$!
+
+sleep 1d &
+PID2=$!
+
+echo 1 > $BASE/cgroup.clone_children
+mkdir $BASE/cgroup0
+
+function test_result
+{
+    # Result of the test
+    res=$1
+    # Expected result
+    expected=$2
+    # Invert test against expected result
+    # 0 = equal
+    # 1 = non equal
+    inv=$3
+    # String message
+    test_str=$4
+    passed=0
+
+    echo -n $test_str
+
+    if [ $res = $expected ]
+    then
+	passed=1
+    fi
+
+    passed=$[$passed ^ $inv]
+
+    if [ $passed = 1 ]
+    then
+	echo " [OK]"
+    else
+	echo " [FAILED]"
+    fi
+
+}
+
+# simple test limit
+echo 1 > $BASE/cgroup0/cpuacct.task_limit
+echo $PID1 >  $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Allow 1 task on limit 1"
+
+echo $PID2 >  $BASE/cgroup0/cgroup.procs 2>/dev/null
+test_result $? 0 1 "Don't allow 2 tasks on limit 1"
+
+# can't change limit once cgroup has tasks
+echo 2 > $BASE/cgroup0/cpuacct.task_limit 2>/dev/null
+test_result $? 0 1 "Can't change limit with tasks"
+
+# simple test usage
+USAGE=$(cat $BASE/cgroup0/cpuacct.task_usage)
+test_result $USAGE 1 0 "Correct usage count "
+
+# simple test exit
+kill $PID1
+wait $PID1 2>/dev/null
+USAGE=$(cat $BASE/cgroup0/cpuacct.task_usage)
+test_result $USAGE 0 0 "Correct usage count after exit "
+
+sleep 1d &
+PID1=$!
+
+echo 1 > $BASE/cgroup0/cpuacct.task_limit
+echo $PID1 >  $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Correct reuse after exit "
+USAGE=$(cat $BASE/cgroup0/cpuacct.task_usage)
+test_result $USAGE 1 0 "Correct usage count "
+
+# simple move to root
+
+echo $PID1 > $BASE/cgroup.procs
+test_result $? 0 0 "Correct move to root "
+
+# propagation tests
+
+mkdir $BASE/cgroup0/cgroup1
+mkdir $BASE/cgroup0/cgroup2
+
+echo 1 > $BASE/cgroup0/cgroup1/cpuacct.task_limit
+echo $PID1 > $BASE/cgroup0/cgroup1/cgroup.procs
+USAGE=$(cat $BASE/cgroup0/cpuacct.task_usage)
+test_result $USAGE 1 0 "Correct propagated usage "
+
+echo $PID1 > $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Correct move on parent "
+
+
+# move
+
+echo $PID1 > $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 0 "Correct move on child "
+
+echo $PID1 > $BASE/cgroup0/cgroup2/cgroup.procs
+test_result $? 0 0 "Correct move on sibling "
+
+echo $PID2 > $BASE/cgroup0/cgroup1/cgroup.procs 2>/dev/null
+test_result $? 0 1 "Correct propagation limit "
+
+kill $PID1
+wait $PID1 2>/dev/null
+kill $PID2
+wait $PID2 2>/dev/null
+
+rmdir $BASE/cgroup0/cgroup1
+rmdir $BASE/cgroup0/cgroup2
+
+# test limit on thread group
+usleep 100000
+echo 1024 > $BASE/cgroup0/cpuacct.task_limit
+mkdir $BASE/cgroup0/cgroup1
+
+# can't change limit once cgroup has children
+echo 2 > $BASE/cgroup0/cpuacct.task_limit 2>/dev/null
+test_result $? 0 1 "Can't change limit with children"
+
+echo 0 > $BASE/cgroup0/cgroup1/cpuacct.task_limit
+./multithread $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 1 "Correct limit on multithreaded"
+
+# test move of a thread group
+echo 2 > $BASE/cgroup0/cgroup1/cpuacct.task_limit
+
+./multithread $BASE/cgroup0/cgroup1/cgroup.procs
+test_result $? 0 0 "Correct move of multithreaded"
+
+rmdir $BASE/cgroup0/cgroup1
+
+# test bug on common ancestor logic
+# as described in https://lkml.org/lkml/2011/11/8/218
+
+./spread_thread_group $BASE/cgroup0/cgroup.procs $BASE/tasks
+test_result $? 0 0 "Test bug on common ancestor logic"
+
+# test task limit with fork
+usleep 100000
+echo 1 > $BASE/cgroup0/cpuacct.task_limit
+./fork $BASE/cgroup0/cgroup.procs
+test_result $? 0 1 "Correct task limit "
+
+# test fork limit
+echo 2 > $BASE/cgroup0/cpuacct.task_limit
+echo 1 > $BASE/cgroup0/cpuacct.fork_limit
+./fork $BASE/cgroup0/cgroup.procs
+test_result $? 0 0 "Fork allowed "
+./fork $BASE/cgroup0/cgroup.procs
+test_result $? 0 1 "Fork denied "
+
+rmdir $BASE/cgroup0
+mkdir $BASE/cgroup0
+
+# test forkbomb
+
+echo 128 > $BASE/cgroup0/cpuacct.task_limit
+echo -n "Trying to stop forkbomb propagation..."
+./forkbomb $BASE/cgroup0/cgroup.procs &
+sleep 1
+RES=$(cat $BASE/cgroup0/cpuacct.task_usage)
+test_result $RES 128 0 ""
+
+# kill forkbomb
+
+echo -n "Trying to kill forkbomb "
+echo 0 > $BASE/cgroup0/cpuacct.fork_limit
+END=false
+
+while [ $END = false ]
+do
+	NR_TASKS=$(cat $BASE/cgroup0/cpuacct.task_usage)
+	NR_KILLED=0
+
+	for TASK in $(cat $BASE/cgroup0/cgroup.procs)
+	do
+		NR_KILLED=$(($NR_KILLED+1))
+		kill -KILL $TASK
+		wait $TASK 2>/dev/null
+	done
+
+	if [ "$NR_TASKS" = "$NR_KILLED" ]
+	then
+		END=true
+	fi
+done
+
+echo "[OK]"
+
+# Wait a bit for killed tasks to exit the cgroup
+sleep 1
+rmdir $BASE/cgroup0
diff --git a/tools/testing/selftests/cpuacct/spread_thread_group.c b/tools/testing/selftests/cpuacct/spread_thread_group.c
new file mode 100644
index 0000000..24f42b9
--- /dev/null
+++ b/tools/testing/selftests/cpuacct/spread_thread_group.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc., Frederic Weisbecker <fweisbec-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Try to reproduce bug on common ancestor logic as described
+ * at https://lkml.org/lkml/2011/11/8/218
+ */
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <pthread.h>
+
+static volatile int thread_end;
+pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
+pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
+
+
+static void *thread_start(void *unused)
+{
+	pthread_mutex_lock(&mutex);
+
+	while (!thread_end)
+		pthread_cond_wait(&cond, &mutex);
+
+	pthread_mutex_unlock(&mutex);
+
+	return NULL;
+}
+
+int main(int argc, char **argv)
+{
+	int fd_root, fd;
+	int pid;
+	char cpid[50];
+	pthread_t thread;
+
+	if (argc < 3)
+		return -1;
+
+	if (pthread_create(&thread, NULL, thread_start, NULL) != 0)
+		return -2;
+
+	pid = getpid();
+	fd = open(argv[1], O_WRONLY);
+	if (fd < 0)
+		return -3;
+
+	sprintf(cpid, "%d\n", pid);
+
+	/* Move group to /dev/cgroup/cgroup0 */
+	if (write(fd, cpid, strlen(cpid)) < 0)
+		return -4;
+
+	fd_root = open(argv[2], O_WRONLY);
+	if (fd < 0)
+		return -5;
+
+	/* Move group leader to /dev/cgroup/ root */
+	if (write(fd_root, cpid, strlen(cpid)) < 0)
+		return -6;
+
+	/* Move back group to /dev/cgroup/cgroup0 */
+	if (write(fd, cpid, strlen(cpid)) < 0)
+		return -7;
+
+	close(fd_root);
+	close(fd);
+
+	pthread_mutex_lock(&mutex);
+	thread_end = 1;
+	pthread_cond_signal(&cond);
+	pthread_mutex_unlock(&mutex);
+
+	pthread_join(thread, NULL);
+
+	return 0;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2013-12-12 21:35   ` [PATCH 9/9] selftests: Add a new task counter selftest Dwight Engen
@ 2013-12-20 21:13   ` Serge E. Hallyn
  2013-12-23 12:54   ` Frederic Weisbecker
  10 siblings, 0 replies; 13+ messages in thread
From: Serge E. Hallyn @ 2013-12-20 21:13 UTC (permalink / raw)
  To: Dwight Engen
  Cc: Frederic Weisbecker,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Max Kellermann

Quoting Dwight Engen (dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org):
> Hello,
> 
> I've seen that some sort of fork/task limiter has been proposed and
> discussed several times before. Despite the existance of kmem in memcg, a
> fork limiter is still often asked for by container users. Perhaps this is
> due to current granularity in kmem (ie. stack/struct task not split out from
> other slab allocations) but I believe it is just more natural for users to
> express a limit in terms of tasks.

Sorry, I thought I had replied to this.  I'm in support of this patchset,
and think you'd be better served just sending it to lkml.

Unfortunately I will be out for most of the rest of the year, but
please cc: me here and I'll do what I can to support you.

Assuming you've run ltp with and without your patchset and saw no
change in behavior, you should mention that here as well.

> So what I've done is updated Frederic Weisbecker's task counter patchset and
> tried to address the concerns that I saw people had raised. This involved
> the following changes:
> 
> - merged into cpuacct controller, as it seems there is a desire not to add
>   new controllers, this controller is already heirarchical, and I feel
>   limiting number of tasks/forks fits best here
> - included a fork_limit similar to the one Max Kellermann posted
>   (https://lkml.org/lkml/2011/2/17/116) which is a use case not addressable
>   with memcg
> - ala memcg kmem, for performance reasons don't account unless limit is set
> - ala memcg, allow usage to roll up to root (prevents warnings on
>   uncharge), but still don't allow setting limits in root
> - changed the interface at fork()/exit(), adding
>   can_fork()/cancel_can_fork() modeled on can_attach(). cgroup_fork()
>   can now return failure to fork().
> - ran Frederics selftests, and added a couple more
> 
> I also wrote a small fork micro benchmark to see how this change affected
> performance. I did 20 runs of 100000 fork/exit/waitpid, and took the
> average. Times are in seconds, base is without the change, cpuacct1 is with
> the change but no accounting be done (ie. no limit set), and cpuacct2 is
> with the test being in a cgroup 1 level deep.
> 
> base  cpuacct1  cpuaac2
> 5.59  5.59      5.64
> 
> So I believe this change has minimal performance impact, especially when no
> limit is set.
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API
       [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2013-12-20 21:13   ` [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API Serge E. Hallyn
@ 2013-12-23 12:54   ` Frederic Weisbecker
  10 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2013-12-23 12:54 UTC (permalink / raw)
  To: Dwight Engen, Tejun Heo, Glauber Costa, Vladimir Davydov
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Max Kellermann

On Thu, Dec 12, 2013 at 04:35:09PM -0500, Dwight Engen wrote:
> Hello,
> 
> I've seen that some sort of fork/task limiter has been proposed and
> discussed several times before. Despite the existance of kmem in memcg, a
> fork limiter is still often asked for by container users. Perhaps this is
> due to current granularity in kmem (ie. stack/struct task not split out from
> other slab allocations) but I believe it is just more natural for users to
> express a limit in terms of tasks.
> 
> So what I've done is updated Frederic Weisbecker's task counter patchset and
> tried to address the concerns that I saw people had raised. This involved
> the following changes:
> 
> - merged into cpuacct controller, as it seems there is a desire not to add
>   new controllers, this controller is already heirarchical, and I feel
>   limiting number of tasks/forks fits best here
> - included a fork_limit similar to the one Max Kellermann posted
>   (https://lkml.org/lkml/2011/2/17/116) which is a use case not addressable
>   with memcg
> - ala memcg kmem, for performance reasons don't account unless limit is set
> - ala memcg, allow usage to roll up to root (prevents warnings on
>   uncharge), but still don't allow setting limits in root
> - changed the interface at fork()/exit(), adding
>   can_fork()/cancel_can_fork() modeled on can_attach(). cgroup_fork()
>   can now return failure to fork().
> - ran Frederics selftests, and added a couple more
> 
> I also wrote a small fork micro benchmark to see how this change affected
> performance. I did 20 runs of 100000 fork/exit/waitpid, and took the
> average. Times are in seconds, base is without the change, cpuacct1 is with
> the change but no accounting be done (ie. no limit set), and cpuacct2 is
> with the test being in a cgroup 1 level deep.
> 
> base  cpuacct1  cpuaac2
> 5.59  5.59      5.64
> 
> So I believe this change has minimal performance impact, especially when no
> limit is set.

Sorry for the late answer on this. I'm adding cgroup mailing list and a few
cgroups/kmem people in Cc.

This patchset was eventually abandoned because kmem became the prefered way to
limit the number of tasks that can be forked.

Have you tried it?

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API
       [not found]   ` <20131223125410.GA585-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2014-01-26 15:56     ` Serge E. Hallyn
  0 siblings, 0 replies; 13+ messages in thread
From: Serge E. Hallyn @ 2014-01-26 15:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Vladimir Davydov, Max Kellermann, Glauber Costa,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Quoting Frederic Weisbecker (fweisbec-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> On Thu, Dec 12, 2013 at 04:35:09PM -0500, Dwight Engen wrote:
> > Hello,
> > 
> > I've seen that some sort of fork/task limiter has been proposed and
> > discussed several times before. Despite the existance of kmem in memcg, a
> > fork limiter is still often asked for by container users. Perhaps this is
> > due to current granularity in kmem (ie. stack/struct task not split out from
> > other slab allocations) but I believe it is just more natural for users to
> > express a limit in terms of tasks.
> > 
> > So what I've done is updated Frederic Weisbecker's task counter patchset and
> > tried to address the concerns that I saw people had raised. This involved
> > the following changes:
> > 
> > - merged into cpuacct controller, as it seems there is a desire not to add
> >   new controllers, this controller is already heirarchical, and I feel
> >   limiting number of tasks/forks fits best here
> > - included a fork_limit similar to the one Max Kellermann posted
> >   (https://lkml.org/lkml/2011/2/17/116) which is a use case not addressable
> >   with memcg
> > - ala memcg kmem, for performance reasons don't account unless limit is set
> > - ala memcg, allow usage to roll up to root (prevents warnings on
> >   uncharge), but still don't allow setting limits in root
> > - changed the interface at fork()/exit(), adding
> >   can_fork()/cancel_can_fork() modeled on can_attach(). cgroup_fork()
> >   can now return failure to fork().
> > - ran Frederics selftests, and added a couple more
> > 
> > I also wrote a small fork micro benchmark to see how this change affected
> > performance. I did 20 runs of 100000 fork/exit/waitpid, and took the
> > average. Times are in seconds, base is without the change, cpuacct1 is with
> > the change but no accounting be done (ie. no limit set), and cpuacct2 is
> > with the test being in a cgroup 1 level deep.
> > 
> > base  cpuacct1  cpuaac2
> > 5.59  5.59      5.64
> > 
> > So I believe this change has minimal performance impact, especially when no
> > limit is set.
> 
> Sorry for the late answer on this. I'm adding cgroup mailing list and a few
> cgroups/kmem people in Cc.
> 
> This patchset was eventually abandoned because kmem became the prefered way to
> limit the number of tasks that can be forked.
> 
> Have you tried it?

It seems like this could cause false positives if the cgroup was in
some other way using quite a bit of kernel memory (reading a big
ecryptfs file?).  Certainly it would be a gross rather than fine-tuned
knob.

Have people actually used the kmem controller this way and found it
useful and effective?  Is there any guidance on reasonable limits to
use?

-serge

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-01-26 15:56 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-12 21:35 [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API Dwight Engen
     [not found] ` <1386884118-14972-1-git-send-email-dwight.engen-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2013-12-12 21:35   ` [PATCH 1/9] " Dwight Engen
2013-12-12 21:35   ` [PATCH 2/9] cgroups: new resource counter inheritance API Dwight Engen
2013-12-12 21:35   ` [PATCH 3/9] cgroups: ability to stop res charge propagation on bounded ancestor Dwight Engen
2013-12-12 21:35   ` [PATCH 4/9] cgroups: add res counter common ancestor searching Dwight Engen
2013-12-12 21:35   ` [PATCH 5/9] res_counter: allow charge failure pointer to be null Dwight Engen
2013-12-12 21:35   ` [PATCH 6/9] cgroups: pull up res counter charge failure interpretation to caller Dwight Engen
2013-12-12 21:35   ` [PATCH 7/9] fix compile error in cgroup_taskset_for_each() when skip_css is NULL Dwight Engen
2013-12-12 21:35   ` [PATCH 8/9] cgroups: Add task and fork limits to cpuacct subsystem Dwight Engen
2013-12-12 21:35   ` [PATCH 9/9] selftests: Add a new task counter selftest Dwight Engen
2013-12-20 21:13   ` [PATCH RFC 0/9] cgroups: add res_counter_write_u64() API Serge E. Hallyn
2013-12-23 12:54   ` Frederic Weisbecker
     [not found] ` <20131223125410.GA585@localhost.localdomain>
     [not found]   ` <20131223125410.GA585-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2014-01-26 15:56     ` Serge E. Hallyn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox