linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes
@ 2009-08-18 23:58 Paul Menage
  2009-08-18 23:58 ` [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869 Paul Menage
                   ` (7 more replies)
  0 siblings, 8 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.

The subsystem "attach" interface is modified to support attaching
whole threadgroups at a time, which could introduce problems if any
subsystem were to need to access the old cgroup of every thread being
moved. The attach interface may need to be revised if this becomes the
case.

Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, chosen because it is per-threadgroup and lets the rwsem
share a cacheline with the sighand's atomic count. This scheme should
introduce no extra overhead in the fork path when there's no
contention.
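
For illustration only, a minimal sketch of the intended fork-side usage
(the helper names below are made up; the actual helpers are added to
kernel/cgroup.c in patch 7): a CLONE_THREAD fork takes the rwsem for
reading, so the uncontended case costs only an uncontended down_read().

/* Hedged sketch, not the patch itself; helper names are illustrative. */
static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
{
	down_read(&tsk->sighand->threadgroup_fork_lock);
}

static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
{
	up_read(&tsk->sighand->threadgroup_fork_lock);
}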

The final patch reveals a potential race when forking before a
subsystem's attach function is called. One possible solution, should
any subsystem turn out to have this problem, is to hold the group's
fork mutex through the attach() calls, though no subsystem yet
demonstrates a need for such an extended critical section.

---

Ben Blum (7):
      Adds ability to move all threads in a process to a new cgroup atomically
      Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
      Lets ss->can_attach and ss->attach do whole threadgroups at a time
      Changes css_set freeing mechanism to be under RCU
      Use vmalloc for large cgroups pidlist allocations
      Ensures correct concurrent opening/reading of pidlists across pid namespaces
      Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

Paul Menage (1):
      Revert commit 8827c288feb7810185aa7c2e37537202fc709869


 Documentation/cgroups/cgroups.txt |   25 +
 include/linux/cgroup.h            |   67 ++-
 include/linux/init_task.h         |    9 
 include/linux/sched.h             |   15 +
 kernel/cgroup.c                   |  948 +++++++++++++++++++++++++++++--------
 kernel/cgroup_freezer.c           |   15 +
 kernel/cpuset.c                   |   66 ++-
 kernel/fork.c                     |    9 
 kernel/ns_cgroup.c                |   16 +
 kernel/sched.c                    |   35 +
 mm/memcontrol.c                   |    3 
 security/device_cgroup.c          |    3 
 12 files changed, 974 insertions(+), 237 deletions(-)



* [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  2:34   ` Li Zefan
  2009-08-20 21:13   ` Andrew Morton
  2009-08-18 23:58 ` [PATCH 2/8] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Paul Menage
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers


Revert commit 8827c288feb7810185aa7c2e37537202fc709869

This is in preparation for some clashing cgroups changes that subsume
the original commit's functionality.

Signed-off-by: Paul Menage <menage@google.com>

--

The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches. I played around with trying to reconcile Ben's
patch series with Li's patch, but concluded that it was simpler to
just revert Li's, given that Ben's patch series contained essentially
the same fix.

---

 include/linux/cgroup.h |   11 ++++--
 kernel/cgroup.c        |   95 ++++++++++++------------------------------------
 2 files changed, 31 insertions(+), 75 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 90bba9e..c833d6f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -179,11 +179,14 @@ struct cgroup {
 	 */
 	struct list_head release_list;
 
-	/* pids_mutex protects pids_list and cached pid arrays. */
+	/* pids_mutex protects the fields below */
 	struct rw_semaphore pids_mutex;
-
-	/* Linked list of struct cgroup_pids */
-	struct list_head pids_list;
+	/* Array of process ids in the cgroup */
+	pid_t *tasks_pids;
+	/* How many files are using the current tasks_pids array */
+	int pids_use_count;
+	/* Length of the current tasks_pids array */
+	int pids_length;
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 89d63eb..cea36c3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1121,7 +1121,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->children);
 	INIT_LIST_HEAD(&cgrp->css_sets);
 	INIT_LIST_HEAD(&cgrp->release_list);
-	INIT_LIST_HEAD(&cgrp->pids_list);
 	init_rwsem(&cgrp->pids_mutex);
 }
 
@@ -2431,30 +2430,12 @@ err:
 	return ret;
 }
 
-/*
- * Cache pids for all threads in the same pid namespace that are
- * opening the same "tasks" file.
- */
-struct cgroup_pids {
-	/* The node in cgrp->pids_list */
-	struct list_head list;
-	/* The cgroup those pids belong to */
-	struct cgroup *cgrp;
-	/* The namepsace those pids belong to */
-	struct pid_namespace *ns;
-	/* Array of process ids in the cgroup */
-	pid_t *tasks_pids;
-	/* How many files are using the this tasks_pids array */
-	int use_count;
-	/* Length of the current tasks_pids array */
-	int length;
-};
-
 static int cmppid(const void *a, const void *b)
 {
 	return *(pid_t *)a - *(pid_t *)b;
 }
 
+
 /*
  * seq_file methods for the "tasks" file. The seq_file position is the
  * next pid to display; the seq_file iterator is a pointer to the pid
@@ -2469,47 +2450,45 @@ static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
 	 * after a seek to the start). Use a binary-search to find the
 	 * next pid to display, if any
 	 */
-	struct cgroup_pids *cp = s->private;
-	struct cgroup *cgrp = cp->cgrp;
+	struct cgroup *cgrp = s->private;
 	int index = 0, pid = *pos;
 	int *iter;
 
 	down_read(&cgrp->pids_mutex);
 	if (pid) {
-		int end = cp->length;
+		int end = cgrp->pids_length;
 
 		while (index < end) {
 			int mid = (index + end) / 2;
-			if (cp->tasks_pids[mid] == pid) {
+			if (cgrp->tasks_pids[mid] == pid) {
 				index = mid;
 				break;
-			} else if (cp->tasks_pids[mid] <= pid)
+			} else if (cgrp->tasks_pids[mid] <= pid)
 				index = mid + 1;
 			else
 				end = mid;
 		}
 	}
 	/* If we're off the end of the array, we're done */
-	if (index >= cp->length)
+	if (index >= cgrp->pids_length)
 		return NULL;
 	/* Update the abstract position to be the actual pid that we found */
-	iter = cp->tasks_pids + index;
+	iter = cgrp->tasks_pids + index;
 	*pos = *iter;
 	return iter;
 }
 
 static void cgroup_tasks_stop(struct seq_file *s, void *v)
 {
-	struct cgroup_pids *cp = s->private;
-	struct cgroup *cgrp = cp->cgrp;
+	struct cgroup *cgrp = s->private;
 	up_read(&cgrp->pids_mutex);
 }
 
 static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
 {
-	struct cgroup_pids *cp = s->private;
+	struct cgroup *cgrp = s->private;
 	int *p = v;
-	int *end = cp->tasks_pids + cp->length;
+	int *end = cgrp->tasks_pids + cgrp->pids_length;
 
 	/*
 	 * Advance to the next pid in the array. If this goes off the
@@ -2536,33 +2515,26 @@ static const struct seq_operations cgroup_tasks_seq_operations = {
 	.show = cgroup_tasks_show,
 };
 
-static void release_cgroup_pid_array(struct cgroup_pids *cp)
+static void release_cgroup_pid_array(struct cgroup *cgrp)
 {
-	struct cgroup *cgrp = cp->cgrp;
-
 	down_write(&cgrp->pids_mutex);
-	BUG_ON(!cp->use_count);
-	if (!--cp->use_count) {
-		list_del(&cp->list);
-		put_pid_ns(cp->ns);
-		kfree(cp->tasks_pids);
-		kfree(cp);
+	BUG_ON(!cgrp->pids_use_count);
+	if (!--cgrp->pids_use_count) {
+		kfree(cgrp->tasks_pids);
+		cgrp->tasks_pids = NULL;
+		cgrp->pids_length = 0;
 	}
 	up_write(&cgrp->pids_mutex);
 }
 
 static int cgroup_tasks_release(struct inode *inode, struct file *file)
 {
-	struct seq_file *seq;
-	struct cgroup_pids *cp;
+	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
 
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
 
-	seq = file->private_data;
-	cp = seq->private;
-
-	release_cgroup_pid_array(cp);
+	release_cgroup_pid_array(cgrp);
 	return seq_release(inode, file);
 }
 
@@ -2581,8 +2553,6 @@ static struct file_operations cgroup_tasks_operations = {
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-	struct pid_namespace *ns = current->nsproxy->pid_ns;
-	struct cgroup_pids *cp;
 	pid_t *pidarray;
 	int npids;
 	int retval;
@@ -2609,37 +2579,20 @@ static int cgroup_tasks_open(struct inode *unused, struct file *file)
 	 * array if necessary
 	 */
 	down_write(&cgrp->pids_mutex);
-
-	list_for_each_entry(cp, &cgrp->pids_list, list) {
-		if (ns == cp->ns)
-			goto found;
-	}
-
-	cp = kzalloc(sizeof(*cp), GFP_KERNEL);
-	if (!cp) {
-		up_write(&cgrp->pids_mutex);
-		kfree(pidarray);
-		return -ENOMEM;
-	}
-	cp->cgrp = cgrp;
-	cp->ns = ns;
-	get_pid_ns(ns);
-	list_add(&cp->list, &cgrp->pids_list);
-found:
-	kfree(cp->tasks_pids);
-	cp->tasks_pids = pidarray;
-	cp->length = npids;
-	cp->use_count++;
+	kfree(cgrp->tasks_pids);
+	cgrp->tasks_pids = pidarray;
+	cgrp->pids_length = npids;
+	cgrp->pids_use_count++;
 	up_write(&cgrp->pids_mutex);
 
 	file->f_op = &cgroup_tasks_operations;
 
 	retval = seq_open(file, &cgroup_tasks_seq_operations);
 	if (retval) {
-		release_cgroup_pid_array(cp);
+		release_cgroup_pid_array(cgrp);
 		return retval;
 	}
-	((struct seq_file *)file->private_data)->private = cp;
+	((struct seq_file *)file->private_data)->private = cgrp;
 	return 0;
 }
 



* [PATCH 2/8] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
  2009-08-18 23:58 ` [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869 Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  2:34   ` Li Zefan
  2009-08-18 23:58 ` [PATCH 3/8] Ensures correct concurrent opening/reading of pidlists across pid namespaces Paul Menage
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Adds a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the pidlist
for the tasks file. Those are now split out into a new struct cgroup_pidlist,
of which each cgroup has two: one for procs and one for tasks. The seq_file
operations are reworked so that just the pidlist struct gets passed around as
the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 include/linux/cgroup.h |   22 ++--
 kernel/cgroup.c        |  278 ++++++++++++++++++++++++++++++------------------
 2 files changed, 186 insertions(+), 114 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c833d6f..2357733 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,6 +141,17 @@ enum {
 	CGRP_WAIT_ON_RMDIR,
 };
 
+struct cgroup_pidlist {
+	/* protects the other fields */
+	struct rw_semaphore mutex;
+	/* array of xids */
+	pid_t *list;
+	/* how many elements the above list has */
+	int length;
+	/* how many files are using the current array */
+	int use_count;
+};
+
 struct cgroup {
 	unsigned long flags;		/* "unsigned long" so bitops work */
 
@@ -179,14 +190,9 @@ struct cgroup {
 	 */
 	struct list_head release_list;
 
-	/* pids_mutex protects the fields below */
-	struct rw_semaphore pids_mutex;
-	/* Array of process ids in the cgroup */
-	pid_t *tasks_pids;
-	/* How many files are using the current tasks_pids array */
-	int pids_use_count;
-	/* Length of the current tasks_pids array */
-	int pids_length;
+	/* we will have two separate pidlists, one for pids (the tasks file)
+	 * and one for tgids (the procs file). */
+	struct cgroup_pidlist tasks, procs;
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cea36c3..18d1403 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1121,7 +1121,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->children);
 	INIT_LIST_HEAD(&cgrp->css_sets);
 	INIT_LIST_HEAD(&cgrp->release_list);
-	init_rwsem(&cgrp->pids_mutex);
+	init_rwsem(&(cgrp->tasks.mutex));
+	init_rwsem(&(cgrp->procs.mutex));
 }
 
 static void init_cgroup_root(struct cgroupfs_root *root)
@@ -1637,15 +1638,6 @@ static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 	return ret;
 }
 
-/* The various types of files and directories in a cgroup file system */
-enum cgroup_filetype {
-	FILE_ROOT,
-	FILE_DIR,
-	FILE_TASKLIST,
-	FILE_NOTIFY_ON_RELEASE,
-	FILE_RELEASE_AGENT,
-};
-
 /**
  * cgroup_lock_live_group - take cgroup_mutex and check that cgrp is alive.
  * @cgrp: the cgroup to be checked for liveness
@@ -2343,7 +2335,7 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
 }
 
 /*
- * Stuff for reading the 'tasks' file.
+ * Stuff for reading the 'tasks'/'procs' files.
  *
  * Reading this file can return large amounts of data if a cgroup has
  * *lots* of attached tasks. So it may need several calls to read(),
@@ -2353,27 +2345,106 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
  */
 
 /*
- * Load into 'pidarray' up to 'npids' of the tasks using cgroup
- * 'cgrp'.  Return actual number of pids loaded.  No need to
- * task_lock(p) when reading out p->cgroup, since we're in an RCU
- * read section, so the css_set can't go away, and is
- * immutable after creation.
+ * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
+ * If the new stripped list is sufficiently smaller and there's enough memory
+ * to allocate a new buffer, will let go of the unneeded memory. Returns the
+ * number of unique elements.
  */
-static int pid_array_load(pid_t *pidarray, int npids, struct cgroup *cgrp)
+/* is the size difference enough that we should re-allocate the array? */
+#define PIDLIST_REALLOC_DIFFERENCE(old, new) ((old) - PAGE_SIZE >= (new))
+static int pidlist_uniq(pid_t **p, int length)
 {
-	int n = 0, pid;
+	int src, dest = 1;
+	pid_t *list = *p;
+	pid_t *newlist;
+
+	/*
+	 * we presume the 0th element is unique, so i starts at 1. trivial
+	 * edge cases first; no work needs to be done for either
+	 */
+	if (length == 0 || length == 1)
+		return length;
+	/* src and dest walk down the list; dest counts unique elements */
+	for (src = 1; src < length; src++) {
+		/* find next unique element */
+		while (list[src] == list[src-1]) {
+			src++;
+			if (src == length)
+				goto after;
+		}
+		/* dest always points to where the next unique element goes */
+		list[dest] = list[src];
+		dest++;
+	}
+after:
+	/*
+	 * if the length difference is large enough, we want to allocate a
+	 * smaller buffer to save memory. if this fails due to out of memory,
+	 * we'll just stay with what we've got.
+	 */
+	if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
+		newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+		if (newlist)
+			*p = newlist;
+	}
+	return dest;
+}
+
+static int cmppid(const void *a, const void *b)
+{
+	return *(pid_t *)a - *(pid_t *)b;
+}
+
+/*
+ * Load a cgroup's pidarray with either procs' tgids or tasks' pids
+ */
+static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+{
+	pid_t *array;
+	int length;
+	int pid, n = 0; /* used for populating the array */
 	struct cgroup_iter it;
 	struct task_struct *tsk;
+	struct cgroup_pidlist *l;
+
+	/*
+	 * If cgroup gets more users after we read count, we won't have
+	 * enough space - tough.  This race is indistinguishable to the
+	 * caller from the case that the additional cgroup users didn't
+	 * show up until sometime later on.
+	 */
+	length = cgroup_task_count(cgrp);
+	array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+	if (!array)
+		return -ENOMEM;
+	/* now, populate the array */
 	cgroup_iter_start(cgrp, &it);
 	while ((tsk = cgroup_iter_next(cgrp, &it))) {
-		if (unlikely(n == npids))
+		if (unlikely(n == length))
 			break;
-		pid = task_pid_vnr(tsk);
-		if (pid > 0)
-			pidarray[n++] = pid;
+		/* get tgid or pid for procs or tasks file respectively */
+		pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+		if (pid > 0) /* make sure to only use valid results */
+			array[n++] = pid;
 	}
 	cgroup_iter_end(cgrp, &it);
-	return n;
+	length = n;
+	/* now sort & (if procs) strip out duplicates */
+	sort(array, length, sizeof(pid_t), cmppid, NULL);
+	if (procs) {
+		length = pidlist_uniq(&array, length);
+		l = &(cgrp->procs);
+	} else {
+		l = &(cgrp->tasks);
+	}
+	/* store array in cgroup, freeing old if necessary */
+	down_write(&l->mutex);
+	kfree(l->list);
+	l->list = array;
+	l->length = length;
+	l->use_count++;
+	up_write(&l->mutex);
+	return 0;
 }
 
 /**
@@ -2430,19 +2501,14 @@ err:
 	return ret;
 }
 
-static int cmppid(const void *a, const void *b)
-{
-	return *(pid_t *)a - *(pid_t *)b;
-}
-
 
 /*
- * seq_file methods for the "tasks" file. The seq_file position is the
+ * seq_file methods for the tasks/procs files. The seq_file position is the
  * next pid to display; the seq_file iterator is a pointer to the pid
- * in the cgroup->tasks_pids array.
+ * in the cgroup->l->list array.
  */
 
-static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
+static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
 {
 	/*
 	 * Initially we receive a position value that corresponds to
@@ -2450,46 +2516,45 @@ static void *cgroup_tasks_start(struct seq_file *s, loff_t *pos)
 	 * after a seek to the start). Use a binary-search to find the
 	 * next pid to display, if any
 	 */
-	struct cgroup *cgrp = s->private;
+	struct cgroup_pidlist *l = s->private;
 	int index = 0, pid = *pos;
 	int *iter;
 
-	down_read(&cgrp->pids_mutex);
+	down_read(&l->mutex);
 	if (pid) {
-		int end = cgrp->pids_length;
+		int end = l->length;
 
 		while (index < end) {
 			int mid = (index + end) / 2;
-			if (cgrp->tasks_pids[mid] == pid) {
+			if (l->list[mid] == pid) {
 				index = mid;
 				break;
-			} else if (cgrp->tasks_pids[mid] <= pid)
+			} else if (l->list[mid] <= pid)
 				index = mid + 1;
 			else
 				end = mid;
 		}
 	}
 	/* If we're off the end of the array, we're done */
-	if (index >= cgrp->pids_length)
+	if (index >= l->length)
 		return NULL;
 	/* Update the abstract position to be the actual pid that we found */
-	iter = cgrp->tasks_pids + index;
+	iter = l->list + index;
 	*pos = *iter;
 	return iter;
 }
 
-static void cgroup_tasks_stop(struct seq_file *s, void *v)
+static void cgroup_pidlist_stop(struct seq_file *s, void *v)
 {
-	struct cgroup *cgrp = s->private;
-	up_read(&cgrp->pids_mutex);
+	struct cgroup_pidlist *l = s->private;
+	up_read(&l->mutex);
 }
 
-static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
+static void *cgroup_pidlist_next(struct seq_file *s, void *v, loff_t *pos)
 {
-	struct cgroup *cgrp = s->private;
-	int *p = v;
-	int *end = cgrp->tasks_pids + cgrp->pids_length;
-
+	struct cgroup_pidlist *l = s->private;
+	pid_t *p = v;
+	pid_t *end = l->list + l->length;
 	/*
 	 * Advance to the next pid in the array. If this goes off the
 	 * end, we're done
@@ -2503,98 +2568,94 @@ static void *cgroup_tasks_next(struct seq_file *s, void *v, loff_t *pos)
 	}
 }
 
-static int cgroup_tasks_show(struct seq_file *s, void *v)
+static int cgroup_pidlist_show(struct seq_file *s, void *v)
 {
 	return seq_printf(s, "%d\n", *(int *)v);
 }
 
-static const struct seq_operations cgroup_tasks_seq_operations = {
-	.start = cgroup_tasks_start,
-	.stop = cgroup_tasks_stop,
-	.next = cgroup_tasks_next,
-	.show = cgroup_tasks_show,
+/*
+ * seq_operations functions for iterating on pidlists through seq_file -
+ * independent of whether it's tasks or procs
+ */
+static const struct seq_operations cgroup_pidlist_seq_operations = {
+	.start = cgroup_pidlist_start,
+	.stop = cgroup_pidlist_stop,
+	.next = cgroup_pidlist_next,
+	.show = cgroup_pidlist_show,
 };
 
-static void release_cgroup_pid_array(struct cgroup *cgrp)
+static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 {
-	down_write(&cgrp->pids_mutex);
-	BUG_ON(!cgrp->pids_use_count);
-	if (!--cgrp->pids_use_count) {
-		kfree(cgrp->tasks_pids);
-		cgrp->tasks_pids = NULL;
-		cgrp->pids_length = 0;
+	down_write(&l->mutex);
+	BUG_ON(!l->use_count);
+	if (!--l->use_count) {
+		kfree(l->list);
+		l->list = NULL;
+		l->length = 0;
 	}
-	up_write(&cgrp->pids_mutex);
+	up_write(&l->mutex);
 }
 
-static int cgroup_tasks_release(struct inode *inode, struct file *file)
+static int cgroup_pidlist_release(struct inode *inode, struct file *file)
 {
-	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-
+	struct cgroup_pidlist *l;
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
-
-	release_cgroup_pid_array(cgrp);
+	/*
+	 * the seq_file will only be initialized if the file was opened for
+	 * reading; hence we check if it's not null only in that case.
+	 */
+	l = ((struct seq_file *)file->private_data)->private;
+	cgroup_release_pid_array(l);
 	return seq_release(inode, file);
 }
 
-static struct file_operations cgroup_tasks_operations = {
+static const struct file_operations cgroup_pidlist_operations = {
 	.read = seq_read,
 	.llseek = seq_lseek,
 	.write = cgroup_file_write,
-	.release = cgroup_tasks_release,
+	.release = cgroup_pidlist_release,
 };
 
 /*
- * Handle an open on 'tasks' file.  Prepare an array containing the
- * process id's of tasks currently attached to the cgroup being opened.
+ * The following functions handle opens on a file that displays a pidlist
+ * (tasks or procs). Prepare an array of the process/thread IDs of whoever's
+ * in the cgroup.
  */
-
-static int cgroup_tasks_open(struct inode *unused, struct file *file)
+/* helper function for the two below it */
+static int cgroup_pidlist_open(struct file *file, bool procs)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-	pid_t *pidarray;
-	int npids;
+	struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
 	int retval;
 
 	/* Nothing to do for write-only files */
 	if (!(file->f_mode & FMODE_READ))
 		return 0;
 
-	/*
-	 * If cgroup gets more users after we read count, we won't have
-	 * enough space - tough.  This race is indistinguishable to the
-	 * caller from the case that the additional cgroup users didn't
-	 * show up until sometime later on.
-	 */
-	npids = cgroup_task_count(cgrp);
-	pidarray = kmalloc(npids * sizeof(pid_t), GFP_KERNEL);
-	if (!pidarray)
-		return -ENOMEM;
-	npids = pid_array_load(pidarray, npids, cgrp);
-	sort(pidarray, npids, sizeof(pid_t), cmppid, NULL);
-
-	/*
-	 * Store the array in the cgroup, freeing the old
-	 * array if necessary
-	 */
-	down_write(&cgrp->pids_mutex);
-	kfree(cgrp->tasks_pids);
-	cgrp->tasks_pids = pidarray;
-	cgrp->pids_length = npids;
-	cgrp->pids_use_count++;
-	up_write(&cgrp->pids_mutex);
-
-	file->f_op = &cgroup_tasks_operations;
+	/* have the array populated */
+	retval = pidlist_array_load(cgrp, procs);
+	if (retval)
+		return retval;
+	/* configure file information */
+	file->f_op = &cgroup_pidlist_operations;
 
-	retval = seq_open(file, &cgroup_tasks_seq_operations);
+	retval = seq_open(file, &cgroup_pidlist_seq_operations);
 	if (retval) {
-		release_cgroup_pid_array(cgrp);
+		cgroup_release_pid_array(l);
 		return retval;
 	}
-	((struct seq_file *)file->private_data)->private = cgrp;
+	((struct seq_file *)file->private_data)->private = l;
 	return 0;
 }
+static int cgroup_tasks_open(struct inode *unused, struct file *file)
+{
+	return cgroup_pidlist_open(file, false);
+}
+static int cgroup_procs_open(struct inode *unused, struct file *file)
+{
+	return cgroup_pidlist_open(file, true);
+}
 
 static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,
 					    struct cftype *cft)
@@ -2617,21 +2678,27 @@ static int cgroup_write_notify_on_release(struct cgroup *cgrp,
 /*
  * for the common functions, 'private' gives the type of file
  */
+/* for hysterical raisins, we can't put this on the older files */
+#define CGROUP_FILE_GENERIC_PREFIX "cgroup."
 static struct cftype files[] = {
 	{
 		.name = "tasks",
 		.open = cgroup_tasks_open,
 		.write_u64 = cgroup_tasks_write,
-		.release = cgroup_tasks_release,
-		.private = FILE_TASKLIST,
+		.release = cgroup_pidlist_release,
 		.mode = S_IRUGO | S_IWUSR,
 	},
-
+	{
+		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
+		.open = cgroup_procs_open,
+		/* .write_u64 = cgroup_procs_write, TODO */
+		.release = cgroup_pidlist_release,
+		.mode = S_IRUGO,
+	},
 	{
 		.name = "notify_on_release",
 		.read_u64 = cgroup_read_notify_on_release,
 		.write_u64 = cgroup_write_notify_on_release,
-		.private = FILE_NOTIFY_ON_RELEASE,
 	},
 };
 
@@ -2640,7 +2707,6 @@ static struct cftype cft_release_agent = {
 	.read_seq_string = cgroup_release_agent_show,
 	.write_string = cgroup_release_agent_write,
 	.max_write_len = PATH_MAX,
-	.private = FILE_RELEASE_AGENT,
 };
 
 static int cgroup_populate_dir(struct cgroup *cgrp)



* [PATCH 3/8] Ensures correct concurrent opening/reading of pidlists across pid namespaces
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
  2009-08-18 23:58 ` [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869 Paul Menage
  2009-08-18 23:58 ` [PATCH 2/8] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-18 23:58 ` [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations Paul Menage
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Ensures correct concurrent opening/reading of pidlists across pid namespaces

Previously, two processes from different pid namespaces reading the tasks or
procs file could result in one process seeing pids from the other's
namespace. Rather than keeping one pidlist per file in a cgroup, we now keep
a list of pidlists keyed by namespace and file type (tasks versus procs),
with entries created on demand. Each pidlist has its own lock, and because
the pidlists themselves are passed around in the seq_file's private pointer,
we don't have to touch the cgroup or its master list except when creating
and destroying entries.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 include/linux/cgroup.h |   34 +++++++++++++--
 kernel/cgroup.c        |  107 ++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 119 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2357733..88e8634 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -141,15 +141,36 @@ enum {
 	CGRP_WAIT_ON_RMDIR,
 };
 
+/* which pidlist file are we talking about? */
+enum cgroup_filetype {
+	CGROUP_FILE_PROCS,
+	CGROUP_FILE_TASKS,
+};
+
+/*
+ * A pidlist is a list of pids that virtually represents the contents of one
+ * of the cgroup files ("procs" or "tasks"). We keep a list of such pidlists,
+ * a pair (one each for procs, tasks) for each pid namespace that's relevant
+ * to the cgroup.
+ */
 struct cgroup_pidlist {
-	/* protects the other fields */
-	struct rw_semaphore mutex;
+	/*
+	 * used to find which pidlist is wanted. doesn't change as long as
+	 * this particular list stays in the list.
+	 */
+	struct { enum cgroup_filetype type; struct pid_namespace *ns; } key;
 	/* array of xids */
 	pid_t *list;
 	/* how many elements the above list has */
 	int length;
 	/* how many files are using the current array */
 	int use_count;
+	/* each of these stored in a list by its cgroup */
+	struct list_head links;
+	/* pointer to the cgroup we belong to, for list removal purposes */
+	struct cgroup *owner;
+	/* protects the other fields */
+	struct rw_semaphore mutex;
 };
 
 struct cgroup {
@@ -190,9 +211,12 @@ struct cgroup {
 	 */
 	struct list_head release_list;
 
-	/* we will have two separate pidlists, one for pids (the tasks file)
-	 * and one for tgids (the procs file). */
-	struct cgroup_pidlist tasks, procs;
+	/*
+	 * list of pidlists, up to two for each namespace (one for procs, one
+	 * for tasks); created on demand.
+	 */
+	struct list_head pidlists;
+	struct mutex pidlist_mutex;
 
 	/* For RCU-protected deletion */
 	struct rcu_head rcu_head;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 18d1403..d207871 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -776,6 +776,12 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 		 */
 		deactivate_super(cgrp->root->sb);
 
+		/*
+		 * if we're getting rid of the cgroup, refcount should ensure
+		 * that there are no pidlists left.
+		 */
+		BUG_ON(!list_empty(&cgrp->pidlists));
+
 		call_rcu(&cgrp->rcu_head, free_cgroup_rcu);
 	}
 	iput(inode);
@@ -1121,8 +1127,8 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
 	INIT_LIST_HEAD(&cgrp->children);
 	INIT_LIST_HEAD(&cgrp->css_sets);
 	INIT_LIST_HEAD(&cgrp->release_list);
-	init_rwsem(&(cgrp->tasks.mutex));
-	init_rwsem(&(cgrp->procs.mutex));
+	INIT_LIST_HEAD(&cgrp->pidlists);
+	mutex_init(&cgrp->pidlist_mutex);
 }
 
 static void init_cgroup_root(struct cgroupfs_root *root)
@@ -2396,9 +2402,59 @@ static int cmppid(const void *a, const void *b)
 }
 
 /*
+ * find the appropriate pidlist for our purpose (given procs vs tasks)
+ * returns with the lock on that pidlist already held, and takes care
+ * of the use count, or returns NULL with no locks held if we're out of
+ * memory.
+ */
+static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
+						  enum cgroup_filetype type)
+{
+	struct cgroup_pidlist *l;
+	/* don't need task_nsproxy() if we're looking at ourself */
+	struct pid_namespace *ns = get_pid_ns(current->nsproxy->pid_ns);
+	/*
+	 * We can't drop the pidlist_mutex before taking the l->mutex in case
+	 * the last ref-holder is trying to remove l from the list at the same
+	 * time. Holding the pidlist_mutex precludes somebody taking whichever
+	 * list we find out from under us - compare release_pid_array().
+	 */
+	mutex_lock(&cgrp->pidlist_mutex);
+	list_for_each_entry(l, &cgrp->pidlists, links) {
+		if (l->key.type == type && l->key.ns == ns) {
+			/* found a matching list - drop the extra refcount */
+			put_pid_ns(ns);
+			/* make sure l doesn't vanish out from under us */
+			down_write(&l->mutex);
+			mutex_unlock(&cgrp->pidlist_mutex);
+			l->use_count++;
+			return l;
+		}
+	}
+	/* entry not found; create a new one */
+	l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
+	if (!l) {
+		mutex_unlock(&cgrp->pidlist_mutex);
+		put_pid_ns(ns);
+		return l;
+	}
+	init_rwsem(&l->mutex);
+	down_write(&l->mutex);
+	l->key.type = type;
+	l->key.ns = ns;
+	l->use_count = 0; /* don't increment here */
+	l->list = NULL;
+	l->owner = cgrp;
+	list_add(&l->links, &cgrp->pidlists);
+	mutex_unlock(&cgrp->pidlist_mutex);
+	return l;
+}
+
+/*
  * Load a cgroup's pidarray with either procs' tgids or tasks' pids
  */
-static int pidlist_array_load(struct cgroup *cgrp, bool procs)
+static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
+			      struct cgroup_pidlist **lp)
 {
 	pid_t *array;
 	int length;
@@ -2423,7 +2479,10 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
 		if (unlikely(n == length))
 			break;
 		/* get tgid or pid for procs or tasks file respectively */
-		pid = (procs ? task_tgid_vnr(tsk) : task_pid_vnr(tsk));
+		if (type == CGROUP_FILE_PROCS)
+			pid = task_tgid_vnr(tsk);
+		else
+			pid = task_pid_vnr(tsk);
 		if (pid > 0) /* make sure to only use valid results */
 			array[n++] = pid;
 	}
@@ -2431,19 +2490,20 @@ static int pidlist_array_load(struct cgroup *cgrp, bool procs)
 	length = n;
 	/* now sort & (if procs) strip out duplicates */
 	sort(array, length, sizeof(pid_t), cmppid, NULL);
-	if (procs) {
+	if (type == CGROUP_FILE_PROCS)
 		length = pidlist_uniq(&array, length);
-		l = &(cgrp->procs);
-	} else {
-		l = &(cgrp->tasks);
+	l = cgroup_pidlist_find(cgrp, type);
+	if (!l) {
+		kfree(array);
+		return -ENOMEM;
 	}
-	/* store array in cgroup, freeing old if necessary */
-	down_write(&l->mutex);
+	/* store array, freeing old if necessary - lock already held */
 	kfree(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
 	up_write(&l->mutex);
+	*lp = l;
 	return 0;
 }
 
@@ -2586,13 +2646,26 @@ static const struct seq_operations cgroup_pidlist_seq_operations = {
 
 static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 {
+	/*
+	 * the case where we're the last user of this particular pidlist will
+	 * have us remove it from the cgroup's list, which entails taking the
+	 * mutex. since in pidlist_find the pidlist->lock depends on cgroup->
+	 * pidlist_mutex, we have to take pidlist_mutex first.
+	 */
+	mutex_lock(&l->owner->pidlist_mutex);
 	down_write(&l->mutex);
 	BUG_ON(!l->use_count);
 	if (!--l->use_count) {
+		/* we're the last user if refcount is 0; remove and free */
+		list_del(&l->links);
+		mutex_unlock(&l->owner->pidlist_mutex);
 		kfree(l->list);
-		l->list = NULL;
-		l->length = 0;
+		put_pid_ns(l->key.ns);
+		up_write(&l->mutex);
+		kfree(l);
+		return;
 	}
+	mutex_unlock(&l->owner->pidlist_mutex);
 	up_write(&l->mutex);
 }
 
@@ -2623,10 +2696,10 @@ static const struct file_operations cgroup_pidlist_operations = {
  * in the cgroup.
  */
 /* helper function for the two below it */
-static int cgroup_pidlist_open(struct file *file, bool procs)
+static int cgroup_pidlist_open(struct file *file, enum cgroup_filetype type)
 {
 	struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
-	struct cgroup_pidlist *l = (procs ? &cgrp->procs : &cgrp->tasks);
+	struct cgroup_pidlist *l;
 	int retval;
 
 	/* Nothing to do for write-only files */
@@ -2634,7 +2707,7 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
 		return 0;
 
 	/* have the array populated */
-	retval = pidlist_array_load(cgrp, procs);
+	retval = pidlist_array_load(cgrp, type, &l);
 	if (retval)
 		return retval;
 	/* configure file information */
@@ -2650,11 +2723,11 @@ static int cgroup_pidlist_open(struct file *file, bool procs)
 }
 static int cgroup_tasks_open(struct inode *unused, struct file *file)
 {
-	return cgroup_pidlist_open(file, false);
+	return cgroup_pidlist_open(file, CGROUP_FILE_TASKS);
 }
 static int cgroup_procs_open(struct inode *unused, struct file *file)
 {
-	return cgroup_pidlist_open(file, true);
+	return cgroup_pidlist_open(file, CGROUP_FILE_PROCS);
 }
 
 static u64 cgroup_read_notify_on_release(struct cgroup *cgrp,



* [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
                   ` (2 preceding siblings ...)
  2009-08-18 23:58 ` [PATCH 3/8] Ensures correct concurrent opening/reading of pidlists across pid namespaces Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  2:37   ` Li Zefan
  2009-08-20 21:14   ` Andrew Morton
  2009-08-18 23:58 ` [PATCH 5/8] Changes css_set freeing mechanism to be under RCU Paul Menage
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Use vmalloc for large cgroups pidlist allocations

Moves all pidlist allocation requests into a single helper that decides,
based on the requested size, whether the array needs to be vmalloc'd or can
be obtained via kmalloc, and likewise for kfree/vfree on release. With 4KB
pages and a 4-byte pid_t, for example, the threshold used below means arrays
of more than 2048 entries fall back to vmalloc.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 kernel/cgroup.c |   47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d207871..7e2b285 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -50,6 +50,7 @@
 #include <linux/smp_lock.h>
 #include <linux/pid_namespace.h>
 #include <linux/idr.h>
+#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
 
 #include <asm/atomic.h>
 
@@ -2351,6 +2352,42 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
  */
 
 /*
+ * The following two functions "fix" the issue where there are more pids
+ * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
+ * TODO: replace with a kernel-wide solution to this problem
+ */
+#define PIDLIST_TOO_LARGE(c) ((c) * sizeof(pid_t) > (PAGE_SIZE * 2))
+static void *pidlist_allocate(int count)
+{
+	if (PIDLIST_TOO_LARGE(count))
+		return vmalloc(count * sizeof(pid_t));
+	else
+		return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
+}
+static void pidlist_free(void *p)
+{
+	if (is_vmalloc_addr(p))
+		vfree(p);
+	else
+		kfree(p);
+}
+static void *pidlist_resize(void *p, int newcount)
+{
+	void *newlist;
+	/* note: if new alloc fails, old p will still be valid either way */
+	if (is_vmalloc_addr(p)) {
+		newlist = vmalloc(newcount * sizeof(pid_t));
+		if (!newlist)
+			return NULL;
+		memcpy(newlist, p, newcount * sizeof(pid_t));
+		vfree(p);
+	} else {
+		newlist = krealloc(p, newcount * sizeof(pid_t), GFP_KERNEL);
+	}
+	return newlist;
+}
+
+/*
  * pidlist_uniq - given a kmalloc()ed list, strip out all duplicate entries
  * If the new stripped list is sufficiently smaller and there's enough memory
  * to allocate a new buffer, will let go of the unneeded memory. Returns the
@@ -2389,7 +2426,7 @@ after:
 	 * we'll just stay with what we've got.
 	 */
 	if (PIDLIST_REALLOC_DIFFERENCE(length, dest)) {
-		newlist = krealloc(list, dest * sizeof(pid_t), GFP_KERNEL);
+		newlist = pidlist_resize(list, dest);
 		if (newlist)
 			*p = newlist;
 	}
@@ -2470,7 +2507,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 	 * show up until sometime later on.
 	 */
 	length = cgroup_task_count(cgrp);
-	array = kmalloc(length * sizeof(pid_t), GFP_KERNEL);
+	array = pidlist_allocate(length);
 	if (!array)
 		return -ENOMEM;
 	/* now, populate the array */
@@ -2494,11 +2531,11 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
 		length = pidlist_uniq(&array, length);
 	l = cgroup_pidlist_find(cgrp, type);
 	if (!l) {
-		kfree(array);
+		pidlist_free(array);
 		return -ENOMEM;
 	}
 	/* store array, freeing old if necessary - lock already held */
-	kfree(l->list);
+	pidlist_free(l->list);
 	l->list = array;
 	l->length = length;
 	l->use_count++;
@@ -2659,7 +2696,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
 		/* we're the last user if refcount is 0; remove and free */
 		list_del(&l->links);
 		mutex_unlock(&l->owner->pidlist_mutex);
-		kfree(l->list);
+		pidlist_free(l->list);
 		put_pid_ns(l->key.ns);
 		up_write(&l->mutex);
 		kfree(l);



* [PATCH 5/8] Changes css_set freeing mechanism to be under RCU
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
                   ` (3 preceding siblings ...)
  2009-08-18 23:58 ` [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  2:38   ` Li Zefan
  2009-08-18 23:58 ` [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time Paul Menage
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 include/linux/cgroup.h |    3 +++
 kernel/cgroup.c        |    8 +++++++-
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 88e8634..3ac78a2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -260,6 +260,9 @@ struct css_set {
 	 * during subsystem registration (at boot time).
 	 */
 	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
+
+	/* For RCU-protected deletion */
+	struct rcu_head rcu_head;
 };
 
 /*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e2b285..9f9e154 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -267,6 +267,12 @@ static struct hlist_head *css_set_hash(struct cgroup_subsys_state *css[])
 	return &css_set_table[index];
 }
 
+static void free_css_set_rcu(struct rcu_head *obj)
+{
+	struct css_set *cg = container_of(obj, struct css_set, rcu_head);
+	kfree(cg);
+}
+
 /* We don't maintain the lists running through each css_set to its
  * task until after the first call to cgroup_iter_start(). This
  * reduces the fork()/exit() overhead for people who have cgroups
@@ -310,7 +316,7 @@ static void __put_css_set(struct css_set *cg, int taskexit)
 	}
 
 	write_unlock(&css_set_lock);
-	kfree(cg);
+	call_rcu(&cg->rcu_head, free_css_set_rcu);
 }
 
 /*



* [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
                   ` (4 preceding siblings ...)
  2009-08-18 23:58 ` [PATCH 5/8] Changes css_set freeing mechanism to be under RCU Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  3:06   ` Li Zefan
  2009-08-20  4:38   ` Matt Helsley
  2009-08-18 23:58 ` [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Paul Menage
  2009-08-18 23:58 ` [PATCH 8/8] Adds ability to move all threads in a process to a new cgroup atomically Paul Menage
  7 siblings, 2 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Lets ss->can_attach and ss->attach do whole threadgroups at a time

This patch alters the ss->can_attach and ss->attach functions to be able to
deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
is a pre-patch to cgroup-procs-writable.patch.)

Currently, the new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently needs
that information for each thread being moved, but if one were added (for
example, one that counts tasks within a group), this interface would need to
be reworked to tell the subsystem the right information.
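
To distill the pattern the hunks below apply to the freezer, cpuset, ns
and cpu subsystems: when threadgroup is true, can_attach runs its
per-task check against every thread in the group under rcu_read_lock(),
bailing out on the first failure. Roughly (function names here are
illustrative, not from the patch):

static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			      struct task_struct *tsk, bool threadgroup)
{
	/* whatever per-task validation the subsystem requires */
	int ret = example_can_attach_task(cgrp, tsk);
	if (ret)
		return ret;
	if (threadgroup) {
		struct task_struct *c;
		rcu_read_lock();
		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
			ret = example_can_attach_task(cgrp, c);
			if (ret) {
				rcu_read_unlock();
				return ret;
			}
		}
		rcu_read_unlock();
	}
	return 0;
}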

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 Documentation/cgroups/cgroups.txt |   12 +++++--
 include/linux/cgroup.h            |    7 ++--
 kernel/cgroup.c                   |    4 +-
 kernel/cgroup_freezer.c           |   15 ++++++++
 kernel/cpuset.c                   |   66 +++++++++++++++++++++++++++++--------
 kernel/ns_cgroup.c                |   16 ++++++++-
 kernel/sched.c                    |   35 ++++++++++++++++++--
 mm/memcontrol.c                   |    3 +-
 security/device_cgroup.c          |    3 +-
 9 files changed, 131 insertions(+), 30 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 4bccfc1..455d4e6 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -521,7 +521,7 @@ rmdir() will fail with it. From this behavior, pre_destroy() can be
 called multiple times against a cgroup.
 
 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	       struct task_struct *task)
+	       struct task_struct *task, bool threadgroup)
 (cgroup_mutex held by caller)
 
 Called prior to moving a task into a cgroup; if the subsystem
@@ -529,14 +529,20 @@ returns an error, this will abort the attach operation.  If a NULL
 task is passed, then a successful result indicates that *any*
 unspecified task can be moved into the cgroup. Note that this isn't
 called on a fork. If this method returns 0 (success) then this should
-remain valid while the caller holds cgroup_mutex.
+remain valid while the caller holds cgroup_mutex. If threadgroup is
+true, then a successful result indicates that all threads in the given
+thread's threadgroup can be moved together.
 
 void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-	    struct cgroup *old_cgrp, struct task_struct *task)
+	    struct cgroup *old_cgrp, struct task_struct *task,
+	    bool threadgroup)
 (cgroup_mutex held by caller)
 
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
+If threadgroup is true, the subsystem should take care of all threads
+in the specified thread's threadgroup. Currently does not support any
+subsystem that might need the old_cgrp for every thread in the group.
 
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3ac78a2..b62bb92 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -425,10 +425,11 @@ struct cgroup_subsys {
 						  struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
-	int (*can_attach)(struct cgroup_subsys *ss,
-			  struct cgroup *cgrp, struct task_struct *tsk);
+	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+			  struct task_struct *tsk, bool threadgroup);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cgrp, struct task_struct *tsk);
+			struct cgroup *old_cgrp, struct task_struct *tsk,
+			bool threadgroup);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
 	void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
 	int (*populate)(struct cgroup_subsys *ss,
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9f9e154..fb3ba27 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1552,7 +1552,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->can_attach) {
-			retval = ss->can_attach(ss, cgrp, tsk);
+			retval = ss->can_attach(ss, cgrp, tsk, false);
 			if (retval)
 				return retval;
 		}
@@ -1590,7 +1590,7 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk);
+			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index fb249e2..59e9ef6 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -159,7 +159,7 @@ static bool is_task_frozen_enough(struct task_struct *task)
  */
 static int freezer_can_attach(struct cgroup_subsys *ss,
 			      struct cgroup *new_cgroup,
-			      struct task_struct *task)
+			      struct task_struct *task, bool threadgroup)
 {
 	struct freezer *freezer;
 
@@ -177,6 +177,19 @@ static int freezer_can_attach(struct cgroup_subsys *ss,
 	if (freezer->state == CGROUP_FROZEN)
 		return -EBUSY;
 
+	if (threadgroup) {
+		struct task_struct *c;
+
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			if (is_task_frozen_enough(c)) {
+				rcu_read_unlock();
+				return -EBUSY;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 7e75a41..b5cb469 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1324,9 +1324,10 @@ static int fmeter_getrate(struct fmeter *fmp)
 static cpumask_var_t cpus_attach;
 
 /* Called by cgroups to determine if a cpuset is usable; cgroup_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys *ss,
-			     struct cgroup *cont, struct task_struct *tsk)
+static int cpuset_can_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+			     struct task_struct *tsk, bool threadgroup)
 {
+	int ret;
 	struct cpuset *cs = cgroup_cs(cont);
 
 	if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
@@ -1343,18 +1344,51 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
 	if (tsk->flags & PF_THREAD_BOUND)
 		return -EINVAL;
 
-	return security_task_setscheduler(tsk, 0, NULL);
+	ret = security_task_setscheduler(tsk, 0, NULL);
+	if (ret)
+		return ret;
+	if (threadgroup) {
+		struct task_struct *c;
+
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			ret = security_task_setscheduler(c, 0, NULL);
+			if (ret) {
+				rcu_read_unlock();
+				return ret;
+			}
+		}
+		rcu_read_unlock();
+	}
+	return 0;
+}
+
+static void cpuset_attach_task(struct task_struct *tsk, nodemask_t *to,
+			       struct cpuset *cs)
+{
+	int err;
+	/*
+	 * can_attach beforehand should guarantee that this doesn't fail.
+	 * TODO: have a better way to handle failure here
+	 */
+	err = set_cpus_allowed_ptr(tsk, cpus_attach);
+	WARN_ON_ONCE(err);
+
+	task_lock(tsk);
+	cpuset_change_task_nodemask(tsk, to);
+	task_unlock(tsk);
+	cpuset_update_task_spread_flag(cs, tsk);
+
 }
 
-static void cpuset_attach(struct cgroup_subsys *ss,
-			  struct cgroup *cont, struct cgroup *oldcont,
-			  struct task_struct *tsk)
+static void cpuset_attach(struct cgroup_subsys *ss, struct cgroup *cont,
+			  struct cgroup *oldcont, struct task_struct *tsk,
+			  bool threadgroup)
 {
 	nodemask_t from, to;
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
-	int err;
 
 	if (cs == &top_cpuset) {
 		cpumask_copy(cpus_attach, cpu_possible_mask);
@@ -1363,15 +1397,19 @@ static void cpuset_attach(struct cgroup_subsys *ss,
 		guarantee_online_cpus(cs, cpus_attach);
 		guarantee_online_mems(cs, &to);
 	}
-	err = set_cpus_allowed_ptr(tsk, cpus_attach);
-	if (err)
-		return;
 
-	task_lock(tsk);
-	cpuset_change_task_nodemask(tsk, &to);
-	task_unlock(tsk);
-	cpuset_update_task_spread_flag(cs, tsk);
+	/* do per-task migration stuff possibly for each in the threadgroup */
+	cpuset_attach_task(tsk, &to, cs);
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			cpuset_attach_task(c, &to, cs);
+		}
+		rcu_read_unlock();
+	}
 
+	/* change mm; only needs to be done once even if threadgroup */
 	from = oldcs->mems_allowed;
 	to = cs->mems_allowed;
 	mm = get_task_mm(tsk);
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 5aa854f..2a5dfec 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -42,8 +42,8 @@ int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
  *       (hence either you are in the same cgroup as task, or in an
  *        ancestor cgroup thereof)
  */
-static int ns_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+static int ns_can_attach(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
+			 struct task_struct *task, bool threadgroup)
 {
 	if (current != task) {
 		if (!capable(CAP_SYS_ADMIN))
@@ -56,6 +56,18 @@ static int ns_can_attach(struct cgroup_subsys *ss,
 	if (!cgroup_is_descendant(new_cgroup, task))
 		return -EPERM;
 
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &task->thread_group, thread_group) {
+			if (!cgroup_is_descendant(new_cgroup, c)) {
+				rcu_read_unlock();
+				return -EPERM;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	return 0;
 }
 
diff --git a/kernel/sched.c b/kernel/sched.c
index a2f8788..b7e3541 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10336,8 +10336,7 @@ cpu_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 static int
-cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-		      struct task_struct *tsk)
+cpu_cgroup_can_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
 #ifdef CONFIG_RT_GROUP_SCHED
 	if (!sched_rt_can_attach(cgroup_tg(cgrp), tsk))
@@ -10347,15 +10346,45 @@ cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 	if (tsk->sched_class != &fair_sched_class)
 		return -EINVAL;
 #endif
+	return 0;
+}
 
+static int
+cpu_cgroup_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
+		      struct task_struct *tsk, bool threadgroup)
+{
+	int retval = cpu_cgroup_can_attach_task(cgrp, tsk);
+	if (retval)
+		return retval;
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			retval = cpu_cgroup_can_attach_task(cgrp, c);
+			if (retval) {
+				rcu_read_unlock();
+				return retval;
+			}
+		}
+		rcu_read_unlock();
+	}
 	return 0;
 }
 
 static void
 cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
-			struct cgroup *old_cont, struct task_struct *tsk)
+		  struct cgroup *old_cont, struct task_struct *tsk,
+		  bool threadgroup)
 {
 	sched_move_task(tsk);
+	if (threadgroup) {
+		struct task_struct *c;
+		rcu_read_lock();
+		list_for_each_entry_rcu(c, &tsk->thread_group, thread_group) {
+			sched_move_task(c);
+		}
+		rcu_read_unlock();
+	}
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4f51885..53498d1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3128,7 +3128,8 @@ static int mem_cgroup_populate(struct cgroup_subsys *ss,
 static void mem_cgroup_move_task(struct cgroup_subsys *ss,
 				struct cgroup *cont,
 				struct cgroup *old_cont,
-				struct task_struct *p)
+				struct task_struct *p,
+				bool threadgroup)
 {
 	mutex_lock(&memcg_tasklist);
 	/*
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index b8186ba..6cf8fd2 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -61,7 +61,8 @@ static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
 struct cgroup_subsys devices_subsys;
 
 static int devcgroup_can_attach(struct cgroup_subsys *ss,
-		struct cgroup *new_cgroup, struct task_struct *task)
+		struct cgroup *new_cgroup, struct task_struct *task,
+		bool threadgroup)
 {
 	if (current != task && !capable(CAP_SYS_ADMIN))
 			return -EPERM;



* [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
                   ` (5 preceding siblings ...)
  2009-08-18 23:58 ` [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  2:39   ` Li Zefan
  2009-08-18 23:58 ` [PATCH 8/8] Adds ability to move all threads in a process to a new cgroup atomically Paul Menage
  7 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

This patch adds an rwsem that lives in a threadgroup's sighand_struct (next to
the sighand's atomic count, to piggyback on its cacheline), and two functions
in kernel/cgroup.c (for now) for easily+safely obtaining and releasing it. If
another part of the kernel later wants to use such a locking mechanism, the
CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and
the other system would both depend on, and the lock/unlock functions could be
moved to sched.c or so.
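
As a rough usage sketch (not part of the patch; 'leader' and
do_threadgroup_work() are made-up stand-ins for the caller's own task
pointer and per-thread step), a threadgroup-wide operation would bracket
its fork-sensitive section like so:

	struct sighand_struct *sighand;
	struct task_struct *c;

	sighand = threadgroup_fork_lock(leader);
	if (!sighand)
		return -ESRCH;	/* every thread in the group is exiting */
	/* no CLONE_THREAD fork can complete in this group past here */
	rcu_read_lock();
	list_for_each_entry_rcu(c, &leader->thread_group, thread_group)
		do_threadgroup_work(c);		/* hypothetical helper */
	rcu_read_unlock();
	threadgroup_fork_unlock(sighand);

The real user of this pair is cgroup_attach_proc() in patch 8/8.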

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 include/linux/cgroup.h    |   14 +++++--
 include/linux/init_task.h |    9 +++++
 include/linux/sched.h     |   15 ++++++++
 kernel/cgroup.c           |   87 ++++++++++++++++++++++++++++++++++++++++++++-
 kernel/fork.c             |    9 +++--
 5 files changed, 125 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b62bb92..642a47f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -30,10 +30,12 @@ extern int cgroup_init(void);
 extern void cgroup_lock(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
-extern void cgroup_fork(struct task_struct *p);
+extern void cgroup_fork(struct task_struct *p, unsigned long clone_flags);
 extern void cgroup_fork_callbacks(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags);
 extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
+			       unsigned long clone_flags);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 
@@ -568,10 +570,14 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
-static inline void cgroup_fork(struct task_struct *p) {}
+static inline void cgroup_fork(struct task_struct *p,
+			       unsigned long clone_flags) {}
 static inline void cgroup_fork_callbacks(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline void cgroup_post_fork(struct task_struct *p,
+				    unsigned long clone_flags) {}
 static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
+static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
+				      unsigned long clone_flags) {}
 
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 20cbdd2..42da017 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -41,7 +41,16 @@ extern struct nsproxy init_nsproxy;
 	INIT_IPC_NS(ipc_ns)						\
 }
 
+#ifdef CONFIG_CGROUPS
+# define INIT_THREADGROUP_FORK_LOCK(sighand)				\
+	.threadgroup_fork_lock = 					\
+		__RWSEM_INITIALIZER(sighand.threadgroup_fork_lock),
+#else
+# define INIT_THREADGROUP_FORK_LOCK(sighand)
+#endif
+
 #define INIT_SIGHAND(sighand) {						\
+	INIT_THREADGROUP_FORK_LOCK(sighand)				\
 	.count		= ATOMIC_INIT(1), 				\
 	.action		= { { { .sa_handler = NULL, } }, },		\
 	.siglock	= __SPIN_LOCK_UNLOCKED(sighand.siglock),	\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e6aa408..f3d5275 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -479,6 +479,21 @@ extern int get_dumpable(struct mm_struct *mm);
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
 struct sighand_struct {
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock is used to prevent any threads in a
+	 * threadgroup from forking with CLONE_THREAD while held for writing,
+	 * used for threadgroup-wide operations that are fork-sensitive. It
+	 * lives here next to sighand.count as a cacheline optimization.
+	 *
+	 * TODO: if anybody besides cgroups uses this lock, change the
+	 * CONFIG_CGROUPS to a higher-up CONFIG_* that the other user and
+	 * cgroups would both depend upon. Also, they'll want to move where
+	 * the readlock happens - it currently lives in kernel/cgroup.c in
+	 * cgroup_{fork,post_fork,fork_failed}().
+	 */
+	struct rw_semaphore	threadgroup_fork_lock;
+#endif
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
 	spinlock_t		siglock;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fb3ba27..d49488f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1529,6 +1529,65 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 
 /**
+ * threadgroup_fork_lock - block all CLONE_THREAD forks in the threadgroup
+ * @tsk: the task whose threadgroup should be locked
+ *
+ * Takes the threadgroup_fork_lock rwsem in the threadgroup's sighand_struct
+ * for writing, by means of searching the threadgroup list for a live thread.
+ * Returns the sighand_struct that should be given to threadgroup_fork_unlock,
+ * or NULL if all threads in the group are exiting and have cleared their
+ * sighand pointers.
+ */
+struct sighand_struct *threadgroup_fork_lock(struct task_struct *tsk)
+{
+	struct sighand_struct *sighand;
+	struct task_struct *p;
+
+	/* tasklist lock protects sighand_struct's disappearance in exit(). */
+	read_lock(&tasklist_lock);
+	if (likely(tsk->sighand)) {
+		/* simple case - check the thread we were given first */
+		sighand = tsk->sighand;
+	} else {
+		sighand = NULL;
+		/*
+		 * tsk is exiting; try to find another thread in the group
+		 * whose sighand pointer is still alive.
+		 */
+		rcu_read_lock();
+		list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
+			if (p->sighand) {
+				sighand = tsk->sighand;
+				break;
+			}
+		}
+		rcu_read_unlock();
+	}
+	/* prevent sighand from vanishing before we let go of tasklist_lock */
+	if (likely(sighand))
+		atomic_inc(&sighand->count);
+
+	/* done searching. */
+	read_unlock(&tasklist_lock);
+
+	if (likely(sighand))
+		down_write(&sighand->threadgroup_fork_lock);
+	return sighand;
+}
+
+/**
+ * threadgroup_fork_unlock - let the threadgroup resume CLONE_THREAD forks.
+ * @sighand: the threadgroup's sighand that threadgroup_fork_lock gave back
+ *
+ * Lets go of the threadgroup_fork_lock, and drops the sighand reference.
+ */
+void threadgroup_fork_unlock(struct sighand_struct *sighand)
+{
+	up_write(&sighand->threadgroup_fork_lock);
+	__cleanup_sighand(sighand);
+}
+
+/**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
  * @tsk: the task to be attached
@@ -3421,8 +3480,10 @@ static struct file_operations proc_cgroupstats_operations = {
  * At the point that cgroup_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  */
-void cgroup_fork(struct task_struct *child)
+void cgroup_fork(struct task_struct *child, unsigned long clone_flags)
 {
+	if (clone_flags & CLONE_THREAD)
+		down_read(&current->sighand->threadgroup_fork_lock);
 	task_lock(current);
 	child->cgroups = current->cgroups;
 	get_css_set(child->cgroups);
@@ -3459,7 +3520,7 @@ void cgroup_fork_callbacks(struct task_struct *child)
  * with the first call to cgroup_iter_start() - to guarantee that the
  * new task ends up on its list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags)
 {
 	if (use_task_css_set_links) {
 		write_lock(&css_set_lock);
@@ -3469,6 +3530,8 @@ void cgroup_post_fork(struct task_struct *child)
 		task_unlock(child);
 		write_unlock(&css_set_lock);
 	}
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->sighand->threadgroup_fork_lock);
 }
 /**
  * cgroup_exit - detach cgroup from exiting task
@@ -3540,6 +3603,26 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 }
 
 /**
+ * cgroup_fork_failed - undo operations for fork failure
+ * @tsk: pointer to task_struct of the child whose fork failed
+ * @run_callbacks: run exit callbacks?
+ *
+ * Description: Undo cgroup operations after cgroup_fork in fork failure.
+ *
+ * We release the read lock that was taken in cgroup_fork(), which would
+ * otherwise have been dropped in cgroup_post_fork() in the success case.
+ * We also need to detach the failed child task from its cgroup, so we
+ * simply wrap cgroup_exit().
+ */
+void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks,
+			unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->sighand->threadgroup_fork_lock);
+	cgroup_exit(tsk, run_callbacks);
+}
+
+/**
  * cgroup_clone - clone the cgroup the given subsystem is attached to
  * @tsk: the task to be moved
  * @subsys: the given subsystem
diff --git a/kernel/fork.c b/kernel/fork.c
index a010918..0db5289 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -806,6 +806,9 @@ static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
 		return -ENOMEM;
 	atomic_set(&sig->count, 1);
 	memcpy(sig->action, current->sighand->action, sizeof(sig->action));
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
 	return 0;
 }
 
@@ -1097,7 +1100,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	cgroup_fork(p);
+	cgroup_fork(p, clone_flags);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
@@ -1315,7 +1318,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, clone_flags);
 	perf_counter_fork(p);
 	return p;
 
@@ -1347,7 +1350,7 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_fork_failed(p, cgroup_callbacks_done, clone_flags);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 8/8] Adds ability to move all threads in a process to a new cgroup atomically
  2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
                   ` (6 preceding siblings ...)
  2009-08-18 23:58 ` [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Paul Menage
@ 2009-08-18 23:58 ` Paul Menage
  2009-08-20  3:00   ` Li Zefan
  7 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-18 23:58 UTC (permalink / raw)
  To: akpm, bblum, lizf; +Cc: linux-kernel, containers

From: Ben Blum <bblum@google.com>


Adds ability to move all threads in a process to a new cgroup atomically

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Cgroups subsystems that need to perform per-thread actions in their
"attach" callback are (currently) responsible for doing their own
synchronization, since this occurs outside of the critical section
that locks against cloning within a thread group.
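
As a hypothetical usage example (the mount point, cgroup name and pids below
are all made up; the documentation hunk further down describes the real
interface), moving an entire multithreaded process becomes a single write:

	# echo 4823 > /dev/cgroup/foo/cgroup.procs	# 4823 is the tgid
	# cat /dev/cgroup/foo/tasks			# every thread moved
	4823
	4824
	4825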

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>

---

 Documentation/cgroups/cgroups.txt |   13 +
 kernel/cgroup.c                   |  422 +++++++++++++++++++++++++++++++++----
 2 files changed, 392 insertions(+), 43 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 455d4e6..3df4b9a 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -9,6 +9,7 @@ Portions Copyright (C) 2004 BULL SA.
 Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
+Modified by Ben Blum <bblum@google.com>
 
 CONTENTS:
 =========
@@ -228,6 +229,7 @@ Each cgroup is represented by a directory in the cgroup file system
 containing the following files describing that cgroup:
 
  - tasks: list of tasks (by pid) attached to that cgroup
+ - cgroup.procs: list of unique tgids in the cgroup
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -374,7 +376,7 @@ Now you want to do something with this cgroup.
 
 In this directory you can find several files:
 # ls
-notify_on_release tasks
+cgroup.procs notify_on_release tasks
 (plus whatever files added by the attached subsystems)
 
 Now attach your shell to this cgroup:
@@ -408,6 +410,15 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+The cgroup.procs file is useful for managing all tasks in a threadgroup at
+once. It works the same way as the tasks file, but moves all tasks in the
+threadgroup with the specified tgid.
+
+Writing the pid of a task that's not the threadgroup leader (i.e., a pid
+that isn't a tgid) is treated as invalid. Writing a '0' to cgroup.procs will
+attach the writing task and all tasks in its threadgroup, but is invalid if
+the writing task is not the leader of the threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d49488f..86ecdfe 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1528,6 +1528,108 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 	return 0;
 }
 
+/*
+ * Return the first subsystem attached to a cgroup's hierarchy, and
+ * its subsystem id.
+ */
+
+static void get_first_subsys(const struct cgroup *cgrp,
+			struct cgroup_subsys_state **css, int *subsys_id)
+{
+	const struct cgroupfs_root *root = cgrp->root;
+	const struct cgroup_subsys *test_ss;
+	BUG_ON(list_empty(&root->subsys_list));
+	test_ss = list_entry(root->subsys_list.next,
+			     struct cgroup_subsys, sibling);
+	if (css) {
+		*css = cgrp->subsys[test_ss->subsys_id];
+		BUG_ON(!*css);
+	}
+	if (subsys_id)
+		*subsys_id = test_ss->subsys_id;
+}
+
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exist. If not set, this function might sleep, and can fail
+ * with -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, int guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/*
+	 * locate or allocate a new css_set for this task. 'guarantee' tells
+	 * us whether or not we are sure that a new css_set already exists;
+	 * in that case, we are not allowed to fail or sleep, as we won't need
+	 * malloc.
+	 */
+	if (guarantee) {
+		/*
+		 * our caller promises us that the css_set we want already
+		 * exists, so we use find_existing_css_set directly.
+		 */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/*
+	 * we cannot move a task that's declared itself as exiting, as once
+	 * PF_EXITING is set, the tsk->cgroups pointer is no longer safe.
+	 */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to
+	 * drop it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * threadgroup_fork_lock - block all CLONE_THREAD forks in the threadgroup
  * @tsk: the task whose threadgroup should be locked
@@ -1597,11 +1699,9 @@ void threadgroup_fork_unlock(struct sighand_struct *sighand)
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1617,75 +1717,307 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 0);
+	if (retval)
+		return retval;
+
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+	}
+
+	synchronize_rcu();
+
 	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
+	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
+	 * is no longer empty.
 	 */
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	return 0;
+}
+
+/*
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list, of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving
+ * the given task to the given cgroup. Returns 0 on success, -ENOMEM if we
+ * run out of memory.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	/* ensure a new css_set will exist for this thread */
 	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
 	if (!newcg)
 		return -ENOMEM;
+	/* add new element to list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
 
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	int subsys_id;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry;
+	/* needed for locking the threadgroup */
+	struct sighand_struct *threadgroup_sighand;
+
+	/* first, make sure this came from a valid tgid */
+	if (!thread_group_leader(leader))
+		return -EINVAL;
+	/*
+	 * check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval)
+				return retval;
+		}
+	}
+
+	get_first_subsys(cgrp, NULL, &subsys_id);
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup(leader, subsys_id);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case only has one css_set for all of them, usually O(n).
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup(tsk, subsys_id);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
 		task_unlock(tsk);
-		put_css_set(newcg);
-		return -ESRCH;
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
 	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
+	rcu_read_unlock();
 
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. Even if the threadgroup leader
+	 * is PF_EXITING, we still proceed to move all of its sub-threads to
+	 * the new cgroup; if everybody is PF_EXITING, we'll just end up doing
+	 * nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup(leader, subsys_id);
+	/* if leader is already there, skip moving him */
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, 1);
+		BUG_ON(retval != 0 && retval != -ESRCH);
 	}
-	write_unlock(&css_set_lock);
+	/*
+	 * now move all the rest of the threads - need to lock against
+	 * possible races with fork().
+	 */
+	threadgroup_sighand = threadgroup_fork_lock(leader);
+	if (unlikely(!threadgroup_sighand)) {
+		/*
+		 * it is okay to have a failure case after the move-the-leader
+		 * code, because we will only have failed here if the leader
+		 * was PF_EXITING, and then task_migrate on him will have done
+		 * nothing.
+		 */
+		retval = -ESRCH;
+		goto list_teardown;
+	}
+	rcu_read_lock();
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup(tsk, subsys_id);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, 1);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	rcu_read_unlock();
+	threadgroup_fork_unlock(threadgroup_sighand);
 
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that
+	 * information.
+	 */
 	for_each_subsys(root, ss) {
 		if (ss->attach)
-			ss->attach(ss, cgrp, oldcgrp, tsk, false);
+			ss->attach(ss, cgrp, oldcgrp, tsk, true);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
-	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
-	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
-	 * is no longer empty.
+	 * step 4: success! ...and cleanup
 	 */
+	synchronize_rcu();
 	cgroup_wakeup_rmdir_waiter(cgrp);
-	return 0;
+	retval = 0;
+list_teardown:
+	/* no longer need the list of css_sets, so get rid of it */
+	while (!list_empty(&newcg_list)) {
+		/* pop from the list */
+		cg_entry = list_first_entry(&newcg_list, struct cg_list_entry,
+					    links);
+		list_del(&cg_entry->links);
+		/* drop the refcount */
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+	/* done! */
+	return retval;
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid,
+			      int attach(struct cgroup *,
+					 struct task_struct *))
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
 		if (!tsk || tsk->flags & PF_EXITING) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
-
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on the group leader, because
+		 * even if another task has different permissions, the group
+		 * leader will have sufficient access to change it.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
@@ -1695,19 +2027,25 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	/*
+	 * Note that the check for whether the task is its threadgroup leader
+	 * is done in cgroup_attach_proc. This means that writing 0 to the
+	 * procs file will only work if the writing task is the leader.
+	 */
+	ret = attach(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
-	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
-	return ret;
+	return attach_task_by_pid(cgrp, pid, cgroup_attach_task);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
+	return attach_task_by_pid(cgrp, tgid, cgroup_attach_proc);
 }
 
 /**
@@ -2866,9 +3204,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-18 23:58 ` [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869 Paul Menage
@ 2009-08-20  2:34   ` Li Zefan
  2009-08-20  2:41     ` Paul Menage
  2009-08-20 21:13   ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Li Zefan @ 2009-08-20  2:34 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> Revert commit 8827c288feb7810185aa7c2e37537202fc709869
> 
> This is in preparation for some clashing cgroups changes that subsume
> the original commit's functionality.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
> --
> 
> The original commit fixed a pid namespace bug which Ben Blum fixed
> independently (in the same way, but with different code) as part of a
> series of patches. I played around with trying to reconcile Ben's
> patch series with Li's patch, but concluded that it was simpler to
> just revert Li's, given that Ben's patch series contained essentially
> the same fix.
> 

Sure, reverting this commit makes things easier, but rebasing this
patchset shouldn't be hard. And that doesn't sound like a good reason
to revert an innocent commit.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/8] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
  2009-08-18 23:58 ` [PATCH 2/8] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Paul Menage
@ 2009-08-20  2:34   ` Li Zefan
  0 siblings, 0 replies; 37+ messages in thread
From: Li Zefan @ 2009-08-20  2:34 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

07:58, Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Adds a read-only "procs" file similar to "tasks" that shows only unique tgids
> 
> struct cgroup used to have a bunch of fields for keeping track of the pidlist
> for the tasks file. Those are now separated into a new struct cgroup_pidlist,
> of which two are had, one for procs and one for tasks. The way the seq_file
> operations are set up is changed so that just the pidlist struct gets passed
> around as the private data.
> 
> Interface example: Suppose a multithreaded process has pid 1000 and other
> threads with ids 1001, 1002, 1003:
> $ cat tasks
> 1000
> 1001
> 1002
> 1003
> $ cat cgroup.procs
> 1000
> $
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>
> 

Acked-by: Li Zefan <lizf@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-18 23:58 ` [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations Paul Menage
@ 2009-08-20  2:37   ` Li Zefan
  2009-08-20  2:41     ` Paul Menage
  2009-08-20 21:14   ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Li Zefan @ 2009-08-20  2:37 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Use vmalloc for large cgroups pidlist allocations
> 
> Separates all pidlist allocation requests to a separate function that judges
> based on the requested size whether or not the array needs to be vmalloced or
> can be gotten via kmalloc, and similar for kfree/vfree.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>
> 

Acked-by: Li Zefan <lizf@cn.fujitsu.com>

> ---
> 
>  kernel/cgroup.c |   47 ++++++++++++++++++++++++++++++++++++++++++-----
>  1 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index d207871..7e2b285 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -50,6 +50,7 @@
>  #include <linux/smp_lock.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/idr.h>
> +#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>  
>  #include <asm/atomic.h>
>  
> @@ -2351,6 +2352,42 @@ int cgroup_scan_tasks(struct cgroup_scanner *scan)
>   */
>  
>  /*
> + * The following two functions "fix" the issue where there are more pids
> + * than kmalloc will give memory for; in such cases, we use vmalloc/vfree.
> + * TODO: replace with a kernel-wide solution to this problem
> + */

I've come to realize the "procs" file will make it much harder to replace
vmalloc/kmalloc with a different solution...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/8] Changes css_set freeing mechanism to be under RCU
  2009-08-18 23:58 ` [PATCH 5/8] Changes css_set freeing mechanism to be under RCU Paul Menage
@ 2009-08-20  2:38   ` Li Zefan
  0 siblings, 0 replies; 37+ messages in thread
From: Li Zefan @ 2009-08-20  2:38 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

07:58, Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Changes css_set freeing mechanism to be under RCU
> 
> This is a prepatch for making the procs file writable. In order to free the
> old css_sets for each task to be moved as they're being moved, the freeing
> mechanism must be RCU-protected, or else we would have to have a call to
> synchronize_rcu() for each task before freeing its old css_set.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>
> 

Acked-by: Li Zefan <lizf@cn.fujitsu.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
  2009-08-18 23:58 ` [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Paul Menage
@ 2009-08-20  2:39   ` Li Zefan
  2009-08-20  2:45     ` Paul Menage
  0 siblings, 1 reply; 37+ messages in thread
From: Li Zefan @ 2009-08-20  2:39 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
> 
> This patch adds an rwsem that lives in a threadgroup's sighand_struct (next to
> the sighand's atomic count, to piggyback on its cacheline), and two functions
> in kernel/cgroup.c (for now) for easily+safely obtaining and releasing it. If
> another part of the kernel later wants to use such a locking mechanism, the
> CONFIG_CGROUPS ifdefs should be changed to a higher-up flag that CGROUPS and
> the other system would both depend on, and the lock/unlock functions could be
> moved to sched.c or so.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>
>

Looks fine to me.

Acked-by: Li Zefan <lizf@cn.fujitsu.com>

... 

> +struct sighand_struct *threadgroup_fork_lock(struct task_struct *tsk)
> +{
> +	struct sighand_struct *sighand;
> +	struct task_struct *p;
> +
> +	/* tasklist lock protects sighand_struct's disappearance in exit(). */
> +	read_lock(&tasklist_lock);
> +	if (likely(tsk->sighand)) {
> +		/* simple case - check the thread we were given first */
> +		sighand = tsk->sighand;
> +	} else {
> +		sighand = NULL;
> +		/*
> +		 * tsk is exiting; try to find another thread in the group
> +		 * whose sighand pointer is still alive.
> +		 */
> +		rcu_read_lock();

since we are holding tasklist_lock, I think we don't need to
take rcu lock?

> +		list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
> +			if (p->sighand) {
> +				sighand = tsk->sighand;

s/tsk->sighand/p->sighand

> +				break;
> +			}
> +		}
> +		rcu_read_unlock();
> +	}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-20  2:37   ` Li Zefan
@ 2009-08-20  2:41     ` Paul Menage
  2009-08-20  2:54       ` Li Zefan
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-20  2:41 UTC (permalink / raw)
  To: Li Zefan; +Cc: akpm, bblum, linux-kernel, containers

On Wed, Aug 19, 2009 at 7:37 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>
> I've come to realize the "procs" file will make it much harder to replace
> vmalloc/kmalloc with a different solution...
>

Why do you think that?

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-20  2:34   ` Li Zefan
@ 2009-08-20  2:41     ` Paul Menage
  2009-08-20  3:11       ` Li Zefan
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-20  2:41 UTC (permalink / raw)
  To: Li Zefan; +Cc: akpm, bblum, linux-kernel, containers

On Wed, Aug 19, 2009 at 7:34 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>
> Sure, reverting this commit makes things easier, but rebasing this
> patchset shouldn't be hard. And that doesn't sound like a good reason
> to revert an innocent commit.
>

The problem is that Ben's patch set starts with a patch that adds the
"procs" file, and then a patch that fixes the namespace bug, and he
and you used different names for variables/functions even though you
were doing essentially the same thing. So rebasing would involve
pretty much entirely rewriting the first patch and ditching the second
- not just a case of fixing up some merge conflicts.

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD  fork()ing per-threadgroup
  2009-08-20  2:39   ` Li Zefan
@ 2009-08-20  2:45     ` Paul Menage
  2009-08-20 21:13       ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Paul Menage @ 2009-08-20  2:45 UTC (permalink / raw)
  To: Li Zefan; +Cc: akpm, bblum, linux-kernel, containers

On Wed, Aug 19, 2009 at 7:39 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>
>> +             list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
>> +                     if (p->sighand) {
>> +                             sighand = tsk->sighand;
>
> s/tsk->sighand/p->sighand

Good catch, thanks. Fixed.

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-20  2:41     ` Paul Menage
@ 2009-08-20  2:54       ` Li Zefan
  2009-08-20  3:04         ` Paul Menage
  0 siblings, 1 reply; 37+ messages in thread
From: Li Zefan @ 2009-08-20  2:54 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> On Wed, Aug 19, 2009 at 7:37 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>> I've come to realize the "procs" file will make it much harder to replace
>> vmalloc/kmalloc with a different solution...
>>
> 
> Why do you think that?
> 

To avoid showing duplicate threadgroup pids, when dealing with a
thread, you have to check backwards to see if its group pid has
already been printed. Maybe it doesn't add much extra cost with your
proposal; we'll see.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/8] Adds ability to move all threads in a process to a new cgroup atomically
  2009-08-18 23:58 ` [PATCH 8/8] Adds ability to move all threads in a process to a new cgroup atomically Paul Menage
@ 2009-08-20  3:00   ` Li Zefan
  0 siblings, 0 replies; 37+ messages in thread
From: Li Zefan @ 2009-08-20  3:00 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Adds ability to move all threads in a process to a new cgroup atomically
> 
> This patch adds functionality that enables users to move all threads in a
> threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
> file. This current implementation makes use of a per-threadgroup rwsem that's
> taken for reading in the fork() path to prevent newly forking threads within
> the threadgroup from "escaping" while the move is in progress.
> 
> Cgroups subsystems that need to perform per-thread actions in their
> "attach" callback are (currently) responsible for doing their own
> synchronization, since this occurs outside of the critical section
> that locks against cloning within a thread group.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>
> 

Acked-by: Li Zefan <lizf@cn.fujitsu.com>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-20  2:54       ` Li Zefan
@ 2009-08-20  3:04         ` Paul Menage
  0 siblings, 0 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-20  3:04 UTC (permalink / raw)
  To: Li Zefan; +Cc: akpm, bblum, linux-kernel, containers

On Wed, Aug 19, 2009 at 7:54 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>
> To avoid showing duplicate threadgroup pids, when dealing with a
> thread, you have to check backwards to see if its group pid has
> already been printed. Maybe it doesn't add much extra cost with your
> proposal; we'll see.

My thought was to explicitly specify the output of cgroup.procs as
"mostly unique" and leave it up to userspace to make it really unique
if necessary. Then we can just skip the tgid if it's the same as the
previous one, and that will give you a unique list in the common case.
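
Something along these lines (just a sketch of that idea, not code from the
series) would do the adjacent-duplicate skipping on an already-sorted pid
array:

	/*
	 * Sketch: collapse adjacent duplicates in a sorted pid array and
	 * return the new length; any non-adjacent duplicates are left for
	 * userspace to filter out.
	 */
	static int pidlist_uniq(pid_t *list, int length)
	{
		int src, dest = 1;

		if (length < 2)
			return length;
		for (src = 1; src < length; src++) {
			if (list[src] == list[src - 1])
				continue;
			list[dest++] = list[src];
		}
		return dest;
	}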

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time
  2009-08-18 23:58 ` [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time Paul Menage
@ 2009-08-20  3:06   ` Li Zefan
  2009-08-20  4:38   ` Matt Helsley
  1 sibling, 0 replies; 37+ messages in thread
From: Li Zefan @ 2009-08-20  3:06 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

07:58, Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Lets ss->can_attach and ss->attach do whole threadgroups at a time
> 
> This patch alters the ss->can_attach and ss->attach functions to be able to
> deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
> is a pre-patch to cgroup-procs-writable.patch.)
> 
> Currently, new mode of the attach function can only tell the subsystem about
> the old cgroup of the threadgroup leader. No subsystem currently needs that
> information for each thread that's being moved, but if one were to be added
> (for example, one that counts tasks within a group) this bit would need to be
> reworked a bit to tell the subsystem the right information.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>
> 

Acked-by: Li Zefan <lizf@cn.fujitsu.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-20  2:41     ` Paul Menage
@ 2009-08-20  3:11       ` Li Zefan
  2009-08-20  3:13         ` Paul Menage
  0 siblings, 1 reply; 37+ messages in thread
From: Li Zefan @ 2009-08-20  3:11 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> On Wed, Aug 19, 2009 at 7:34 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>> Sure, reverting this commit makes things easier, but rebasing this
>> patchset shouldn't be hard. And that doesn't sound like a good reason
>> to revert an innocent commit.
>>
> 
> The problem is that Ben's patch set starts with a patch that adds the
> "procs" file, and then a patch that fixes the namespace bug, and he
> and you used different names for variables/functions even though you
> were doing essentially the same thing. So rebasing would involve
> pretty much entirely rewriting the first patch and ditching the second
> - not just a case of fixing up some merge conflicts.
> 

Here we reintroduce the bug and re-fix it with essentially the same
solution; this lazy way doesn't sound right to me.

As you said, the 2 patches differ just in using different var/func
names, so rebasing shouldn't be hard but maybe a bit boring.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-20  3:11       ` Li Zefan
@ 2009-08-20  3:13         ` Paul Menage
  2009-08-20  3:17           ` Li Zefan
  2009-08-20  4:40           ` Matt Helsley
  0 siblings, 2 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-20  3:13 UTC (permalink / raw)
  To: Li Zefan; +Cc: akpm, bblum, linux-kernel, containers

On Wed, Aug 19, 2009 at 8:11 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>
> As you said, the 2 patches differ just in using different var/func
> names, so rebasing shouldn't be hard but maybe a bit boring.
>

Yes - it's a fair amount of work for no real gain.

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-20  3:13         ` Paul Menage
@ 2009-08-20  3:17           ` Li Zefan
  2009-08-20  4:40           ` Matt Helsley
  1 sibling, 0 replies; 37+ messages in thread
From: Li Zefan @ 2009-08-20  3:17 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, linux-kernel, containers

Paul Menage wrote:
> On Wed, Aug 19, 2009 at 8:11 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>> As you said, the 2 patches differ just in using different var/func
>> names, so rebasing shouldn't be hard but maybe a bit boring.
>>
> 
> Yes - it's a fair amount of work for no real gain.
> 

Ok, I won't insist on how it should be done.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time
  2009-08-18 23:58 ` [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time Paul Menage
  2009-08-20  3:06   ` Li Zefan
@ 2009-08-20  4:38   ` Matt Helsley
  1 sibling, 0 replies; 37+ messages in thread
From: Matt Helsley @ 2009-08-20  4:38 UTC (permalink / raw)
  To: Paul Menage; +Cc: akpm, bblum, lizf, containers, linux-kernel

On Tue, Aug 18, 2009 at 04:58:38PM -0700, Paul Menage wrote:
> From: Ben Blum <bblum@google.com>
> 
> 
> Lets ss->can_attach and ss->attach do whole threadgroups at a time
> 
> This patch alters the ss->can_attach and ss->attach functions to be able to
> deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This
> is a pre-patch to cgroup-procs-writable.patch.)
> 
> Currently, new mode of the attach function can only tell the subsystem about
> the old cgroup of the threadgroup leader. No subsystem currently needs that
> information for each thread that's being moved, but if one were to be added
> (for example, one that counts tasks within a group) this bit would need to be
> reworked a bit to tell the subsystem the right information.
> 
> Signed-off-by: Ben Blum <bblum@google.com>
> Signed-off-by: Paul Menage <menage@google.com>

freezer bits:

Reviewed-by: Matt Helsley <matthltc@us.ibm.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-20  3:13         ` Paul Menage
  2009-08-20  3:17           ` Li Zefan
@ 2009-08-20  4:40           ` Matt Helsley
  2009-08-20  4:49             ` Paul Menage
  1 sibling, 1 reply; 37+ messages in thread
From: Matt Helsley @ 2009-08-20  4:40 UTC (permalink / raw)
  To: Paul Menage; +Cc: Li Zefan, bblum, containers, akpm, linux-kernel

On Wed, Aug 19, 2009 at 08:13:59PM -0700, Paul Menage wrote:
> On Wed, Aug 19, 2009 at 8:11 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> >
> > As you said, the 2 patches differ just in using different var/func
> > names, so rebasing shouldn't be hard but maybe a bit boring.
> >
> 
> Yes - it's a fair amount of work for no real gain.

So this revert won't introduce a regression for any potential tests?

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-20  4:40           ` Matt Helsley
@ 2009-08-20  4:49             ` Paul Menage
  0 siblings, 0 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-20  4:49 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Li Zefan, bblum, containers, akpm, linux-kernel

On Wed, Aug 19, 2009 at 9:40 PM, Matt Helsley<matthltc@us.ibm.com> wrote:
> On Wed, Aug 19, 2009 at 08:13:59PM -0700, Paul Menage wrote:
>> On Wed, Aug 19, 2009 at 8:11 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
>> >
>> > As you said, the 2 patches differ just in using different var/func
>> > names, so rebasing shouldn't be hard but maybe a bit boring.
>> >
>>
>> Yes - it's a fair amount of work for no real gain.
>
> So this revert won't introduce a regression for any potential tests?
>

Yes, there would be a two-commit window where a hypothetical test that
tests for concurrent opens of "tasks" files from different pid
namespaces would fail.

If that was really a big deal, then it wouldn't be too hard to fold
the first three patches in this series down into one, but that would
reduce the clarity of the commit history.

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869
  2009-08-18 23:58 ` [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869 Paul Menage
  2009-08-20  2:34   ` Li Zefan
@ 2009-08-20 21:13   ` Andrew Morton
  1 sibling, 0 replies; 37+ messages in thread
From: Andrew Morton @ 2009-08-20 21:13 UTC (permalink / raw)
  To: Paul Menage; +Cc: bblum, lizf, linux-kernel, containers

On Tue, 18 Aug 2009 16:58:12 -0700
Paul Menage <menage@google.com> wrote:

> Revert commit 8827c288feb7810185aa7c2e37537202fc709869
> 
> This is in preparation for some clashing cgroups changes that subsume
> the original commit's functionality.


Please don't refer to commits via their bare hash number.

This commit actually has a different hash in my tree:

commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
Author:     Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit:     Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700

    cgroups: fix pid namespace bug


And it will definitely have a different hash in linux-next, -stable,
distro trees, etc.

So the conventional way of referring to a commit within a changelog is


096b7fe012d66ed55e98bc8022405ede0cc80e96 ("cgroups: fix pid namespace bug")



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
  2009-08-20  2:45     ` Paul Menage
@ 2009-08-20 21:13       ` Andrew Morton
  2009-08-20 21:16         ` Paul Menage
  2009-08-21  4:24         ` Ben Blum
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2009-08-20 21:13 UTC (permalink / raw)
  To: Paul Menage; +Cc: lizf, bblum, linux-kernel, containers

On Wed, 19 Aug 2009 19:45:11 -0700
Paul Menage <menage@google.com> wrote:

> On Wed, Aug 19, 2009 at 7:39 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> >
> >> + __ __ __ __ __ __ list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
> >> + __ __ __ __ __ __ __ __ __ __ if (p->sighand) {
> >> + __ __ __ __ __ __ __ __ __ __ __ __ __ __ sighand = tsk->sighand;

(^^ who did that?)

> >
> > s/tsk->sighand/p->sighand
> 
> Good catch, thanks. Fixed.
> 

I see no fix anywhere so I did this:

--- a/kernel/cgroup.c~cgroups-add-functionality-to-read-write-lock-clone_thread-forking-per-threadgroup-fix
+++ a/kernel/cgroup.c
@@ -1557,7 +1557,7 @@ struct sighand_struct *threadgroup_fork_
 		rcu_read_lock();
 		list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
 			if (p->sighand) {
-				sighand = tsk->sighand;
+				sighand = p->sighand;
 				break;
 			}
 		}


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-18 23:58 ` [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations Paul Menage
  2009-08-20  2:37   ` Li Zefan
@ 2009-08-20 21:14   ` Andrew Morton
  2009-08-20 22:35     ` Jonathan Corbet
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2009-08-20 21:14 UTC (permalink / raw)
  To: Paul Menage
  Cc: bblum, lizf, linux-kernel, containers, Dave Hansen,
	David Rientjes

On Tue, 18 Aug 2009 16:58:27 -0700
Paul Menage <menage@google.com> wrote:

> From: Ben Blum <bblum@google.com>
> 
> 
> Use vmalloc for large cgroups pidlist allocations
> 
> Separates all pidlist allocation requests to a separate function that judges
> based on the requested size whether or not the array needs to be vmalloced or
> can be gotten via kmalloc, and similar for kfree/vfree.
> 

Hang on.  Isn't this why Dave just wrote and I just rush-merged
lib/flex_array.c?

Was that code evaluated for this application and judged unsuitable?  If so,
for what reason?

If not, can someone please do this?

Ta.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD  fork()ing per-threadgroup
  2009-08-20 21:13       ` Andrew Morton
@ 2009-08-20 21:16         ` Paul Menage
  2009-08-21  4:24         ` Ben Blum
  1 sibling, 0 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-20 21:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lizf, bblum, linux-kernel, containers

On Thu, Aug 20, 2009 at 2:13 PM, Andrew Morton<akpm@linux-foundation.org> wrote:
>> >
>> > s/tsk->sighand/p->sighand
>>
>> Good catch, thanks. Fixed.
>>
>
> I see no fix anywhere so I did this:

I meant fixed in my tree to go out in the next set of patches.


>
> --- a/kernel/cgroup.c~cgroups-add-functionality-to-read-write-lock-clone_thread-forking-per-threadgroup-fix
> +++ a/kernel/cgroup.c
> @@ -1557,7 +1557,7 @@ struct sighand_struct *threadgroup_fork_
>                rcu_read_lock();
>                list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
>                        if (p->sighand) {
> -                               sighand = tsk->sighand;
> +                               sighand = p->sighand;
>                                break;
>                        }
>                }

Yes, looks good.

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-20 21:14   ` Andrew Morton
@ 2009-08-20 22:35     ` Jonathan Corbet
  2009-08-20 22:56       ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Corbet @ 2009-08-20 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Paul Menage, bblum, lizf, linux-kernel, containers, Dave Hansen,
	David Rientjes

On Thu, 20 Aug 2009 14:14:00 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> Hang on.  Isn't this why Dave just wrote and I just rush-merged
> lib/flex_array.c?
> 
> Was that code evaluated for this application and judged unsuitable?  If so,
> for what reason?

Should it be helpful: I wrote an overview of the flex_array API here:

	http://lwn.net/Articles/345273/

I could format it up for addition to Documentation/ if people want.

jon

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-20 22:35     ` Jonathan Corbet
@ 2009-08-20 22:56       ` David Rientjes
  2009-08-21  0:53         ` Li Zefan
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2009-08-20 22:56 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Andrew Morton, Paul Menage, bblum, lizf, linux-kernel, containers,
	Dave Hansen

On Thu, 20 Aug 2009, Jonathan Corbet wrote:

> On Thu, 20 Aug 2009 14:14:00 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Hang on.  Isn't this why Dave just wrote and I just rush-merged
> > lib/flex_array.c?
> > 
> > Was that code evaluated for this application and judged unsuitable?  If so,
> > for what reason?
> 
> Should it be helpful: I wrote an overview of the flex_array API here:
> 
> 	http://lwn.net/Articles/345273/
> 
> I could format it up for addition to Documentation/ if people want.
> 

It's definitely helpful for this use case; a flex array can store 261,632
pid_t's on 64 bit.

Instead of reallocating new memory when the pidlist becomes smaller, it
would then be sufficient to simply shrink the flex array by freeing the
parts whose members are newly unused (which also avoids the oom possibility
in pidlist_resize()).

flex_array_shrink(struct flex_array *) would require element poisoning 
when an element is cleared, but that's probably not far off anyways 
because it has additional advantages such as preventing flex_array_get() 
on an element that has not been put.
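
Very roughly, that interface could look like the sketch below (untested,
and purely illustrative: flex_array_clear(), flex_array_shrink(), the
FLEX_ARRAY_FREE poison value and pidlist_trim() are assumptions here, not
part of lib/flex_array.c as merged; assumes <linux/flex_array.h>):

#define FLEX_ARRAY_FREE	0x6c	/* assumed poison byte for cleared slots */

/* poison one element so a later shrink can recognize it as unused */
int flex_array_clear(struct flex_array *fa, unsigned int element_nr);

/* free every part whose elements are all poison */
int flex_array_shrink(struct flex_array *fa);

/* e.g. trimming a pidlist that shrank from old_count to new_count entries */
static void pidlist_trim(struct flex_array *fa, int old_count, int new_count)
{
	int i;

	for (i = new_count; i < old_count; i++)
		flex_array_clear(fa, i);	/* poison the stale tail */
	flex_array_shrink(fa);			/* drop all-poison parts */
}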

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations
  2009-08-20 22:56       ` David Rientjes
@ 2009-08-21  0:53         ` Li Zefan
  0 siblings, 0 replies; 37+ messages in thread
From: Li Zefan @ 2009-08-21  0:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jonathan Corbet, Andrew Morton, Paul Menage, bblum, linux-kernel,
	containers, Dave Hansen

David Rientjes wrote:
> On Thu, 20 Aug 2009, Jonathan Corbet wrote:
> 
>> On Thu, 20 Aug 2009 14:14:00 -0700
>> Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>>> Hang on.  Isn't this why Dave just wrote and I just rush-merged
>>> lib/flex_array.c?
>>>
>>> Was that code evaluated for this application and judged unsuitable?  If so,
>>> for what reason?
>> Should it be helpful: I wrote an overview of the flex_array API here:
>>
>> 	http://lwn.net/Articles/345273/
>>
>> I could format it up for addition to Documentation/ if people want.
>>
> 
> It's definitely helpful for this use case, flex array can store 261,632 
> pid_t's on 64 bit.
> 

flex_array can surely be used here, or we could implement Paul's proposal:
	http://lkml.org/lkml/2009/7/15/226

Note that both solutions will probably make the output pid list unsorted.
Or maybe we can implement flex_array_sort().
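
Since flex_array keeps its data in separately-allocated page-sized parts
rather than one contiguous buffer, lib/sort.h's sort() can't just be run
over it; a sort would have to go through the accessors. As a rough,
untested sketch (O(n^2) insertion sort, the function name is made up and
nothing like it exists in lib/flex_array.c -- only flex_array_get() and
flex_array_put() are real; error returns are ignored since all touched
slots already have backing parts):

static void pidlist_flex_array_sort(struct flex_array *fa, int count)
{
	int i, j;

	for (i = 1; i < count; i++) {
		pid_t key = *(pid_t *)flex_array_get(fa, i);

		for (j = i - 1; j >= 0; j--) {
			pid_t cur = *(pid_t *)flex_array_get(fa, j);

			if (cur <= key)
				break;
			/* shift the larger entry one slot to the right */
			flex_array_put(fa, j + 1, &cur, GFP_KERNEL);
		}
		flex_array_put(fa, j + 1, &key, GFP_KERNEL);
	}
}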


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
  2009-08-20 21:13       ` Andrew Morton
  2009-08-20 21:16         ` Paul Menage
@ 2009-08-21  4:24         ` Ben Blum
  2009-08-21  4:50           ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Ben Blum @ 2009-08-21  4:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Menage, lizf, linux-kernel, containers

On Thu, Aug 20, 2009 at 02:13:56PM -0700, Andrew Morton wrote:
> On Wed, 19 Aug 2009 19:45:11 -0700
> Paul Menage <menage@google.com> wrote:
> 
> > On Wed, Aug 19, 2009 at 7:39 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> > >
> > >> + __ __ __ __ __ __ list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
> > >> + __ __ __ __ __ __ __ __ __ __ if (p->sighand) {
> > >> + __ __ __ __ __ __ __ __ __ __ __ __ __ __ sighand = tsk->sighand;
> 
> (^^ who did that?)
> 

The tsk->sighand? Likely me. I'm surprised it took that many pairs of eyes
to catch it.

> > >
> > > s/tsk->sighand/p->sighand
> > 
> > Good catch, thanks. Fixed.
> > 
> 
> I see no fix anywhere so I did this:
> 
> --- a/kernel/cgroup.c~cgroups-add-functionality-to-read-write-lock-clone_thread-forking-per-threadgroup-fix
> +++ a/kernel/cgroup.c
> @@ -1557,7 +1557,7 @@ struct sighand_struct *threadgroup_fork_
>  		rcu_read_lock();
>  		list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
>  			if (p->sighand) {
> -				sighand = tsk->sighand;
> +				sighand = p->sighand;
>  				break;
>  			}
>  		}
> 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup
  2009-08-21  4:24         ` Ben Blum
@ 2009-08-21  4:50           ` Andrew Morton
  2009-08-21  4:58             ` Paul Menage
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2009-08-21  4:50 UTC (permalink / raw)
  To: Ben Blum; +Cc: Paul Menage, lizf, linux-kernel, containers

On Fri, 21 Aug 2009 00:24:03 -0400 Ben Blum <bblum@andrew.cmu.edu> wrote:

> On Thu, Aug 20, 2009 at 02:13:56PM -0700, Andrew Morton wrote:
> > On Wed, 19 Aug 2009 19:45:11 -0700
> > Paul Menage <menage@google.com> wrote:
> > 
> > > On Wed, Aug 19, 2009 at 7:39 PM, Li Zefan<lizf@cn.fujitsu.com> wrote:
> > > >
> > > >> + __ __ __ __ __ __ list_for_each_entry_rcu(p, &tsk->thread_group, thread_group) {
> > > >> + __ __ __ __ __ __ __ __ __ __ if (p->sighand) {
> > > >> + __ __ __ __ __ __ __ __ __ __ __ __ __ __ sighand = tsk->sighand;
> > 
> > (^^ who did that?)
> > 
> 
> The tsk->sighand? Likely me. I'm surprised it took that many pairs of eyes
> to catch it.

I was actually referring to the non-ascii crap in the email.

'twas Paul.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD  fork()ing per-threadgroup
  2009-08-21  4:50           ` Andrew Morton
@ 2009-08-21  4:58             ` Paul Menage
  0 siblings, 0 replies; 37+ messages in thread
From: Paul Menage @ 2009-08-21  4:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ben Blum, lizf, linux-kernel, containers

On Thu, Aug 20, 2009 at 9:50 PM, Andrew Morton<akpm@linux-foundation.org> wrote:
>
> I was actually referring to the non-ascii crap in the email.
>
> 'twas Paul.
>

Hmm, so it was. It's sad how much gmail seems to like mangling ascii
patches ...

Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2009-08-21  4:58 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-08-18 23:58 [PATCH 0/8] CGroups: cgroup memberlist enhancements/fixes Paul Menage
2009-08-18 23:58 ` [PATCH 1/8] Revert commit 8827c288feb7810185aa7c2e37537202fc709869 Paul Menage
2009-08-20  2:34   ` Li Zefan
2009-08-20  2:41     ` Paul Menage
2009-08-20  3:11       ` Li Zefan
2009-08-20  3:13         ` Paul Menage
2009-08-20  3:17           ` Li Zefan
2009-08-20  4:40           ` Matt Helsley
2009-08-20  4:49             ` Paul Menage
2009-08-20 21:13   ` Andrew Morton
2009-08-18 23:58 ` [PATCH 2/8] Adds a read-only "procs" file similar to "tasks" that shows only unique tgids Paul Menage
2009-08-20  2:34   ` Li Zefan
2009-08-18 23:58 ` [PATCH 3/8] Ensures correct concurrent opening/reading of pidlists across pid namespaces Paul Menage
2009-08-18 23:58 ` [PATCH 4/8] Use vmalloc for large cgroups pidlist allocations Paul Menage
2009-08-20  2:37   ` Li Zefan
2009-08-20  2:41     ` Paul Menage
2009-08-20  2:54       ` Li Zefan
2009-08-20  3:04         ` Paul Menage
2009-08-20 21:14   ` Andrew Morton
2009-08-20 22:35     ` Jonathan Corbet
2009-08-20 22:56       ` David Rientjes
2009-08-21  0:53         ` Li Zefan
2009-08-18 23:58 ` [PATCH 5/8] Changes css_set freeing mechanism to be under RCU Paul Menage
2009-08-20  2:38   ` Li Zefan
2009-08-18 23:58 ` [PATCH 6/8] Lets ss->can_attach and ss->attach do whole threadgroups at a time Paul Menage
2009-08-20  3:06   ` Li Zefan
2009-08-20  4:38   ` Matt Helsley
2009-08-18 23:58 ` [PATCH 7/8] Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup Paul Menage
2009-08-20  2:39   ` Li Zefan
2009-08-20  2:45     ` Paul Menage
2009-08-20 21:13       ` Andrew Morton
2009-08-20 21:16         ` Paul Menage
2009-08-21  4:24         ` Ben Blum
2009-08-21  4:50           ` Andrew Morton
2009-08-21  4:58             ` Paul Menage
2009-08-18 23:58 ` [PATCH 8/8] Adds ability to move all threads in a process to a new cgroup atomically Paul Menage
2009-08-20  3:00   ` Li Zefan
