[RFC] [PATCH v2 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] [PATCH v2 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs
@ 2010-05-30  1:30 Ben Blum
  2010-05-30  1:31 ` [RFC] [PATCH 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
  2010-05-30  1:33 ` [RFC] [PATCH 2/2] cgroups: make procs file writable Ben Blum
  0 siblings, 2 replies; 19+ messages in thread
From: Ben Blum @ 2010-05-30  1:30 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

This patch series is a revision of http://lkml.org/lkml/2010/1/3/51 and
http://lkml.org/lkml/2010/1/3/52 .

The rwsem in the fork path has been moved to signal_struct to simplify
the locking code in the cgroup_attach_proc side. This depends on Oleg's
recentish changes to signal_struct's lifetime rules (which don't seem to
appear when I check out mmotm with git clone, so I wasn't able to do any
more than basic testing).

There is still a race with exec in the case where the threadgroup leader
changes. To solve this, this implementation checks if the race occurred
after all previous set-up has been done and all necessary locks are
held, and if so, returns -EAGAIN which is handled by the calling
function by looping until a different value is returned.

-- bblum

---
 Documentation/cgroups/cgroups.txt |    9
 include/linux/cgroup.h            |   15 -
 include/linux/init_task.h         |    9
 include/linux/sched.h             |   10
 kernel/cgroup.c                   |  435 +++++++++++++++++++++++++++++++++-----
 kernel/fork.c                     |   10
 6 files changed, 431 insertions(+), 57 deletions(-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC] [PATCH 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup
  2010-05-30  1:30 [RFC] [PATCH v2 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
@ 2010-05-30  1:31 ` Ben Blum
  2010-05-30  1:33 ` [RFC] [PATCH 2/2] cgroups: make procs file writable Ben Blum
  1 sibling, 0 replies; 19+ messages in thread
From: Ben Blum @ 2010-05-30  1:31 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-threadgroup-fork-lock.patch --]
[-- Type: text/plain, Size: 7559 bytes --]

Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds an rwsem that lives in a threadgroup's signal_struct that's
taken for reading in the fork path, under CONFIG_CGROUPS. If another part of
the kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
ifdefs should be changed to a higher-up flag that CGROUPS and the other system
would both depend on.

This is a pre-patch for cgroups-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 include/linux/cgroup.h    |   15 ++++++++++-----
 include/linux/init_task.h |    9 +++++++++
 include/linux/sched.h     |   10 ++++++++++
 kernel/cgroup.c           |   23 +++++++++++++++++++++--
 kernel/fork.c             |   10 +++++++---
 5 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 8f78073..196a703 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -31,10 +31,12 @@ extern void cgroup_lock(void);
 extern int cgroup_lock_is_held(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
-extern void cgroup_fork(struct task_struct *p);
+extern void cgroup_fork(struct task_struct *p, unsigned long clone_flags);
 extern void cgroup_fork_callbacks(struct task_struct *p);
-extern void cgroup_post_fork(struct task_struct *p);
+extern void cgroup_post_fork(struct task_struct *p, unsigned long clone_flags);
 extern void cgroup_exit(struct task_struct *p, int run_callbacks);
+extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
+			       unsigned long clone_flags);
 extern int cgroupstats_build(struct cgroupstats *stats,
 				struct dentry *dentry);
 extern int cgroup_load_subsys(struct cgroup_subsys *ss);
@@ -613,11 +615,14 @@ unsigned short css_depth(struct cgroup_subsys_state *css);
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
-static inline void cgroup_fork(struct task_struct *p) {}
+static inline void cgroup_fork(struct task_struct *p,
+			       unsigned long clone_flags) {}
 static inline void cgroup_fork_callbacks(struct task_struct *p) {}
-static inline void cgroup_post_fork(struct task_struct *p) {}
+static inline void cgroup_post_fork(struct task_struct *p,
+				    unsigned long clone_flags) {}
 static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
-
+static inline void cgroup_fork_failed(struct task_struct *p, int callbacks,
+				      unsigned long clone_flags) {}
 static inline void cgroup_lock(void) {}
 static inline void cgroup_unlock(void) {}
 static inline int cgroupstats_build(struct cgroupstats *stats,
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index b1ed1cd..cfb2bc8 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -15,6 +15,14 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_THREADGROUP_FORK_LOCK(sig)					\
+	.threadgroup_fork_lock =					\
+		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#else
+#define INIT_THREADGROUP_FORK_LOCK(sig)
+#endif
+
 #define INIT_SIGNALS(sig) {						\
 	.count		= ATOMIC_INIT(1), 				\
 	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ -29,6 +37,7 @@ extern struct fs_struct init_fs;
 		.running = 0,						\
 		.lock = __SPIN_LOCK_UNLOCKED(sig.cputimer.lock),	\
 	},								\
+	INIT_THREADGROUP_FORK_LOCK(sig)					\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b7b81d..2bbcbd2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -627,6 +627,16 @@ struct signal_struct {
 	unsigned audit_tty;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+	/*
+	 * The threadgroup_fork_lock prevents threads from forking with
+	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
+	 * threadgroup-wide operations.
+	 * Currently only needed by cgroups, and the fork-side readlock happens
+	 * in cgroup_{fork,post_fork,fork_failed}.
+	 */
+	struct rw_semaphore threadgroup_fork_lock;
+#endif
 
 	int oom_adj;	/* OOM kill score adjustment (bit shift) */
 };
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6d870f2..6c8e46f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4015,8 +4015,10 @@ static const struct file_operations proc_cgroupstats_operations = {
  * At the point that cgroup_fork() is called, 'current' is the parent
  * task, and the passed argument 'child' points to the child task.
  */
-void cgroup_fork(struct task_struct *child)
+void cgroup_fork(struct task_struct *child, unsigned long clone_flags)
 {
+	if (clone_flags & CLONE_THREAD)
+		down_read(&current->signal->threadgroup_fork_lock);
 	task_lock(current);
 	child->cgroups = current->cgroups;
 	get_css_set(child->cgroups);
@@ -4058,7 +4060,7 @@ void cgroup_fork_callbacks(struct task_struct *child)
  * with the first call to cgroup_iter_start() - to guarantee that the
  * new task ends up on its list.
  */
-void cgroup_post_fork(struct task_struct *child)
+void cgroup_post_fork(struct task_struct *child, unsigned long clone_flags)
 {
 	if (use_task_css_set_links) {
 		write_lock(&css_set_lock);
@@ -4068,6 +4070,8 @@ void cgroup_post_fork(struct task_struct *child)
 		task_unlock(child);
 		write_unlock(&css_set_lock);
 	}
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->signal->threadgroup_fork_lock);
 }
 /**
  * cgroup_exit - detach cgroup from exiting task
@@ -4143,6 +4147,21 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 }
 
 /**
+ * cgroup_fork_failed - undo operations for fork failure
+ * @tsk: pointer to  task_struct of exiting process
+ * @run_callback: run exit callbacks?
+ *
+ * Wrapper for cgroup_exit that also drops the fork lock.
+ */
+void cgroup_fork_failed(struct task_struct *tsk, int run_callbacks,
+			unsigned long clone_flags)
+{
+	if (clone_flags & CLONE_THREAD)
+		up_read(&current->signal->threadgroup_fork_lock);
+	cgroup_exit(tsk, run_callbacks);
+}
+
+/**
  * cgroup_clone - clone the cgroup the given subsystem is attached to
  * @tsk: the task to be moved
  * @subsys: the given subsystem
diff --git a/kernel/fork.c b/kernel/fork.c
index 4c14942..e2b04ac 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -884,6 +884,10 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 	tty_audit_fork(sig);
 
+#ifdef CONFIG_CGROUPS
+	init_rwsem(&sig->threadgroup_fork_lock);
+#endif
+
 	sig->oom_adj = current->signal->oom_adj;
 
 	return 0;
@@ -1069,7 +1073,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	monotonic_to_bootbased(&p->real_start_time);
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	cgroup_fork(p);
+	cgroup_fork(p, clone_flags);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
@@ -1277,7 +1281,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cgroup_post_fork(p);
+	cgroup_post_fork(p, clone_flags);
 	perf_event_fork(p);
 	return p;
 
@@ -1311,7 +1315,7 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_cgroup:
 #endif
-	cgroup_exit(p, cgroup_callbacks_done);
+	cgroup_fork_failed(p, cgroup_callbacks_done, clone_flags);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
 bad_fork_cleanup_count:

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-05-30  1:30 [RFC] [PATCH v2 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
  2010-05-30  1:31 ` [RFC] [PATCH 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
@ 2010-05-30  1:33 ` Ben Blum
  2010-05-31 17:52   ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Ben Blum @ 2010-05-30  1:33 UTC (permalink / raw)
  To: linux-kernel, containers
  Cc: akpm, bblum, ebiederm, lizf, matthltc, menage, oleg

[-- Attachment #1: cgroup-procs-writable.patch --]
[-- Type: text/plain, Size: 16683 bytes --]

Makes procs file writable to move all threads by tgid at once

From: Ben Blum <bblum@andrew.cmu.edu>

This patch adds functionality that enables users to move all threads in a
threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs'
file. This current implementation makes use of a per-threadgroup rwsem that's
taken for reading in the fork() path to prevent newly forking threads within
the threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
---
 Documentation/cgroups/cgroups.txt |    9 +
 kernel/cgroup.c                   |  412 +++++++++++++++++++++++++++++++++----
 2 files changed, 374 insertions(+), 47 deletions(-)

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index a1ca592..3f035f5 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -235,7 +235,8 @@ containing the following files describing that cgroup:
  - cgroup.procs: list of tgids in the cgroup.  This list is not
    guaranteed to be sorted or free of duplicate tgids, and userspace
    should sort/uniquify the list if this property is required.
-   This is a read-only file, for now.
+   Writing a thread group id into this file moves all threads in that
+   group into this cgroup.
  - notify_on_release flag: run the release agent on exit?
  - release_agent: the path to use for release notifications (this file
    exists in the top cgroup only)
@@ -416,6 +417,12 @@ You can attach the current shell task by echoing 0:
 
 # echo 0 > tasks
 
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the pid of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+be attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
 2.3 Mounting hierarchies by name
 --------------------------------
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6c8e46f..4d5ea94 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1686,6 +1686,76 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int buflen)
 }
 EXPORT_SYMBOL_GPL(cgroup_path);
 
+/*
+ * cgroup_task_migrate - move a task from one cgroup to another.
+ *
+ * 'guarantee' is set if the caller promises that a new css_set for the task
+ * will already exit. If not set, this function might sleep, and can fail with
+ * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ */
+static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
+			       struct task_struct *tsk, bool guarantee)
+{
+	struct css_set *oldcg;
+	struct css_set *newcg;
+
+	/*
+	 * get old css_set. we need to take task_lock and refcount it, because
+	 * an exiting task can change its css_set to init_css_set and drop its
+	 * old one without taking cgroup_mutex.
+	 */
+	task_lock(tsk);
+	oldcg = tsk->cgroups;
+	get_css_set(oldcg);
+	task_unlock(tsk);
+
+	/* locate or allocate a new css_set for this task. */
+	if (guarantee) {
+		/* we know the css_set we want already exists. */
+		struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+		read_lock(&css_set_lock);
+		newcg = find_existing_css_set(oldcg, cgrp, template);
+		BUG_ON(!newcg);
+		get_css_set(newcg);
+		read_unlock(&css_set_lock);
+	} else {
+		might_sleep();
+		/* find_css_set will give us newcg already referenced. */
+		newcg = find_css_set(oldcg, cgrp);
+		if (!newcg) {
+			put_css_set(oldcg);
+			return -ENOMEM;
+		}
+	}
+	put_css_set(oldcg);
+
+	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	task_lock(tsk);
+	if (tsk->flags & PF_EXITING) {
+		task_unlock(tsk);
+		put_css_set(newcg);
+		return -ESRCH;
+	}
+	rcu_assign_pointer(tsk->cgroups, newcg);
+	task_unlock(tsk);
+
+	/* Update the css_set linked lists if we're using them */
+	write_lock(&css_set_lock);
+	if (!list_empty(&tsk->cg_list))
+		list_move(&tsk->cg_list, &newcg->tasks);
+	write_unlock(&css_set_lock);
+
+	/*
+	 * We just gained a reference on oldcg by taking it from the task. As
+	 * trading it for newcg is protected by cgroup_mutex, we're safe to drop
+	 * it here; it will be freed under RCU.
+	 */
+	put_css_set(oldcg);
+
+	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+	return 0;
+}
+
 /**
  * cgroup_attach_task - attach task 'tsk' to cgroup 'cgrp'
  * @cgrp: the cgroup the task is attaching to
@@ -1696,11 +1766,9 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
-	int retval = 0;
+	int retval;
 	struct cgroup_subsys *ss, *failed_ss = NULL;
 	struct cgroup *oldcgrp;
-	struct css_set *cg;
-	struct css_set *newcg;
 	struct cgroupfs_root *root = cgrp->root;
 
 	/* Nothing to do if the task is already in that cgroup */
@@ -1724,46 +1792,16 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 		}
 	}
 
-	task_lock(tsk);
-	cg = tsk->cgroups;
-	get_css_set(cg);
-	task_unlock(tsk);
-	/*
-	 * Locate or allocate a new css_set for this task,
-	 * based on its final set of cgroups
-	 */
-	newcg = find_css_set(cg, cgrp);
-	put_css_set(cg);
-	if (!newcg) {
-		retval = -ENOMEM;
-		goto out;
-	}
-
-	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		retval = -ESRCH;
+	retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, false);
+	if (retval)
 		goto out;
-	}
-	rcu_assign_pointer(tsk->cgroups, newcg);
-	task_unlock(tsk);
-
-	/* Update the css_set linked lists if we're using them */
-	write_lock(&css_set_lock);
-	if (!list_empty(&tsk->cg_list)) {
-		list_del(&tsk->cg_list);
-		list_add(&tsk->cg_list, &newcg->tasks);
-	}
-	write_unlock(&css_set_lock);
 
 	for_each_subsys(root, ss) {
 		if (ss->attach)
 			ss->attach(ss, cgrp, oldcgrp, tsk, false);
 	}
-	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
+
 	synchronize_rcu();
-	put_css_set(cg);
 
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the cgroup
@@ -1789,28 +1827,300 @@ out:
 }
 
 /*
- * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
- * held. May take task_lock of task
+ * cgroup_attach_proc works in two stages, the first of which prefetches all
+ * new css_sets needed (to make sure we have enough memory before committing
+ * to the move) and stores them in a list of entries of the following type.
+ * TODO: possible optimization: use css_set->rcu_head for chaining instead
+ */
+struct cg_list_entry {
+	struct css_set *cg;
+	struct list_head links;
+};
+
+static bool css_set_check_fetched(struct cgroup *cgrp,
+				  struct task_struct *tsk, struct css_set *cg,
+				  struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+	struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
+
+	read_lock(&css_set_lock);
+	newcg = find_existing_css_set(cg, cgrp, template);
+	if (newcg)
+		get_css_set(newcg);
+	read_unlock(&css_set_lock);
+
+	/* doesn't exist at all? */
+	if (!newcg)
+		return false;
+	/* see if it's already in the list */
+	list_for_each_entry(cg_entry, newcg_list, links) {
+		if (cg_entry->cg == newcg) {
+			put_css_set(newcg);
+			return true;
+		}
+	}
+
+	/* not found */
+	put_css_set(newcg);
+	return false;
+}
+
+/*
+ * Find the new css_set and store it in the list in preparation for moving the
+ * given task to the given cgroup. Returns 0 or -ENOMEM.
+ */
+static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
+			    struct list_head *newcg_list)
+{
+	struct css_set *newcg;
+	struct cg_list_entry *cg_entry;
+
+	/* ensure a new css_set will exist for this thread */
+	newcg = find_css_set(cg, cgrp);
+	if (!newcg)
+		return -ENOMEM;
+	/* add it to the list */
+	cg_entry = kmalloc(sizeof(struct cg_list_entry), GFP_KERNEL);
+	if (!cg_entry) {
+		put_css_set(newcg);
+		return -ENOMEM;
+	}
+	cg_entry->cg = newcg;
+	list_add(&cg_entry->links, newcg_list);
+	return 0;
+}
+
+/**
+ * cgroup_attach_proc - attach all threads in a threadgroup to a cgroup
+ * @cgrp: the cgroup to attach to
+ * @leader: the threadgroup leader task_struct of the group to be attached
+ *
+ * Call holding cgroup_mutex. Will take task_lock of each thread in leader's
+ * threadgroup individually in turn.
+ */
+int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
+{
+	int retval;
+	struct cgroup_subsys *ss, *failed_ss = NULL;
+	struct cgroup *oldcgrp;
+	struct css_set *oldcg;
+	struct cgroupfs_root *root = cgrp->root;
+	/* threadgroup list cursor */
+	struct task_struct *tsk;
+	/*
+	 * we need to make sure we have css_sets for all the tasks we're
+	 * going to move -before- we actually start moving them, so that in
+	 * case we get an ENOMEM we can bail out before making any changes.
+	 */
+	struct list_head newcg_list;
+	struct cg_list_entry *cg_entry, *temp_nobe;
+
+	/*
+	 * Note: Because of possible races with de_thread(), we can't
+	 * distinguish between the case where the user gives a non-leader tid
+	 * and the case where it changes out from under us. So both are allowed.
+	 */
+	leader = leader->group_leader;
+
+	/*
+	 * check that we can legitimately attach to the cgroup.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->can_attach) {
+			retval = ss->can_attach(ss, cgrp, leader, true);
+			if (retval) {
+				failed_ss = ss;
+				goto out;
+			}
+		}
+	}
+
+	/*
+	 * step 1: make sure css_sets exist for all threads to be migrated.
+	 * we use find_css_set, which allocates a new one if necessary.
+	 */
+	INIT_LIST_HEAD(&newcg_list);
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		/* get old css_set */
+		task_lock(leader);
+		if (leader->flags & PF_EXITING) {
+			task_unlock(leader);
+			goto prefetch_loop;
+		}
+		oldcg = leader->cgroups;
+		get_css_set(oldcg);
+		task_unlock(leader);
+		/* acquire new one */
+		retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+		put_css_set(oldcg);
+		if (retval)
+			goto list_teardown;
+	}
+prefetch_loop:
+	rcu_read_lock();
+	/*
+	 * if we need to fetch a new css_set for this task, we must exit the
+	 * rcu_read section because allocating it can sleep. afterwards, we'll
+	 * need to restart iteration on the threadgroup list - the whole thing
+	 * will be O(nm) in the number of threads and css_sets; as the typical
+	 * case has only one css_set for all of them, usually O(n).
+	 */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		/* nothing to do if this task is already in the cgroup */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* get old css_set pointer */
+		task_lock(tsk);
+		if (tsk->flags & PF_EXITING) {
+			/* ignore this task if it's going away */
+			task_unlock(tsk);
+			continue;
+		}
+		oldcg = tsk->cgroups;
+		get_css_set(oldcg);
+		task_unlock(tsk);
+		/* see if the new one for us is already in the list? */
+		if (css_set_check_fetched(cgrp, tsk, oldcg, &newcg_list)) {
+			/* was already there, nothing to do. */
+			put_css_set(oldcg);
+		} else {
+			/* we don't already have it. get new one. */
+			rcu_read_unlock();
+			retval = css_set_prefetch(cgrp, oldcg, &newcg_list);
+			put_css_set(oldcg);
+			if (retval)
+				goto list_teardown;
+			/* begin iteration again. */
+			goto prefetch_loop;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * step 2: now that we're guaranteed success wrt the css_sets, proceed
+	 * to move all tasks to the new cgroup. we need to lock against possible
+	 * races with fork(). note: we can safely access leader->signal because
+	 * attach_task_by_pid takes a reference on leader, which guarantees that
+	 * the signal_struct will stick around. threadgroup_fork_lock must be
+	 * taken outside of tasklist_lock to match the order in the fork path.
+	 */
+	BUG_ON(!leader->signal);
+	down_write(&leader->signal->threadgroup_fork_lock);
+	read_lock(&tasklist_lock);
+	/*
+	 * One final check. It's possible for a race with de_thread() to change
+	 * the group leadership, which makes this operation potentially not safe
+	 * wrt both iterating and the permissions check in attach_task_by_pid.
+	 */
+	if (!thread_group_leader(leader)) {
+		retval = -EAGAIN;
+		read_unlock(&tasklist_lock);
+		up_write(&leader->signal->threadgroup_fork_lock);
+		goto list_teardown;
+	}
+	/*
+	 * No failure cases left, so this is the commit point.
+	 *
+	 * If the leader is already there, skip moving him. Note: even if the
+	 * leader is PF_EXITING, we still move all other threads; if everybody
+	 * is PF_EXITING, we end up doing nothing, which is ok.
+	 */
+	oldcgrp = task_cgroup_from_root(leader, root);
+	if (cgrp != oldcgrp) {
+		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+	/* Now iterate over each thread in the group. */
+	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
+		BUG_ON(tsk->signal != leader->signal);
+		/* leave current thread as it is if it's already there */
+		oldcgrp = task_cgroup_from_root(tsk, root);
+		if (cgrp == oldcgrp)
+			continue;
+		/* we don't care whether these threads are exiting */
+		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
+		BUG_ON(retval != 0 && retval != -ESRCH);
+	}
+
+	/*
+	 * step 3: attach whole threadgroup to each subsystem
+	 * TODO: if ever a subsystem needs to know the oldcgrp for each task
+	 * being moved, this call will need to be reworked to communicate that.
+	 */
+	for_each_subsys(root, ss) {
+		if (ss->attach)
+			ss->attach(ss, cgrp, oldcgrp, leader, true);
+	}
+	/* holding these until here keeps us safe from exec() and fork(). */
+	read_unlock(&tasklist_lock);
+	up_write(&leader->signal->threadgroup_fork_lock);
+
+	/*
+	 * step 4: success! and cleanup
+	 */
+	synchronize_rcu();
+	cgroup_wakeup_rmdir_waiter(cgrp);
+	retval = 0;
+list_teardown:
+	/* clean up the list of prefetched css_sets. */
+	list_for_each_entry_safe(cg_entry, temp_nobe, &newcg_list, links) {
+		list_del(&cg_entry->links);
+		put_css_set(cg_entry->cg);
+		kfree(cg_entry);
+	}
+out:
+	if (retval) {
+		/* same deal as in cgroup_attach_task, with threadgroup=true */
+		for_each_subsys(root, ss) {
+			if (ss == failed_ss)
+				break;
+			if (ss->cancel_attach)
+				ss->cancel_attach(ss, cgrp, leader, true);
+		}
+	}
+	return retval;
+}
+
+/*
+ * Find the task_struct of the task to attach by vpid and pass it along to the
+ * function to attach either it or all tasks in its threadgroup. Will take
+ * cgroup_mutex; may take task_lock of task.
  */
-static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
+static int attach_task_by_pid(struct cgroup *cgrp, u64 pid,
+			      int attach(struct cgroup *, struct task_struct *))
 {
 	struct task_struct *tsk;
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
+	if (!cgroup_lock_live_group(cgrp))
+		return -ENODEV;
+
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
 		if (!tsk || tsk->flags & PF_EXITING) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -ESRCH;
 		}
 
+		/*
+		 * even if we're attaching all tasks in the thread group, we
+		 * only need to check permissions on the group leader, because
+		 * even if another task has different permissions, the group
+		 * leader will have sufficient access to change it.
+		 */
 		tcred = __task_cred(tsk);
 		if (cred->euid &&
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
+			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
@@ -1820,18 +2130,28 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid)
 		get_task_struct(tsk);
 	}
 
-	ret = cgroup_attach_task(cgrp, tsk);
+	ret = attach(cgrp, tsk);
 	put_task_struct(tsk);
+	cgroup_unlock();
 	return ret;
 }
 
 static int cgroup_tasks_write(struct cgroup *cgrp, struct cftype *cft, u64 pid)
 {
+	return attach_task_by_pid(cgrp, pid, cgroup_attach_task);
+}
+
+static int cgroup_procs_write(struct cgroup *cgrp, struct cftype *cft, u64 tgid)
+{
 	int ret;
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-	ret = attach_task_by_pid(cgrp, pid);
-	cgroup_unlock();
+	do {
+		/*
+		 * attach_proc fails with -EAGAIN if threadgroup leadership
+		 * changes in the middle of the operation, in which case we need
+		 * to find the task_struct for the new leader and start over.
+		 */
+		ret = attach_task_by_pid(cgrp, tgid, cgroup_attach_proc);
+	} while (ret == -EAGAIN);
 	return ret;
 }
 
@@ -3167,9 +3487,9 @@ static struct cftype files[] = {
 	{
 		.name = CGROUP_FILE_GENERIC_PREFIX "procs",
 		.open = cgroup_procs_open,
-		/* .write_u64 = cgroup_procs_write, TODO */
+		.write_u64 = cgroup_procs_write,
 		.release = cgroup_pidlist_release,
-		.mode = S_IRUGO,
+		.mode = S_IRUGO | S_IWUSR,
 	},
 	{
 		.name = "notify_on_release",

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-05-30  1:33 ` [RFC] [PATCH 2/2] cgroups: make procs file writable Ben Blum
@ 2010-05-31 17:52   ` Oleg Nesterov
  2010-05-31 18:04     ` Oleg Nesterov
  2010-06-03  4:56     ` Ben Blum
  0 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-05-31 17:52 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage

I only glanced into one function, cgroup_attach_proc(), and some things
look "obviously wrong". Sorry, I can't really read these patches now,
most probably I misunderstood the code...

> +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> +{
> +	int retval;
> +	struct cgroup_subsys *ss, *failed_ss = NULL;
> +	struct cgroup *oldcgrp;
> +	struct css_set *oldcg;
> +	struct cgroupfs_root *root = cgrp->root;
> +	/* threadgroup list cursor */
> +	struct task_struct *tsk;
> +	/*
> +	 * we need to make sure we have css_sets for all the tasks we're
> +	 * going to move -before- we actually start moving them, so that in
> +	 * case we get an ENOMEM we can bail out before making any changes.
> +	 */
> +	struct list_head newcg_list;
> +	struct cg_list_entry *cg_entry, *temp_nobe;
> +
> +	/*
> +	 * Note: Because of possible races with de_thread(), we can't
> +	 * distinguish between the case where the user gives a non-leader tid
> +	 * and the case where it changes out from under us. So both are allowed.
> +	 */

OK, the caller has a reference to the argument, leader,

> +	leader = leader->group_leader;

But why it is safe to use leader->group_leader if we race with exec?

> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {

Even if we didn't change "leader" above, this is not safe in theory.
We already discussed this, list_for_each_rcu(head) is only safe when
we know that "head" itself is valid.

Suppose that this leader exits, then leader->thread_group.next exits
too before we take rcu_read_lock().

> +	oldcgrp = task_cgroup_from_root(leader, root);
> +	if (cgrp != oldcgrp) {
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}
> +	/* Now iterate over each thread in the group. */
> +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> +		BUG_ON(tsk->signal != leader->signal);
> +		/* leave current thread as it is if it's already there */
> +		oldcgrp = task_cgroup_from_root(tsk, root);
> +		if (cgrp == oldcgrp)
> +			continue;
> +		/* we don't care whether these threads are exiting */
> +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> +		BUG_ON(retval != 0 && retval != -ESRCH);
> +	}

This looks strange. Why do we move leader outside of the loop ?
Of course, list_for_each_entry() can't work to move all sub-threads,
but "do while_each_thread()" can.

>From 0/2:
>
> recentish changes to signal_struct's lifetime rules (which don't seem to
> appear when I check out mmotm with git clone,

already in Linus's tree.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-05-31 17:52   ` Oleg Nesterov
@ 2010-05-31 18:04     ` Oleg Nesterov
  2010-06-01 18:57       ` Paul Menage
  2010-06-03  4:56     ` Ben Blum
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2010-05-31 18:04 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage

On 05/31, Oleg Nesterov wrote:
>
> I only glanced into one function, cgroup_attach_proc(), and some things
> look "obviously wrong". Sorry, I can't really read these patches now,
> most probably I misunderstood the code...

And, forgot to mention, I do not understand the PF_EXITING check in
attach_task_by_pid() (and some others).

At first glance, it buys nothing. PF_EXITING can be set right after
the check. And, what if we have the process with the exited group-leader?
In that case we can't use tgid to move this process?

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-05-31 18:04     ` Oleg Nesterov
@ 2010-06-01 18:57       ` Paul Menage
  2010-06-02 14:06         ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Menage @ 2010-06-01 18:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On Mon, May 31, 2010 at 11:04 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> And, forgot to mention, I do not understand the PF_EXITING check in
> attach_task_by_pid() (and some others).
>
> At first glance, it buys nothing. PF_EXITING can be set right after
> the check.

It can, but it's a benign race.

Moving a non-current thread into a cgroup takes task->alloc_lock and
checks for PF_EXITING before manipulating that thread's cgroup links.
The exit procedure sets PF_EXITING and then (somewhat later, but
guaranteed) moves current to the root cgroups while holding
alloc_lock. If a task hasn't set PF_EXITING by the time we check for
it, while holding alloc_lock, it can't enter the cgroup cleanup code
until we've finished moving it to its new cgroup. If it has set
PF_EXITING by that point, it's guaranteed to be moving itself to the
root cgroups and disconnecting from various cgroups structures in the
very near future, so it's fine to refuse to move it to a new cgroup.

Paul

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-01 18:57       ` Paul Menage
@ 2010-06-02 14:06         ` Oleg Nesterov
  2010-06-02 19:53           ` Paul Menage
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2010-06-02 14:06 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On 06/01, Paul Menage wrote:
>
> On Mon, May 31, 2010 at 11:04 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > And, forgot to mention, I do not understand the PF_EXITING check in
> > attach_task_by_pid() (and some others).
> >
> > At first glance, it buys nothing. PF_EXITING can be set right after
> > the check.
>
> It can, but it's a benign race.

Yes,

> Moving a non-current thread into a cgroup takes task->alloc_lock and
> checks for PF_EXITING before manipulating that thread's cgroup links.
> The exit procedure sets PF_EXITING and then (somewhat later, but
> guaranteed) moves current to the root cgroups while holding
> alloc_lock.

Yes sure, I understand this part. cgroup_attach_task() correctly checks
PF_EXITING under task_lock(), this protects from the case when this
task has already passed cgroup_exit() which takes this lock too.

But. This exactly means that the PF_EXITING check in attach_task_by_pid()
is not necessary from the correctness pov (while probably can be considered
as optimization), right?

And,

> so it's fine to refuse to move it to a new cgroup.

I am not sure. It doesn't hurt when we try to move a thread. But if
we want to move the whole thread group, we should proceed even if the
group leader has already exited and thus has PF_EXITING bit set.

That was my question.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 14:06         ` Oleg Nesterov
@ 2010-06-02 19:53           ` Paul Menage
  2010-06-02 20:20             ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Menage @ 2010-06-02 19:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On Wed, Jun 2, 2010 at 7:06 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Yes sure, I understand this part. cgroup_attach_task() correctly checks
> PF_EXITING under task_lock(), this protects from the case when this
> task has already passed cgroup_exit() which takes this lock too.

Right.

>
> But. This exactly means that the PF_EXITING check in attach_task_by_pid()
> is not necessary from the correctness pov (while probably can be considered
> as optimization), right?

For the value of correctness that was relevant when writing that code
originally, I think it's fine. But ...

> I am not sure. It doesn't hurt when we try to move a thread. But if
> we want to move the whole thread group, we should proceed even if the
> group leader has already exited and thus has PF_EXITING bit set.

Hmm, maybe. I could see this being argued both ways. Can the process
hang around indefinitely with the leader as a zombie and the other
threads still running?

It wouldn't be that hard to make it possible to avoid relying on
PF_EXITING as the check - instead we'd have an exiting_css_set object
that have the same pointer set as init_css_set, but would be
distinguishable from it as a task->cgroups pointer by virtue of being
a different object. Then cgroup_exit() can reassign tasks to point to
exiting_css_set rather than init_css_set, and the various checks that
are currently made for (task->flags & PF_EXITING) could check for
(task->cgroups == exiting_css_set) instead. This would allow task
movement further into the exit process.

Paul

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 19:53           ` Paul Menage
@ 2010-06-02 20:20             ` Oleg Nesterov
  2010-06-02 20:31               ` Paul Menage
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2010-06-02 20:20 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On 06/02, Paul Menage wrote:
>
> On Wed, Jun 2, 2010 at 7:06 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I am not sure. It doesn't hurt when we try to move a thread. But if
> > we want to move the whole thread group, we should proceed even if the
> > group leader has already exited and thus has PF_EXITING bit set.
>
> Hmm, maybe. I could see this being argued both ways. Can the process
> hang around indefinitely with the leader as a zombie and the other
> threads still running?

Yes sure. The main thread can exit via sys_exit(), this doesn't affect
the thread group. Of course, I am not saying this is common practice,
perhaps no "important" app does this.

> It wouldn't be that hard to make it possible to avoid relying on
> PF_EXITING as the check - instead we'd have an exiting_css_set object
> that have the same pointer set as init_css_set, but would be
> distinguishable from it as a task->cgroups pointer by virtue of being
> a different object. Then cgroup_exit() can reassign tasks to point to
> exiting_css_set rather than init_css_set, and the various checks that
> are currently made for (task->flags & PF_EXITING) could check for
> (task->cgroups == exiting_css_set) instead. This would allow task
> movement further into the exit process.

It is too late for me to even try to understand the above ;)

But I still can't understand why we can't just remove it. Both
cgroup_attach_task() and cgroup_attach_proc() should handle the
possible races correctly anyway.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 20:20             ` Oleg Nesterov
@ 2010-06-02 20:31               ` Paul Menage
  2010-06-02 20:58                 ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Menage @ 2010-06-02 20:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On Wed, Jun 2, 2010 at 1:20 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> Yes sure. The main thread can exit via sys_exit(), this doesn't affect
> the thread group. Of course, I am not saying this is common practice,
> perhaps no "important" app does this.

This check has been in cgroups for quite a while and no-one's
complained so far...

> But I still can't understand why we can't just remove it. Both
> cgroup_attach_task() and cgroup_attach_proc() should handle the
> possible races correctly anyway.

The "it" that you're proposing to remove is in fact the code that
handles those races.

Anyway, I think this issue is orthogonal to the movability of entire
processes - with Ben's patch, an exited-but-not-reaped group leader is
immovable either via "tasks" or "cgroup.procs". Changing the race
condition handling to allow such movement would be a separate issue.

Paul

Paul

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 20:31               ` Paul Menage
@ 2010-06-02 20:58                 ` Oleg Nesterov
  2010-06-02 21:12                   ` Paul Menage
  2010-06-03  4:40                   ` Ben Blum
  0 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-06-02 20:58 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On 06/02, Paul Menage wrote:
>
> On Wed, Jun 2, 2010 at 1:20 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > Yes sure. The main thread can exit via sys_exit(), this doesn't affect
> > the thread group. Of course, I am not saying this is common practice,
> > perhaps no "important" app does this.
>
> This check has been in cgroups for quite a while and no-one's
> complained so far...

Sure. because currently attach_task_by_pid() works with the single
thread. But Ben's patch reuses this code to call cgroup_attach_proc().

> > But I still can't understand why we can't just remove it. Both
> > cgroup_attach_task() and cgroup_attach_proc() should handle the
> > possible races correctly anyway.
>
> The "it" that you're proposing to remove is in fact the code that
> handles those races.

In that case I confused, and I thought we already agreed that
the PF_EXITING check in attach_task_by_pid() is not strictly needed
for correctness.

Once again, the task can call do_exit() and set PF_EXITING right
after the check.

> Anyway, I think this issue is orthogonal to the movability of entire
> processes - with Ben's patch, an exited-but-not-reaped group leader is
> immovable either via "tasks" or "cgroup.procs".

Hmm. As I said, I didn't really read this patch. But I thought that
cgroup_attach_proc() tries to handle this.

Anyway I agree, this is minor and I never pretended I understand
this code.

Hmm. The usage of ->thread_group in ->can_attach() methods doesn't
look safe to me... but currently bool threadgroup is always false.

Oleg.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 20:58                 ` Oleg Nesterov
@ 2010-06-02 21:12                   ` Paul Menage
  2010-06-02 21:38                     ` Oleg Nesterov
  2010-06-03  4:40                   ` Ben Blum
  1 sibling, 1 reply; 19+ messages in thread
From: Paul Menage @ 2010-06-02 21:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On Wed, Jun 2, 2010 at 1:58 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>> The "it" that you're proposing to remove is in fact the code that
>> handles those races.
>
> In that case I confused, and I thought we already agreed that
> the PF_EXITING check in attach_task_by_pid() is not strictly needed
> for correctness.

Not quite - something is required for correctness, and the PF_EXITING
check provides that correctness, with a very small window (between
setting PF_EXITING and calling cgroup_exit) where we might arguably
have been able to move the thread but decline to do so because it's
simpler not to do so and no-one cares. That's the optimization that I
meant - the data structures are slightly simpler since there's no way
to tell when a task has passed cgroup_exit(), and instead we just see
if they've passed PF_EXITING.

>
> Once again, the task can call do_exit() and set PF_EXITING right
> after the check.

Yes, the important part is that they haven't set it *before* the check
in attach_task_by_pid(). If they have set it before that, then they
could be anywhere in the exit path after PF_EXITING, and we decline to
move them since it's possible that they've already passed
cgroup_exit(). If the exiting task has not yet set PF_EXITING, then it
can't possibly get into the critical section in cgroup_exit() since
attach_task_by_pid() holds task->alloc_lock.

Paul

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 21:12                   ` Paul Menage
@ 2010-06-02 21:38                     ` Oleg Nesterov
  2010-06-02 22:03                       ` Paul Menage
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2010-06-02 21:38 UTC (permalink / raw)
  To: Paul Menage
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On 06/02, Paul Menage wrote:
>
> On Wed, Jun 2, 2010 at 1:58 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >> The "it" that you're proposing to remove is in fact the code that
> >> handles those races.
> >
> > In that case I confused, and I thought we already agreed that
> > the PF_EXITING check in attach_task_by_pid() is not strictly needed
> > for correctness.
>
> Not quite - something is required for correctness, and the PF_EXITING
> check provides that correctness, with a very small window (between
> setting PF_EXITING and calling cgroup_exit) where we might arguably
> have been able to move the thread but decline to do so because it's
> simpler not to do so and no-one cares. That's the optimization that I
> meant - the data structures are slightly simpler since there's no way
> to tell when a task has passed cgroup_exit(), and instead we just see
> if they've passed PF_EXITING.
>
> >
> > Once again, the task can call do_exit() and set PF_EXITING right
> > after the check.
>
> Yes, the important part is that they haven't set it *before* the check
> in attach_task_by_pid(). If they have set it before that, then they
> could be anywhere in the exit path after PF_EXITING, and we decline to
> move them since it's possible that they've already passed
> cgroup_exit(). If the exiting task has not yet set PF_EXITING, then it
> can't possibly get into the critical section in cgroup_exit() since
> attach_task_by_pid() holds task->alloc_lock.

It doesn't ? At least in Linus's tree.

cgroup_attach_task() does, and this time PF_EXITING is understandable.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 21:38                     ` Oleg Nesterov
@ 2010-06-02 22:03                       ` Paul Menage
  2010-06-03  4:44                         ` Ben Blum
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Menage @ 2010-06-02 22:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On Wed, Jun 2, 2010 at 2:38 PM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> cgroup_attach_task() does, and this time PF_EXITING is understandable.
>

Oh, OK, I was talking about the one in cgroup_attach_task(), which is
called by attach_task_by_pid(), and assumed that you were referring to
the same one. I'd forgotten about the pre-check in
attach_task_by_pid(). Yes, that one could probably be removed without
affecting any final outcomes.

Paul

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 20:58                 ` Oleg Nesterov
  2010-06-02 21:12                   ` Paul Menage
@ 2010-06-03  4:40                   ` Ben Blum
  2010-06-03 14:48                     ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Ben Blum @ 2010-06-03  4:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul Menage, Ben Blum, linux-kernel, containers, akpm, ebiederm,
	lizf, matthltc

On Wed, Jun 02, 2010 at 10:58:55PM +0200, Oleg Nesterov wrote:
> Hmm. The usage of ->thread_group in ->can_attach() methods doesn't
> look safe to me... but currently bool threadgroup is always false.
> 
> Oleg.
> 
> 

I recall putting a rcu_read_lock() around that part and being assured
that made it safe. But I don't remember the details. Maybe taking
tasklist_lock is necessary?

-- Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-02 22:03                       ` Paul Menage
@ 2010-06-03  4:44                         ` Ben Blum
  0 siblings, 0 replies; 19+ messages in thread
From: Ben Blum @ 2010-06-03  4:44 UTC (permalink / raw)
  To: Paul Menage
  Cc: Oleg Nesterov, Ben Blum, linux-kernel, containers, akpm, ebiederm,
	lizf, matthltc

On Wed, Jun 02, 2010 at 03:03:49PM -0700, Paul Menage wrote:
> On Wed, Jun 2, 2010 at 2:38 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > cgroup_attach_task() does, and this time PF_EXITING is understandable.
> >
> 
> Oh, OK, I was talking about the one in cgroup_attach_task(), which is
> called by attach_task_by_pid(), and assumed that you were referring to
> the same one. I'd forgotten about the pre-check in
> attach_task_by_pid(). Yes, that one could probably be removed without
> affecting any final outcomes.
> 
> Paul

Yes, I agree. The check should be moved into cgroup_attach_task before
the ss->can_attach calls, so the "optimization" stays in the case where
it's wanted.

And the use of __task_cred after the exiting check seems safe too if the
check is gone - from cred.h: "The caller must make sure task doesn't go
away, either by holding a ref on task or by holding tasklist_lock to
prevent it from being unlinked." So that means it'd be safe to take a
reference, task sets exiting and does everything before unhashing, then
access the cred pointer; since we hold tasklist lock it's the same case.

-- Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-05-31 17:52   ` Oleg Nesterov
  2010-05-31 18:04     ` Oleg Nesterov
@ 2010-06-03  4:56     ` Ben Blum
  2010-06-03 14:43       ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Ben Blum @ 2010-06-03  4:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ben Blum, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc, menage

On Mon, May 31, 2010 at 07:52:42PM +0200, Oleg Nesterov wrote:
> I only glanced into one function, cgroup_attach_proc(), and some things
> look "obviously wrong". Sorry, I can't really read these patches now,
> most probably I misunderstood the code...
> 
> > +int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> > +{
> > +	int retval;
> > +	struct cgroup_subsys *ss, *failed_ss = NULL;
> > +	struct cgroup *oldcgrp;
> > +	struct css_set *oldcg;
> > +	struct cgroupfs_root *root = cgrp->root;
> > +	/* threadgroup list cursor */
> > +	struct task_struct *tsk;
> > +	/*
> > +	 * we need to make sure we have css_sets for all the tasks we're
> > +	 * going to move -before- we actually start moving them, so that in
> > +	 * case we get an ENOMEM we can bail out before making any changes.
> > +	 */
> > +	struct list_head newcg_list;
> > +	struct cg_list_entry *cg_entry, *temp_nobe;
> > +
> > +	/*
> > +	 * Note: Because of possible races with de_thread(), we can't
> > +	 * distinguish between the case where the user gives a non-leader tid
> > +	 * and the case where it changes out from under us. So both are allowed.
> > +	 */
> 
> OK, the caller has a reference to the argument, leader,
> 
> > +	leader = leader->group_leader;
> 
> But why it is safe to use leader->group_leader if we race with exec?

This line means "let's try to find who the leader is", since
attach_task_by_pid doesn't grab it for us. It's not "safe", and we still
check if it's really the leader later (just before the 'commit point').
Note that before this line 'leader' doesn't really mean the leader -
perhaps i should rename the variables :P

But maybe I also want to grab a reference on the new task? I can't
remember whether I need to or not. I'm not sure whether or not I need to
grab an rcu lock, but it doesn't seem necessary because of the commit
point check later on. Plus can_attach takes the rcu lock itself for
iterating if it needs it.

> 
> > +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> 
> Even if we didn't change "leader" above, this is not safe in theory.
> We already discussed this, list_for_each_rcu(head) is only safe when
> we know that "head" itself is valid.
> 
> Suppose that this leader exits, then leader->thread_group.next exits
> too before we take rcu_read_lock().

Why is that a problem? I thought leader->thread_group is supposed to
stay sane as long as leader is the leader.

This looks like it needs a check to see if 'leader' is still really the
leader, but nothing more.

> 
> > +	oldcgrp = task_cgroup_from_root(leader, root);
> > +	if (cgrp != oldcgrp) {
> > +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> > +		BUG_ON(retval != 0 && retval != -ESRCH);
> > +	}
> > +	/* Now iterate over each thread in the group. */
> > +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> > +		BUG_ON(tsk->signal != leader->signal);
> > +		/* leave current thread as it is if it's already there */
> > +		oldcgrp = task_cgroup_from_root(tsk, root);
> > +		if (cgrp == oldcgrp)
> > +			continue;
> > +		/* we don't care whether these threads are exiting */
> > +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> > +		BUG_ON(retval != 0 && retval != -ESRCH);
> > +	}
> 
> This looks strange. Why do we move leader outside of the loop ?
> Of course, list_for_each_entry() can't work to move all sub-threads,
> but "do while_each_thread()" can.

do/while_each_thread oves over all threads in the system, rather than
just the threadgroup... this isn't supposed to be a fast operation, but
that seems like overkill.

> 
> From 0/2:
> >
> > recentish changes to signal_struct's lifetime rules (which don't seem to
> > appear when I check out mmotm with git clone,
> 
> already in Linus's tree.
> 
> Oleg.
> 

-- Ben

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-03  4:56     ` Ben Blum
@ 2010-06-03 14:43       ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-06-03 14:43 UTC (permalink / raw)
  To: Ben Blum; +Cc: linux-kernel, containers, akpm, ebiederm, lizf, matthltc, menage

On 06/03, Ben Blum wrote:
>
> On Mon, May 31, 2010 at 07:52:42PM +0200, Oleg Nesterov wrote:
> >
> > OK, the caller has a reference to the argument, leader,
> >
> > > +	leader = leader->group_leader;
> >
> > But why it is safe to use leader->group_leader if we race with exec?
>
> This line means "let's try to find who the leader is", since
> attach_task_by_pid doesn't grab it for us. It's not "safe", and we still
> check if it's really the leader later (just before the 'commit point').

It is not safe to even dereference this memory, it can point to nowhere.
I do not remember how this patch does "check if it's really the leader later",
but in any case this is too late: iirc at least can_attach(leader) was called.

> Note that before this line 'leader' doesn't really mean the leader -

Yes,

> perhaps i should rename the variables :P
>
> But maybe I also want to grab a reference on the new task?

Of course.

Not that I really understand why this task must be ->group_leader at
this point.

> > > +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> >
> > Even if we didn't change "leader" above, this is not safe in theory.
> > We already discussed this, list_for_each_rcu(head) is only safe when
> > we know that "head" itself is valid.
> >
> > Suppose that this leader exits, then leader->thread_group.next exits
> > too before we take rcu_read_lock().
>
> Why is that a problem? I thought leader->thread_group is supposed to
> stay sane as long as leader is the leader.

Unless we race with exec/exit. Yes, this race is very unlikely.

> This looks like it needs a check to see if 'leader' is still really the
> leader, but nothing more.

Or you can check pid_alive(). Again, you don't really need ->group_leader
to iterate over thread-group.

> > > +	oldcgrp = task_cgroup_from_root(leader, root);
> > > +	if (cgrp != oldcgrp) {
> > > +		retval = cgroup_task_migrate(cgrp, oldcgrp, leader, true);
> > > +		BUG_ON(retval != 0 && retval != -ESRCH);
> > > +	}
> > > +	/* Now iterate over each thread in the group. */
> > > +	list_for_each_entry_rcu(tsk, &leader->thread_group, thread_group) {
> > > +		BUG_ON(tsk->signal != leader->signal);
> > > +		/* leave current thread as it is if it's already there */
> > > +		oldcgrp = task_cgroup_from_root(tsk, root);
> > > +		if (cgrp == oldcgrp)
> > > +			continue;
> > > +		/* we don't care whether these threads are exiting */
> > > +		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
> > > +		BUG_ON(retval != 0 && retval != -ESRCH);
> > > +	}
> >
> > This looks strange. Why do we move leader outside of the loop ?
> > Of course, list_for_each_entry() can't work to move all sub-threads,
> > but "do while_each_thread()" can.
>
> do/while_each_thread oves over all threads in the system, rather than
> just the threadgroup... this isn't supposed to be a fast operation, but
> that seems like overkill.

What are you talking about? ;)

do/while_each_thread iterates over threadgroup.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC] [PATCH 2/2] cgroups: make procs file writable
  2010-06-03  4:40                   ` Ben Blum
@ 2010-06-03 14:48                     ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2010-06-03 14:48 UTC (permalink / raw)
  To: Ben Blum
  Cc: Paul Menage, linux-kernel, containers, akpm, ebiederm, lizf,
	matthltc

On 06/03, Ben Blum wrote:
>
> On Wed, Jun 02, 2010 at 10:58:55PM +0200, Oleg Nesterov wrote:
> > Hmm. The usage of ->thread_group in ->can_attach() methods doesn't
> > look safe to me... but currently bool threadgroup is always false.
>
> I recall putting a rcu_read_lock() around that part and being assured
> that made it safe. But I don't remember the details. Maybe taking
> tasklist_lock is necessary?

rcu_read_lock() is not enough, see another email I sent.

Once again.

	rcu_read_lock()
	list_for_each_rcu(tsk->thread_group)

assumes that at least tsk->thread_group->next can't point to nowhere,
this is not true. This memory can go away _before_ we take rcu lock.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2010-06-03 14:50 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-30  1:30 [RFC] [PATCH v2 0/2] cgroups: implement moving a threadgroup's threads atomically with cgroup.procs Ben Blum
2010-05-30  1:31 ` [RFC] [PATCH 1/2] cgroups: read-write lock CLONE_THREAD forking per threadgroup Ben Blum
2010-05-30  1:33 ` [RFC] [PATCH 2/2] cgroups: make procs file writable Ben Blum
2010-05-31 17:52   ` Oleg Nesterov
2010-05-31 18:04     ` Oleg Nesterov
2010-06-01 18:57       ` Paul Menage
2010-06-02 14:06         ` Oleg Nesterov
2010-06-02 19:53           ` Paul Menage
2010-06-02 20:20             ` Oleg Nesterov
2010-06-02 20:31               ` Paul Menage
2010-06-02 20:58                 ` Oleg Nesterov
2010-06-02 21:12                   ` Paul Menage
2010-06-02 21:38                     ` Oleg Nesterov
2010-06-02 22:03                       ` Paul Menage
2010-06-03  4:44                         ` Ben Blum
2010-06-03  4:40                   ` Ben Blum
2010-06-03 14:48                     ` Oleg Nesterov
2010-06-03  4:56     ` Ben Blum
2010-06-03 14:43       ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).