public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 0/3] CGroups: Hierarchy locking/refcount changes
@ 2008-12-10 23:36 menage
  2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: menage @ 2008-12-10 23:36 UTC (permalink / raw)
  To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm

These patches present an alternative to some of the other cgroups
locking/refcount patches that have been proposed on LKML recently.

Several of those patches have aimed to address the race opened by
moving the calls to the pre_destroy() callbacks outside of cgroup_mutex;
rather than continuing to patch up the holes caused by that change,
these patches introduce new locking/refcount rules that ultimately allow
the previous atomicity of cgroup_rmdir() to be restored.

These three patches give:

1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
      use to prevent changes to its own cgroup tree

2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
      memory controller

3/3 - introduce a css_tryget() function similar to the one proposed by
      Kamezawa, but avoiding spurious refcount failures in the event
      of a race between a css_tryget() and an unsuccessful cgroup_rmdir()

Future patches will likely involve:

- using hierarchy mutex in place of cgroup_lock() in more subsystems
  where appropriate

- restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

Signed-off-by: Paul Menage <menage@google.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
@ 2008-12-10 23:36 ` menage
  2008-12-11  0:37   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
  2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage
  2 siblings, 3 replies; 24+ messages in thread
From: menage @ 2008-12-10 23:36 UTC (permalink / raw)
  To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm

[-- Attachment #1: cgroup_hierarchy_lock.patch --]
[-- Type: text/plain, Size: 5137 bytes --]

This patch adds a hierarchy_mutex to the cgroup_subsys object that
protects changes to the hierarchy observed by that subsystem. It is
taken by the cgroup subsystem (in addition to cgroup_mutex) for the
following operations:

- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
  bind() callback)

Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.

Signed-off-by: Paul Menage <menage@google.com>

---

 Documentation/cgroups/cgroups.txt |    2 +-
 include/linux/cgroup.h            |    9 ++++++++-
 kernel/cgroup.c                   |   37 +++++++++++++++++++++++++++++++++++--
 3 files changed, 44 insertions(+), 4 deletions(-)

Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
+++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
@@ -337,8 +337,15 @@ struct cgroup_subsys {
 #define MAX_CGROUP_TYPE_NAMELEN 32
 	const char *name;
 
-	struct cgroupfs_root *root;
+	/*
+	 * Protects sibling/children links of cgroups in this
+	 * hierarchy, plus protects which hierarchy (or none) the
+	 * subsystem is a part of (i.e. root/sibling)
+	 */
+	struct mutex hierarchy_mutex;
 
+	/* Protected by this->hierarchy_mutex and cgroup_lock() */
+	struct cgroupfs_root *root;
 	struct list_head sibling;
 };
 
Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
+++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
@@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
 			BUG_ON(cgrp->subsys[i]);
 			BUG_ON(!dummytop->subsys[i]);
 			BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
+			mutex_lock(&ss->hierarchy_mutex);
 			cgrp->subsys[i] = dummytop->subsys[i];
 			cgrp->subsys[i]->cgroup = cgrp;
 			list_move(&ss->sibling, &root->subsys_list);
 			ss->root = root;
 			if (ss->bind)
 				ss->bind(ss, cgrp);
-
+			mutex_unlock(&ss->hierarchy_mutex);
 		} else if (bit & removed_bits) {
 			/* We're removing this subsystem */
 			BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
 			BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
+			mutex_lock(&ss->hierarchy_mutex);
 			if (ss->bind)
 				ss->bind(ss, dummytop);
 			dummytop->subsys[i]->cgroup = dummytop;
 			cgrp->subsys[i] = NULL;
 			subsys[i]->root = &rootnode;
 			list_move(&ss->sibling, &rootnode.subsys_list);
+			mutex_unlock(&ss->hierarchy_mutex);
 		} else if (bit & final_bits) {
 			/* Subsystem state should already exist */
 			BUG_ON(!cgrp->subsys[i]);
@@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
 	cgrp->subsys[ss->subsys_id] = css;
 }
 
+static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
+{
+	/* We need to take each hierarchy_mutex in a consistent order */
+	int i;
+
+	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+		struct cgroup_subsys *ss = subsys[i];
+		if (ss->root == root)
+			mutex_lock_nested(&ss->hierarchy_mutex, i);
+	}
+}
+
+static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
+{
+	int i;
+
+	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+		struct cgroup_subsys *ss = subsys[i];
+		if (ss->root == root)
+			mutex_unlock(&ss->hierarchy_mutex);
+	}
+}
+
 /*
  * cgroup_create - create a cgroup
  * @parent: cgroup that will be parent of the new cgroup
@@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup 
 		init_cgroup_css(css, ss, cgrp);
 	}
 
+	cgroup_lock_hierarchy(root);
 	list_add(&cgrp->sibling, &cgrp->parent->children);
+	cgroup_unlock_hierarchy(root);
 	root->number_of_cgroups++;
 
 	err = cgroup_create_dir(cgrp, dentry, mode);
@@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un
 	if (!list_empty(&cgrp->release_list))
 		list_del(&cgrp->release_list);
 	spin_unlock(&release_list_lock);
-	/* delete my sibling from parent->children */
+
+	cgroup_lock_hierarchy(cgrp->root);
+	/* delete this cgroup from parent->children */
 	list_del(&cgrp->sibling);
+	cgroup_unlock_hierarchy(cgrp->root);
+
 	spin_lock(&cgrp->dentry->d_lock);
 	d = dget(cgrp->dentry);
 	spin_unlock(&d->d_lock);
@@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st
 	 * need to invoke fork callbacks here. */
 	BUG_ON(!list_empty(&init_task.tasks));
 
+	mutex_init(&ss->hierarchy_mutex);
 	ss->active = 1;
 }
 
Index: hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/Documentation/cgroups/cgroups.txt
+++ hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
@@ -528,7 +528,7 @@ example in cpusets, no task may attach b
 up.
 
 void bind(struct cgroup_subsys *ss, struct cgroup *root)
-(cgroup_mutex held by caller)
+(cgroup_mutex and ss->hierarchy_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
 and root cgroup. Currently this will only involve movement between

--


* [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
  2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
@ 2008-12-10 23:36 ` menage
  2008-12-11  0:49   ` KAMEZAWA Hiroyuki
  2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage
  2 siblings, 1 reply; 24+ messages in thread
From: menage @ 2008-12-10 23:36 UTC (permalink / raw)
  To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm

[-- Attachment #1: mm-destroy-fix.patch --]
[-- Type: text/plain, Size: 2270 bytes --]

This patch updates the memory controller to use its hierarchy_mutex
rather than calling cgroup_lock() to protect against
cgroup_mkdir()/cgroup_rmdir() occurring in its hierarchy.

Signed-off-by: Paul Menage <menage@google.com>

---
 mm/memcontrol.c |   14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Index: hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/mm/memcontrol.c
+++ hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
@@ -154,7 +154,7 @@ struct mem_cgroup {
 
 	/*
 	 * While reclaiming in a hierarchy, we cache the last child we
-	 * reclaimed from. Protected by cgroup_lock()
+	 * reclaimed from. Protected by hierarchy_mutex
 	 */
 	struct mem_cgroup *last_scanned_child;
 	/*
@@ -529,7 +529,7 @@ unsigned long mem_cgroup_isolate_pages(u
 
 /*
  * This routine finds the DFS walk successor. This routine should be
- * called with cgroup_mutex held
+ * called with hierarchy_mutex held
  */
 static struct mem_cgroup *
 mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
@@ -598,7 +598,7 @@ mem_cgroup_get_first_node(struct mem_cgr
 	/*
 	 * Scan all children under the mem_cgroup mem
 	 */
-	cgroup_lock();
+	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
 	if (list_empty(&root_mem->css.cgroup->children)) {
 		ret = root_mem;
 		goto done;
@@ -619,7 +619,7 @@ mem_cgroup_get_first_node(struct mem_cgr
 
 done:
 	root_mem->last_scanned_child = ret;
-	cgroup_unlock();
+	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
 	return ret;
 }
 
@@ -683,18 +683,16 @@ static int mem_cgroup_hierarchical_recla
 	while (next_mem != root_mem) {
 		if (next_mem->obsolete) {
 			mem_cgroup_put(next_mem);
-			cgroup_lock();
 			next_mem = mem_cgroup_get_first_node(root_mem);
-			cgroup_unlock();
 			continue;
 		}
 		ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
 						   get_swappiness(next_mem));
 		if (mem_cgroup_check_under_limit(root_mem))
 			return 0;
-		cgroup_lock();
+		mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
 		next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
-		cgroup_unlock();
+		mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
 	}
 	return ret;
 }

--


* [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: Hierarchy locking/refcount changes menage
  2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
  2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
@ 2008-12-10 23:36 ` menage
  2008-12-11  0:52   ` KAMEZAWA Hiroyuki
  2008-12-11  5:15   ` KAMEZAWA Hiroyuki
  2 siblings, 2 replies; 24+ messages in thread
From: menage @ 2008-12-10 23:36 UTC (permalink / raw)
  To: kamezawa.hiroyu, balbir, containers; +Cc: linux-kernel, akpm

[-- Attachment #1: cgroup_refcnt.patch --]
[-- Type: text/plain, Size: 6424 bytes --]

This patch adds css_tryget(), which obtains a counted reference on a
CSS.  It is used in situations where the caller has a "weak" reference
to the CSS, i.e. one that does not protect the cgroup from removal via
a reference count, but would instead be cleaned up by a destroy()
callback.

css_tryget() will return true on success, or false if the cgroup is
being removed.

This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
but with the difference that in the event of css_tryget() racing with
a cgroup_rmdir(), css_tryget() will only return false if the cgroup
really does get removed.

This implementation is done by biasing css->refcnt, so that a refcnt
of 1 means "releasable" and 0 means "released or releasing". In the
event of a race, css_tryget() distinguishes between "released" and
"releasing" by checking for the CSS_REMOVED flag in css->flags.

Signed-off-by: Paul Menage <menage@google.com>

---
 include/linux/cgroup.h |   33 ++++++++++++++++++++++-----
 kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 80 insertions(+), 11 deletions(-)

Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
+++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
@@ -52,9 +52,9 @@ struct cgroup_subsys_state {
 	 * hierarchy structure */
 	struct cgroup *cgroup;
 
-	/* State maintained by the cgroup system to allow
-	 * subsystems to be "busy". Should be accessed via css_get()
-	 * and css_put() */
+	/* State maintained by the cgroup system to allow subsystems
+	 * to be "busy". Should be accessed via css_get(),
+	 * css_tryget() and css_put(). */
 
 	atomic_t refcnt;
 
@@ -64,11 +64,14 @@ struct cgroup_subsys_state {
 /* bits in struct cgroup_subsys_state flags field */
 enum {
 	CSS_ROOT, /* This CSS is the root of the subsystem */
+	CSS_REMOVED, /* This CSS is dead */
 };
 
 /*
- * Call css_get() to hold a reference on the cgroup;
- *
+ * Call css_get() to hold a reference on the css; it can be used
+ * for a reference obtained via:
+ * - an existing ref-counted reference to the css
+ * - task->cgroups for a locked task
  */
 
 static inline void css_get(struct cgroup_subsys_state *css)
@@ -77,9 +80,27 @@ static inline void css_get(struct cgroup
 	if (!test_bit(CSS_ROOT, &css->flags))
 		atomic_inc(&css->refcnt);
 }
+
+/*
+ * Call css_tryget() to take a reference on a css if your existing
+ * (known-valid) reference isn't already ref-counted. Returns false if
+ * the css has been destroyed.
+ */
+
+static inline bool css_tryget(struct cgroup_subsys_state *css)
+{
+	if (test_bit(CSS_ROOT, &css->flags))
+		return true;
+	while (!atomic_inc_not_zero(&css->refcnt)) {
+		if (test_bit(CSS_REMOVED, &css->flags))
+			return false;
+	}
+	return true;
+}
+
 /*
  * css_put() should be called to release a reference taken by
- * css_get()
+ * css_get() or css_tryget()
  */
 
 extern void __css_put(struct cgroup_subsys_state *css);
Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
===================================================================
--- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
+++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
@@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
 			       struct cgroup *cgrp)
 {
 	css->cgroup = cgrp;
-	atomic_set(&css->refcnt, 0);
+	atomic_set(&css->refcnt, 1);
 	css->flags = 0;
 	if (cgrp == dummytop)
 		set_bit(CSS_ROOT, &css->flags);
@@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
 {
 	/* Check the reference count on each subsystem. Since we
 	 * already established that there are no tasks in the
-	 * cgroup, if the css refcount is also 0, then there should
+	 * cgroup, if the css refcount is also 1, then there should
 	 * be no outstanding references, so the subsystem is safe to
 	 * destroy. We scan across all subsystems rather than using
 	 * the per-hierarchy linked list of mounted subsystems since
@@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg
 		 * matter, since it can only happen if the cgroup
 		 * has been deleted and hence no longer needs the
 		 * release agent to be called anyway. */
-		if (css && atomic_read(&css->refcnt))
+		if (css && (atomic_read(&css->refcnt) > 1))
 			return 1;
 	}
 	return 0;
 }
 
+/*
+ * Atomically mark all of the cgroup's CSS objects as
+ * CSS_REMOVED. Return true on success, or false if the cgroup has
+ * busy subsystems. Call with cgroup_mutex held
+ */
+
+static int cgroup_clear_css_refs(struct cgroup *cgrp)
+{
+	struct cgroup_subsys *ss;
+	unsigned long flags;
+	bool failed = false;
+	local_irq_save(flags);
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+		int refcnt;
+		do {
+			/* We can only remove a CSS with a refcnt==1 */
+			refcnt = atomic_read(&css->refcnt);
+			if (refcnt > 1) {
+				failed = true;
+				goto done;
+			}
+			BUG_ON(!refcnt);
+			/*
+			 * Drop the refcnt to 0 while we check other
+			 * subsystems. This will cause any racing
+			 * css_tryget() to spin until we set the
+			 * CSS_REMOVED bits or abort
+			 */
+		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
+	}
+ done:
+	for_each_subsys(cgrp->root, ss) {
+		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+		if (failed) {
+			/* Restore old refcnt if necessary */
+			if (!atomic_read(&css->refcnt))
+				atomic_set(&css->refcnt, 1);
+		} else {
+			/* Commit the fact that the CSS is removed */
+			set_bit(CSS_REMOVED, &css->flags);
+		}
+	}
+	local_irq_restore(flags);
+	return !failed;
+}
+
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cgroup *cgrp = dentry->d_fsdata;
@@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un
 
 	if (atomic_read(&cgrp->count)
 	    || !list_empty(&cgrp->children)
-	    || cgroup_has_css_refs(cgrp)) {
+	    || !cgroup_clear_css_refs(cgrp)) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
@@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat
 {
 	struct cgroup *cgrp = css->cgroup;
 	rcu_read_lock();
-	if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
+	if ((atomic_dec_return(&css->refcnt) == 1) &&
+	    notify_on_release(cgrp)) {
 		set_bit(CGRP_RELEASABLE, &cgrp->flags);
 		check_for_release(cgrp);
 	}

--


* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
@ 2008-12-11  0:37   ` KAMEZAWA Hiroyuki
  2008-12-11  0:44     ` Paul Menage
  2008-12-11  6:30     ` Balbir Singh
  2008-12-11  3:05   ` Li Zefan
  2008-12-11  6:29   ` Balbir Singh
  2 siblings, 2 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  0:37 UTC (permalink / raw)
  To: menage; +Cc: balbir, containers, linux-kernel, akpm

Thank you for this work.

On Wed, 10 Dec 2008 15:36:55 -0800
menage@google.com wrote:

> This patch adds a hierarchy_mutex to the cgroup_subsys object that
> protects changes to the hierarchy observed by that subsystem. It is
> taken by the cgroup subsystem (in addition to cgroup_mutex) for the
> following operations:
> 
> - linking a cgroup into that subsystem's cgroup tree
> - unlinking a cgroup from that subsystem's cgroup tree
> - moving the subsystem to/from a hierarchy (including across the
>   bind() callback)
> 
OK, the usage is very clear.

> Thus if the subsystem holds its own hierarchy_mutex, it can safely
> traverse its own hierarchy.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
> ---
> 
>  Documentation/cgroups/cgroups.txt |    2 +-
>  include/linux/cgroup.h            |    9 ++++++++-
>  kernel/cgroup.c                   |   37 +++++++++++++++++++++++++++++++++++--
>  3 files changed, 44 insertions(+), 4 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -337,8 +337,15 @@ struct cgroup_subsys {
>  #define MAX_CGROUP_TYPE_NAMELEN 32
>  	const char *name;
>  
> -	struct cgroupfs_root *root;
> +	/*
> +	 * Protects sibling/children links of cgroups in this
> +	 * hierarchy, plus protects which hierarchy (or none) the
> +	 * subsystem is a part of (i.e. root/sibling)
> +	 */
> +	struct mutex hierarchy_mutex;
>  
> +	/* Protected by this->hierarchy_mutex and cgroup_lock() */
> +	struct cgroupfs_root *root;
>  	struct list_head sibling;
>  };
>  
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
>  			BUG_ON(cgrp->subsys[i]);
>  			BUG_ON(!dummytop->subsys[i]);
>  			BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
> +			mutex_lock(&ss->hierarchy_mutex);
>  			cgrp->subsys[i] = dummytop->subsys[i];
>  			cgrp->subsys[i]->cgroup = cgrp;
>  			list_move(&ss->sibling, &root->subsys_list);
>  			ss->root = root;
>  			if (ss->bind)
>  				ss->bind(ss, cgrp);
> -
> +			mutex_unlock(&ss->hierarchy_mutex);
>  		} else if (bit & removed_bits) {
>  			/* We're removing this subsystem */
>  			BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
>  			BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
> +			mutex_lock(&ss->hierarchy_mutex);
>  			if (ss->bind)
>  				ss->bind(ss, dummytop);
>  			dummytop->subsys[i]->cgroup = dummytop;
>  			cgrp->subsys[i] = NULL;
>  			subsys[i]->root = &rootnode;
>  			list_move(&ss->sibling, &rootnode.subsys_list);
> +			mutex_unlock(&ss->hierarchy_mutex);
>  		} else if (bit & final_bits) {
>  			/* Subsystem state should already exist */
>  			BUG_ON(!cgrp->subsys[i]);
> @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
>  	cgrp->subsys[ss->subsys_id] = css;
>  }
>  
> +static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
> +{
> +	/* We need to take each hierarchy_mutex in a consistent order */
> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_lock_nested(&ss->hierarchy_mutex, i);
> +	}
> +}
> +
> +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
> +{
> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_unlock(&ss->hierarchy_mutex);
> +	}
> +}
> +
Maybe no problem... but I don't like releasing the locks in the same order they were acquired.

	for (i = CGROUP_SUBSYS_COUNT - 1; i >=0; i--)  ?


-Kame




* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-11  0:37   ` KAMEZAWA Hiroyuki
@ 2008-12-11  0:44     ` Paul Menage
  2008-12-11  6:30     ` Balbir Singh
  1 sibling, 0 replies; 24+ messages in thread
From: Paul Menage @ 2008-12-11  0:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 4:37 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
>> +{
>> +     int i;
>> +
>> +     for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
>> +             struct cgroup_subsys *ss = subsys[i];
>> +             if (ss->root == root)
>> +                     mutex_unlock(&ss->hierarchy_mutex);
>> +     }
>> +}
>> +
> Maybe no problem... but I don't like releasing the locks in the same order they were acquired.
>
>        for (i = CGROUP_SUBSYS_COUNT - 1; i >=0; i--)  ?

The order that you release the locks is irrelevant for correctness. In
this case, since the only callers of cgroup_lock_hierarchy() also hold
cgroup_mutex and hence can't race with one another, the order of
locking is irrelevant for correctness too - right now the locking
order is just designed to keep lockdep happy.

I think that the reverse-ordered loop is less readable for no gain.

Paul


* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
@ 2008-12-11  0:49   ` KAMEZAWA Hiroyuki
  2008-12-11  0:52     ` Paul Menage
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  0:49 UTC (permalink / raw)
  To: menage; +Cc: balbir, containers, linux-kernel, akpm

On Wed, 10 Dec 2008 15:36:56 -0800
menage@google.com wrote:

> This patch updates the memory controller to use its hierarchy_mutex
> rather than calling cgroup_lock() to protect against
> cgroup_mkdir()/cgroup_rmdir() occurring in its hierarchy.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
Hmm, can we avoid the following kind of deadlock with this?

Assume subsystems A and B are on the same hierarchy, and A's subsys ID is
smaller than B's. (I'm thinking of memory and cpuset for A and B, but it's
just an example.)

	an operation like rmdir() in somewhere.
		hierarchy_lock for A (acquired)
		hierarchy_lock for B (waiting)

	in subsys A.
		mmap_sem (acquired)
		hierarchy_lock for A (waiting)
	in subsys B.
		hierarchy_lock for B (acquired)
		mmap_sem 	     (waiting)



Thanks,
-Kame

> ---
>  mm/memcontrol.c |   14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/mm/memcontrol.c
> +++ hierarchy_lock-mmotm-2008-12-09/mm/memcontrol.c
> @@ -154,7 +154,7 @@ struct mem_cgroup {
>  
>  	/*
>  	 * While reclaiming in a hierarchy, we cache the last child we
> -	 * reclaimed from. Protected by cgroup_lock()
> +	 * reclaimed from. Protected by hierarchy_mutex
>  	 */
>  	struct mem_cgroup *last_scanned_child;
>  	/*
> @@ -529,7 +529,7 @@ unsigned long mem_cgroup_isolate_pages(u
>  
>  /*
>   * This routine finds the DFS walk successor. This routine should be
> - * called with cgroup_mutex held
> + * called with hierarchy_mutex held
>   */
>  static struct mem_cgroup *
>  mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> @@ -598,7 +598,7 @@ mem_cgroup_get_first_node(struct mem_cgr
>  	/*
>  	 * Scan all children under the mem_cgroup mem
>  	 */
> -	cgroup_lock();
> +	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
>  	if (list_empty(&root_mem->css.cgroup->children)) {
>  		ret = root_mem;
>  		goto done;
> @@ -619,7 +619,7 @@ mem_cgroup_get_first_node(struct mem_cgr
>  
>  done:
>  	root_mem->last_scanned_child = ret;
> -	cgroup_unlock();
> +	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
>  	return ret;
>  }
>  
> @@ -683,18 +683,16 @@ static int mem_cgroup_hierarchical_recla
>  	while (next_mem != root_mem) {
>  		if (next_mem->obsolete) {
>  			mem_cgroup_put(next_mem);
> -			cgroup_lock();
>  			next_mem = mem_cgroup_get_first_node(root_mem);
> -			cgroup_unlock();
>  			continue;
>  		}
>  		ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
>  						   get_swappiness(next_mem));
>  		if (mem_cgroup_check_under_limit(root_mem))
>  			return 0;
> -		cgroup_lock();
> +		mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
>  		next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> -		cgroup_unlock();
> +		mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
>  	}
>  	return ret;
>  }
> 
> --
> 



* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage
@ 2008-12-11  0:52   ` KAMEZAWA Hiroyuki
  2008-12-11  7:02     ` [RFC][PATCH]example: use css_tryget() in memcg " KAMEZAWA Hiroyuki
  2008-12-11  5:15   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  0:52 UTC (permalink / raw)
  To: menage; +Cc: balbir, containers, linux-kernel, akpm

On Wed, 10 Dec 2008 15:36:57 -0800
menage@google.com wrote:

> This patch adds css_tryget(), which obtains a counted reference on a
> CSS.  It is used in situations where the caller has a "weak" reference
> to the CSS, i.e. one that does not protect the cgroup from removal via
> a reference count, but would instead be cleaned up by a destroy()
> callback.
> 
> css_tryget() will return true on success, or false if the cgroup is
> being removed.
> 
> This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
> but with the difference that in the event of css_tryget() racing with
> a cgroup_rmdir(), css_tryget() will only return false if the cgroup
> really does get removed.
> 
> This implementation is done by biasing css->refcnt, so that a refcnt
> of 1 means "releasable" and 0 means "released or releasing". In the
> event of a race, css_tryget() distinguishes between "released" and
> "releasing" by checking for the CSS_REMOVED flag in css->flags.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
thx, I'll try this.

-Kame



> ---
>  include/linux/cgroup.h |   33 ++++++++++++++++++++++-----
>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 80 insertions(+), 11 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -52,9 +52,9 @@ struct cgroup_subsys_state {
>  	 * hierarchy structure */
>  	struct cgroup *cgroup;
>  
> -	/* State maintained by the cgroup system to allow
> -	 * subsystems to be "busy". Should be accessed via css_get()
> -	 * and css_put() */
> +	/* State maintained by the cgroup system to allow subsystems
> +	 * to be "busy". Should be accessed via css_get(),
> +	 * css_tryget() and css_put(). */
>  
>  	atomic_t refcnt;
>  
> @@ -64,11 +64,14 @@ struct cgroup_subsys_state {
>  /* bits in struct cgroup_subsys_state flags field */
>  enum {
>  	CSS_ROOT, /* This CSS is the root of the subsystem */
> +	CSS_REMOVED, /* This CSS is dead */
>  };
>  
>  /*
> - * Call css_get() to hold a reference on the cgroup;
> - *
> + * Call css_get() to hold a reference on the css; it can be used
> + * for a reference obtained via:
> + * - an existing ref-counted reference to the css
> + * - task->cgroups for a locked task
>   */
>  
>  static inline void css_get(struct cgroup_subsys_state *css)
> @@ -77,9 +80,27 @@ static inline void css_get(struct cgroup
>  	if (!test_bit(CSS_ROOT, &css->flags))
>  		atomic_inc(&css->refcnt);
>  }
> +
> +/*
> + * Call css_tryget() to take a reference on a css if your existing
> + * (known-valid) reference isn't already ref-counted. Returns false if
> + * the css has been destroyed.
> + */
> +
> +static inline bool css_tryget(struct cgroup_subsys_state *css)
> +{
> +	if (test_bit(CSS_ROOT, &css->flags))
> +		return true;
> +	while (!atomic_inc_not_zero(&css->refcnt)) {
> +		if (test_bit(CSS_REMOVED, &css->flags))
> +			return false;
> +	}
> +	return true;
> +}
> +
>  /*
>   * css_put() should be called to release a reference taken by
> - * css_get()
> + * css_get() or css_tryget()
>   */
>  
>  extern void __css_put(struct cgroup_subsys_state *css);
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
>  			       struct cgroup *cgrp)
>  {
>  	css->cgroup = cgrp;
> -	atomic_set(&css->refcnt, 0);
> +	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
> @@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
>  {
>  	/* Check the reference count on each subsystem. Since we
>  	 * already established that there are no tasks in the
> -	 * cgroup, if the css refcount is also 0, then there should
> +	 * cgroup, if the css refcount is also 1, then there should
>  	 * be no outstanding references, so the subsystem is safe to
>  	 * destroy. We scan across all subsystems rather than using
>  	 * the per-hierarchy linked list of mounted subsystems since
> @@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg
>  		 * matter, since it can only happen if the cgroup
>  		 * has been deleted and hence no longer needs the
>  		 * release agent to be called anyway. */
> -		if (css && atomic_read(&css->refcnt))
> +		if (css && (atomic_read(&css->refcnt) > 1))
>  			return 1;
>  	}
>  	return 0;
>  }
>  
> +/*
> + * Atomically mark all of the cgroup's CSS objects as
> + * CSS_REMOVED. Return true on success, or false if the cgroup has
> + * busy subsystems. Call with cgroup_mutex held
> + */
> +
> +static int cgroup_clear_css_refs(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	unsigned long flags;
> +	bool failed = false;
> +	local_irq_save(flags);
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		int refcnt;
> +		do {
> +			/* We can only remove a CSS with a refcnt==1 */
> +			refcnt = atomic_read(&css->refcnt);
> +			if (refcnt > 1) {
> +				failed = true;
> +				goto done;
> +			}
> +			BUG_ON(!refcnt);
> +			/*
> +			 * Drop the refcnt to 0 while we check other
> +			 * subsystems. This will cause any racing
> +			 * css_tryget() to spin until we set the
> +			 * CSS_REMOVED bits or abort
> +			 */
> +		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
> +	}
> + done:
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		if (failed) {
> +			/* Restore old refcnt if necessary */
> +			if (!atomic_read(&css->refcnt))
> +				atomic_set(&css->refcnt, 1);
> +		} else {
> +			/* Commit the fact that the CSS is removed */
> +			set_bit(CSS_REMOVED, &css->flags);
> +		}
> +	}
> +	local_irq_restore(flags);
> +	return !failed;
> +}
> +
>  static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  {
>  	struct cgroup *cgrp = dentry->d_fsdata;
> @@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un
>  
>  	if (atomic_read(&cgrp->count)
>  	    || !list_empty(&cgrp->children)
> -	    || cgroup_has_css_refs(cgrp)) {
> +	    || !cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat
>  {
>  	struct cgroup *cgrp = css->cgroup;
>  	rcu_read_lock();
> -	if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
> +	if ((atomic_dec_return(&css->refcnt) == 1) &&
> +	    notify_on_release(cgrp)) {
>  		set_bit(CGRP_RELEASABLE, &cgrp->flags);
>  		check_for_release(cgrp);
>  	}
> 
> --
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory  controller
  2008-12-11  0:49   ` KAMEZAWA Hiroyuki
@ 2008-12-11  0:52     ` Paul Menage
  2008-12-11  1:05       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Menage @ 2008-12-11  0:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>        an operation like rmdir() in somewhere.
>                hierarchy_lock for A (acquired)
>                hierarchy_lock for B (waiting)
>
>        in subsys A.
>                mmap_sem (acquired)
>                hierarchy_lock for A (waiting)
>        in subsys B.
>                hierarchy_lock for B (acquired)
>                mmap_sem             (waiting)
>

That's a valid deadlock - you'd need to require that mmap_sem nest
either inside all hierarchy_mutexes or else outside all of them.

Paul


* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory  controller
  2008-12-11  0:52     ` Paul Menage
@ 2008-12-11  1:05       ` KAMEZAWA Hiroyuki
  2008-12-11  6:33         ` Balbir Singh
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  1:05 UTC (permalink / raw)
  To: Paul Menage; +Cc: balbir, containers, linux-kernel, akpm

On Wed, 10 Dec 2008 16:52:57 -0800
Paul Menage <menage@google.com> wrote:

> On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >        an operation like rmdir() in somewhere.
> >                hierarchy_lock for A (acquired)
> >                hierarchy_lock for B (waiting)
> >
> >        in subsys A.
> >                mmap_sem (acquired)
> >                hierarchy_lock for A (waiting)
> >        in subsys B.
> >                hierarchy_lock for B (acquired)
> >                mmap_sem             (waiting)
> >
> 
> That's a valid deadlock - you'd need to require the mmap_sem nests
> either inside all hierarchy_mutexes or else outside all of them.
> 
This was a deadlock found between memcg and cpuset.

Another one was:

	an operation like rmdir() in somewhere.
		hierarchy_lock for memcg (acquired)
		hierarchy_lock for B (waiting)

	in subsys B.
		hierarchy_lock for B (acquired)
		have to do some memory reclaim -> hierarchy_lock for memcg (waiting)

I have no objections to hierarchy_lock itself, but the calling context into
memcg is very complicated, and a simple replacement of these locks will be
only a small help.

-Kame



* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
  2008-12-11  0:37   ` KAMEZAWA Hiroyuki
@ 2008-12-11  3:05   ` Li Zefan
  2008-12-11 17:07     ` Paul Menage
  2008-12-11  6:29   ` Balbir Singh
  2 siblings, 1 reply; 24+ messages in thread
From: Li Zefan @ 2008-12-11  3:05 UTC (permalink / raw)
  To: menage; +Cc: kamezawa.hiroyu, balbir, containers, linux-kernel, akpm

> +static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
> +{
> +	/* We need to take each hierarchy_mutex in a consistent order */

This comment is not very clear. Do you mean that for_each_subsys() can't
be used here?

But this function is used in cgroup.c internally, and always called
with cgroup_lock held, so it's OK to use for_each_subsys().

> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_lock_nested(&ss->hierarchy_mutex, i);
> +	}
> +}



* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage
  2008-12-11  0:52   ` KAMEZAWA Hiroyuki
@ 2008-12-11  5:15   ` KAMEZAWA Hiroyuki
  2008-12-11  7:28     ` Paul Menage
  1 sibling, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  5:15 UTC (permalink / raw)
  To: menage; +Cc: balbir, containers, linux-kernel, akpm

On Wed, 10 Dec 2008 15:36:57 -0800
menage@google.com wrote:

> This patch adds css_tryget(), that obtains a counted reference on a
> CSS.  It is used in situations where the caller has a "weak" reference
> to the CSS, i.e. one that does not protect the cgroup from removal via
> a reference count, but would instead be cleaned up by a destroy()
> callback.
> 
> css_tryget() will return true on success, or false if the cgroup is
> being removed.
> 
> This is similar to Kamezawa Hiroyuki's patch from a week or two ago,
> but with the difference that in the event of css_tryget() racing with
> a cgroup_rmdir(), css_tryget() will only return false if the cgroup
> really does get removed.
> 
> This implementation is done by biasing css->refcnt, so that a refcnt
> of 1 means "releasable" and 0 means "released or releasing". In the
> event of a race, css_tryget() distinguishes between "released" and
> "releasing" by checking for the CSS_REMOVED flag in css->flags.
> 
> Signed-off-by: Paul Menage <menage@google.com>
> 
> ---
>  include/linux/cgroup.h |   33 ++++++++++++++++++++++-----
>  kernel/cgroup.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 80 insertions(+), 11 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -52,9 +52,9 @@ struct cgroup_subsys_state {
>  	 * hierarchy structure */
>  	struct cgroup *cgroup;
>  
> -	/* State maintained by the cgroup system to allow
> -	 * subsystems to be "busy". Should be accessed via css_get()
> -	 * and css_put() */
> +	/* State maintained by the cgroup system to allow subsystems
> +	 * to be "busy". Should be accessed via css_get(),
> +	 * css_tryget() and css_put(). */
>  
>  	atomic_t refcnt;
>  
> @@ -64,11 +64,14 @@ struct cgroup_subsys_state {
>  /* bits in struct cgroup_subsys_state flags field */
>  enum {
>  	CSS_ROOT, /* This CSS is the root of the subsystem */
> +	CSS_REMOVED, /* This CSS is dead */
>  };
>  

Could you add this ?
==
bool	css_is_removed(struct cgroup_subsys_state *css)
{
	return test_bit(CSS_REMOVED, &css->flags);
}
==
Then, I'll use this instead of memcg->obsolete flag.

Thanks,
-Kame

>  /*
> - * Call css_get() to hold a reference on the cgroup;
> - *
> + * Call css_get() to hold a reference on the css; it can be used
> + * for a reference obtained via:
> + * - an existing ref-counted reference to the css
> + * - task->cgroups for a locked task
>   */
>  
>  static inline void css_get(struct cgroup_subsys_state *css)
> @@ -77,9 +80,27 @@ static inline void css_get(struct cgroup
>  	if (!test_bit(CSS_ROOT, &css->flags))
>  		atomic_inc(&css->refcnt);
>  }
> +
> +/*
> + * Call css_tryget() to take a reference on a css if your existing
> + * (known-valid) reference isn't already ref-counted. Returns false if
> + * the css has been destroyed.
> + */
> +
> +static inline bool css_tryget(struct cgroup_subsys_state *css)
> +{
> +	if (test_bit(CSS_ROOT, &css->flags))
> +		return true;
> +	while (!atomic_inc_not_zero(&css->refcnt)) {
> +		if (test_bit(CSS_REMOVED, &css->flags))
> +			return false;
> +	}
> +	return true;
> +}
> +
>  /*
>   * css_put() should be called to release a reference taken by
> - * css_get()
> + * css_get() or css_tryget()
>   */
>  
>  extern void __css_put(struct cgroup_subsys_state *css);
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -2321,7 +2321,7 @@ static void init_cgroup_css(struct cgrou
>  			       struct cgroup *cgrp)
>  {
>  	css->cgroup = cgrp;
> -	atomic_set(&css->refcnt, 0);
> +	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
> @@ -2453,7 +2453,7 @@ static int cgroup_has_css_refs(struct cg
>  {
>  	/* Check the reference count on each subsystem. Since we
>  	 * already established that there are no tasks in the
> -	 * cgroup, if the css refcount is also 0, then there should
> +	 * cgroup, if the css refcount is also 1, then there should
>  	 * be no outstanding references, so the subsystem is safe to
>  	 * destroy. We scan across all subsystems rather than using
>  	 * the per-hierarchy linked list of mounted subsystems since
> @@ -2474,12 +2474,59 @@ static int cgroup_has_css_refs(struct cg
>  		 * matter, since it can only happen if the cgroup
>  		 * has been deleted and hence no longer needs the
>  		 * release agent to be called anyway. */
> -		if (css && atomic_read(&css->refcnt))
> +		if (css && (atomic_read(&css->refcnt) > 1))
>  			return 1;
>  	}
>  	return 0;
>  }
>  
> +/*
> + * Atomically mark all of the cgroup's CSS objects as
> + * CSS_REMOVED. Return true on success, or false if the cgroup has
> + * busy subsystems. Call with cgroup_mutex held
> + */
> +
> +static int cgroup_clear_css_refs(struct cgroup *cgrp)
> +{
> +	struct cgroup_subsys *ss;
> +	unsigned long flags;
> +	bool failed = false;
> +	local_irq_save(flags);
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		int refcnt;
> +		do {
> +			/* We can only remove a CSS with a refcnt==1 */
> +			refcnt = atomic_read(&css->refcnt);
> +			if (refcnt > 1) {
> +				failed = true;
> +				goto done;
> +			}
> +			BUG_ON(!refcnt);
> +			/*
> +			 * Drop the refcnt to 0 while we check other
> +			 * subsystems. This will cause any racing
> +			 * css_tryget() to spin until we set the
> +			 * CSS_REMOVED bits or abort
> +			 */
> +		} while (atomic_cmpxchg(&css->refcnt, refcnt, 0) != refcnt);
> +	}
> + done:
> +	for_each_subsys(cgrp->root, ss) {
> +		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +		if (failed) {
> +			/* Restore old refcnt if necessary */
> +			if (!atomic_read(&css->refcnt))
> +				atomic_set(&css->refcnt, 1);
> +		} else {
> +			/* Commit the fact that the CSS is removed */
> +			set_bit(CSS_REMOVED, &css->flags);
> +		}
> +	}
> +	local_irq_restore(flags);
> +	return !failed;
> +}
> +
>  static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
>  {
>  	struct cgroup *cgrp = dentry->d_fsdata;
> @@ -2510,7 +2557,7 @@ static int cgroup_rmdir(struct inode *un
>  
>  	if (atomic_read(&cgrp->count)
>  	    || !list_empty(&cgrp->children)
> -	    || cgroup_has_css_refs(cgrp)) {
> +	    || !cgroup_clear_css_refs(cgrp)) {
>  		mutex_unlock(&cgroup_mutex);
>  		return -EBUSY;
>  	}
> @@ -3065,7 +3112,8 @@ void __css_put(struct cgroup_subsys_stat
>  {
>  	struct cgroup *cgrp = css->cgroup;
>  	rcu_read_lock();
> -	if (atomic_dec_and_test(&css->refcnt) && notify_on_release(cgrp)) {
> +	if ((atomic_dec_return(&css->refcnt) == 1) &&
> +	    notify_on_release(cgrp)) {
>  		set_bit(CGRP_RELEASABLE, &cgrp->flags);
>  		check_for_release(cgrp);
>  	}
> 
> --
> 



* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
  2008-12-11  0:37   ` KAMEZAWA Hiroyuki
  2008-12-11  3:05   ` Li Zefan
@ 2008-12-11  6:29   ` Balbir Singh
  2008-12-11 17:09     ` Paul Menage
  2 siblings, 1 reply; 24+ messages in thread
From: Balbir Singh @ 2008-12-11  6:29 UTC (permalink / raw)
  To: menage; +Cc: kamezawa.hiroyu, containers, linux-kernel, akpm

* menage@google.com <menage@google.com> [2008-12-10 15:36:55]:

> This patch adds a hierarchy_mutex to the cgroup_subsys object that
> protects changes to the hierarchy observed by that subsystem. It is
> taken by the cgroup subsystem (in addition to cgroup_mutex) for the
> following operations:
> 
> - linking a cgroup into that subsystem's cgroup tree
> - unlinking a cgroup from that subsystem's cgroup tree
> - moving the subsystem to/from a hierarchy (including across the
>   bind() callback)
> 
> Thus if the subsystem holds its own hierarchy_mutex, it can safely
> traverse its own hierarchy.
>

This sounds reasonable. A further abstraction in the future could be to
provide the visitor pattern: let cgroups do the walking and have
callbacks invoked during the visit.
 
> Signed-off-by: Paul Menage <menage@google.com>
> 
> ---
> 
>  Documentation/cgroups/cgroups.txt |    2 +-
>  include/linux/cgroup.h            |    9 ++++++++-
>  kernel/cgroup.c                   |   37 +++++++++++++++++++++++++++++++++++--
>  3 files changed, 44 insertions(+), 4 deletions(-)
> 
> Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> @@ -337,8 +337,15 @@ struct cgroup_subsys {
>  #define MAX_CGROUP_TYPE_NAMELEN 32
>  	const char *name;
> 
> -	struct cgroupfs_root *root;
> +	/*
> +	 * Protects sibling/children links of cgroups in this
> +	 * hierarchy, plus protects which hierarchy (or none) the
> +	 * subsystem is a part of (i.e. root/sibling)
> +	 */
> +	struct mutex hierarchy_mutex;
> 
> +	/* Protected by this->hierarchy_mutex and cgroup_lock() */
> +	struct cgroupfs_root *root;
>  	struct list_head sibling;
>  };
> 
> Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
>  			BUG_ON(cgrp->subsys[i]);
>  			BUG_ON(!dummytop->subsys[i]);
>  			BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
> +			mutex_lock(&ss->hierarchy_mutex);
>  			cgrp->subsys[i] = dummytop->subsys[i];
>  			cgrp->subsys[i]->cgroup = cgrp;
>  			list_move(&ss->sibling, &root->subsys_list);
>  			ss->root = root;
>  			if (ss->bind)
>  				ss->bind(ss, cgrp);
> -
> +			mutex_unlock(&ss->hierarchy_mutex);
>  		} else if (bit & removed_bits) {
>  			/* We're removing this subsystem */
>  			BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
>  			BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
> +			mutex_lock(&ss->hierarchy_mutex);
>  			if (ss->bind)
>  				ss->bind(ss, dummytop);
>  			dummytop->subsys[i]->cgroup = dummytop;
>  			cgrp->subsys[i] = NULL;
>  			subsys[i]->root = &rootnode;
>  			list_move(&ss->sibling, &rootnode.subsys_list);
> +			mutex_unlock(&ss->hierarchy_mutex);
>  		} else if (bit & final_bits) {
>  			/* Subsystem state should already exist */
>  			BUG_ON(!cgrp->subsys[i]);
> @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
>  	cgrp->subsys[ss->subsys_id] = css;
>  }
> 
> +static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
> +{
> +	/* We need to take each hierarchy_mutex in a consistent order */
> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_lock_nested(&ss->hierarchy_mutex, i);
> +	}
> +}
> +
> +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
> +{
> +	int i;
> +
> +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> +		struct cgroup_subsys *ss = subsys[i];
> +		if (ss->root == root)
> +			mutex_unlock(&ss->hierarchy_mutex);
> +	}
> +}
> +
>  /*
>   * cgroup_create - create a cgroup
>   * @parent: cgroup that will be parent of the new cgroup
> @@ -2374,7 +2400,9 @@ static long cgroup_create(struct cgroup 
>  		init_cgroup_css(css, ss, cgrp);
>  	}
> 
> +	cgroup_lock_hierarchy(root);
>  	list_add(&cgrp->sibling, &cgrp->parent->children);
> +	cgroup_unlock_hierarchy(root);
>  	root->number_of_cgroups++;
> 
>  	err = cgroup_create_dir(cgrp, dentry, mode);
> @@ -2492,8 +2520,12 @@ static int cgroup_rmdir(struct inode *un
>  	if (!list_empty(&cgrp->release_list))
>  		list_del(&cgrp->release_list);
>  	spin_unlock(&release_list_lock);
> -	/* delete my sibling from parent->children */
> +
> +	cgroup_lock_hierarchy(cgrp->root);
> +	/* delete this cgroup from parent->children */
>  	list_del(&cgrp->sibling);
> +	cgroup_unlock_hierarchy(cgrp->root);
> +
>  	spin_lock(&cgrp->dentry->d_lock);
>  	d = dget(cgrp->dentry);
>  	spin_unlock(&d->d_lock);
> @@ -2535,6 +2567,7 @@ static void __init cgroup_init_subsys(st
>  	 * need to invoke fork callbacks here. */
>  	BUG_ON(!list_empty(&init_task.tasks));
> 
> +	mutex_init(&ss->hierarchy_mutex);
>  	ss->active = 1;
>  }
> 
> Index: hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- hierarchy_lock-mmotm-2008-12-09.orig/Documentation/cgroups/cgroups.txt
> +++ hierarchy_lock-mmotm-2008-12-09/Documentation/cgroups/cgroups.txt
> @@ -528,7 +528,7 @@ example in cpusets, no task may attach b
>  up.
> 
>  void bind(struct cgroup_subsys *ss, struct cgroup *root)
> -(cgroup_mutex held by caller)
> +(cgroup_mutex and ss->hierarchy_mutex held by caller)
>

Seems reasonable. I was wondering whether, instead of acquiring a mutex
per subsystem that shares the root, we could collapse them into a single
mutex and prevent the cgroup from changing its binding. Those are
optimizations that we can think about later.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> 

-- 
	Balbir


* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-11  0:37   ` KAMEZAWA Hiroyuki
  2008-12-11  0:44     ` Paul Menage
@ 2008-12-11  6:30     ` Balbir Singh
  1 sibling, 0 replies; 24+ messages in thread
From: Balbir Singh @ 2008-12-11  6:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: menage, containers, linux-kernel, akpm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 09:37:33]:

> thank you for this work.
> 
> On Wed, 10 Dec 2008 15:36:55 -0800
> menage@google.com wrote:
> 
> > This patch adds a hierarchy_mutex to the cgroup_subsys object that
> > protects changes to the hierarchy observed by that subsystem. It is
> > taken by the cgroup subsystem (in addition to cgroup_mutex) for the
> > following operations:
> > 
> > - linking a cgroup into that subsystem's cgroup tree
> > - unlinking a cgroup from that subsystem's cgroup tree
> > - moving the subsystem to/from a hierarchy (including across the
> >   bind() callback)
> > 
> o.k. usage is very clear.
> 
> > Thus if the subsystem holds its own hierarchy_mutex, it can safely
> > traverse its own hierarchy.
> > 
> > Signed-off-by: Paul Menage <menage@google.com>
> > 
> > ---
> > 
> >  Documentation/cgroups/cgroups.txt |    2 +-
> >  include/linux/cgroup.h            |    9 ++++++++-
> >  kernel/cgroup.c                   |   37 +++++++++++++++++++++++++++++++++++--
> >  3 files changed, 44 insertions(+), 4 deletions(-)
> > 
> > Index: hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> > ===================================================================
> > --- hierarchy_lock-mmotm-2008-12-09.orig/include/linux/cgroup.h
> > +++ hierarchy_lock-mmotm-2008-12-09/include/linux/cgroup.h
> > @@ -337,8 +337,15 @@ struct cgroup_subsys {
> >  #define MAX_CGROUP_TYPE_NAMELEN 32
> >  	const char *name;
> >  
> > -	struct cgroupfs_root *root;
> > +	/*
> > +	 * Protects sibling/children links of cgroups in this
> > +	 * hierarchy, plus protects which hierarchy (or none) the
> > +	 * subsystem is a part of (i.e. root/sibling)
> > +	 */
> > +	struct mutex hierarchy_mutex;
> >  
> > +	/* Protected by this->hierarchy_mutex and cgroup_lock() */
> > +	struct cgroupfs_root *root;
> >  	struct list_head sibling;
> >  };
> >  
> > Index: hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> > ===================================================================
> > --- hierarchy_lock-mmotm-2008-12-09.orig/kernel/cgroup.c
> > +++ hierarchy_lock-mmotm-2008-12-09/kernel/cgroup.c
> > @@ -714,23 +714,26 @@ static int rebind_subsystems(struct cgro
> >  			BUG_ON(cgrp->subsys[i]);
> >  			BUG_ON(!dummytop->subsys[i]);
> >  			BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
> > +			mutex_lock(&ss->hierarchy_mutex);
> >  			cgrp->subsys[i] = dummytop->subsys[i];
> >  			cgrp->subsys[i]->cgroup = cgrp;
> >  			list_move(&ss->sibling, &root->subsys_list);
> >  			ss->root = root;
> >  			if (ss->bind)
> >  				ss->bind(ss, cgrp);
> > -
> > +			mutex_unlock(&ss->hierarchy_mutex);
> >  		} else if (bit & removed_bits) {
> >  			/* We're removing this subsystem */
> >  			BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
> >  			BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
> > +			mutex_lock(&ss->hierarchy_mutex);
> >  			if (ss->bind)
> >  				ss->bind(ss, dummytop);
> >  			dummytop->subsys[i]->cgroup = dummytop;
> >  			cgrp->subsys[i] = NULL;
> >  			subsys[i]->root = &rootnode;
> >  			list_move(&ss->sibling, &rootnode.subsys_list);
> > +			mutex_unlock(&ss->hierarchy_mutex);
> >  		} else if (bit & final_bits) {
> >  			/* Subsystem state should already exist */
> >  			BUG_ON(!cgrp->subsys[i]);
> > @@ -2326,6 +2329,29 @@ static void init_cgroup_css(struct cgrou
> >  	cgrp->subsys[ss->subsys_id] = css;
> >  }
> >  
> > +static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
> > +{
> > +	/* We need to take each hierarchy_mutex in a consistent order */
> > +	int i;
> > +
> > +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > +		struct cgroup_subsys *ss = subsys[i];
> > +		if (ss->root == root)
> > +			mutex_lock_nested(&ss->hierarchy_mutex, i);
> > +	}
> > +}
> > +
> > +static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
> > +		struct cgroup_subsys *ss = subsys[i];
> > +		if (ss->root == root)
> > +			mutex_unlock(&ss->hierarchy_mutex);
> > +	}
> > +}
> > +
> Maybe no problem, but I don't like releasing the locks in the same order they were acquired.
> 
> 	for (i = CGROUP_SUBSYS_COUNT - 1; i >= 0; i--)  ?
>

Good point!

-- 
	Balbir


* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-11  1:05       ` KAMEZAWA Hiroyuki
@ 2008-12-11  6:33         ` Balbir Singh
  2008-12-11  6:47           ` Li Zefan
  2008-12-11  6:53           ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 24+ messages in thread
From: Balbir Singh @ 2008-12-11  6:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Paul Menage, containers, linux-kernel, akpm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 10:05:01]:

> On Wed, 10 Dec 2008 16:52:57 -0800
> Paul Menage <menage@google.com> wrote:
> 
> > On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > >        an operation like rmdir() in somewhere.
> > >                hierarchy_lock for A (acquired)
> > >                hierarchy_lock for B (waiting)
> > >
> > >        in subsys A.
> > >                mmap_sem (acquired)
> > >                hierarchy_lock for A (waiting)
> > >        in subsys B.
> > >                hierarchy_lock for B (aquired)
> > >                mmap_sem             (waiting)
> > >
> > 
> > That's a valid deadlock - you'd need to require the mmap_sem nests
> > either inside all hierarchy_mutexes or else outside all of them.
> > 
> This was a deadlock found between memcg and cpuset.
> 
> another one was 
> 
> 	an operation like rmdir() in somewhere.
> 		hierarchy_lock for memcg (acquired)
> 		hierarchy_lock for B (waiting)
> 
> 	in subsys B.
> 		hierarchy_lock for B (aquired)

But then the hierarchy_locks acquired will be different, right?

> 		have to do some memory reclaim -> hierarchy_lock for memcg (waiting)
> 
> I have no objections to hierarchy_lock itself but calling context to memcg is very
> complicated and simple replace of these locks will be just a small help.

Could you please explain the race better?

-- 
	Balbir


* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-11  6:33         ` Balbir Singh
@ 2008-12-11  6:47           ` Li Zefan
  2008-12-11  6:53           ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 24+ messages in thread
From: Li Zefan @ 2008-12-11  6:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Paul Menage, containers, linux-kernel, akpm

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 10:05:01]:
> 
>> On Wed, 10 Dec 2008 16:52:57 -0800
>> Paul Menage <menage@google.com> wrote:
>>
>>> On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>>        an operation like rmdir() in somewhere.
>>>>                hierarchy_lock for A (acquired)
>>>>                hierarchy_lock for B (waiting)
>>>>
>>>>        in subsys A.
>>>>                mmap_sem (acquired)
>>>>                hierarchy_lock for A (waiting)
>>>>        in subsys B.
>>>>                hierarchy_lock for B (acquired)
>>>>                mmap_sem             (waiting)
>>>>
>>> That's a valid deadlock - you'd need to require the mmap_sem nests
>>> either inside all hierarchy_mutexes or else outside all of them.
>>>
>> This was a deadlock found between memcg and cpuset.
>>
>> another one was 
>>
>> 	an operation like rmdir() in somewhere.
>> 		hierarchy_lock for memcg (acquired)
>> 		hierarchy_lock for B (waiting)
>>
>> 	in subsys B.
>> 		hierarchy_lock for B (acquired)
> 
> But then the hierarchy_locks acquired will be different right?
> 

Yes, I'm worried about this too. The lock order imposed by cgroup_lock_hierarchy() is:
	lock A -> lock B -> lock C
But a call chain may end up with:
	... -> lock B -> ... -> lock A -> ...

So although this hierarchy lock proposal can solve the specific deadlock between
cpuset and memcg (by having cpuset hold cgroup_lock and memcg hold hierarchy_lock),
we'll probably encounter the other deadlocks described above.

>> 		have to do some memory reclaim -> hierarchy_lock for memcg (waiting)
>>
>> I have no objections to hierarchy_lock itself but calling context to memcg is very
>> complicated and simple replace of these locks will be just a small help.
> 
> Could you please explain the race better?
> 


* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller
  2008-12-11  6:33         ` Balbir Singh
  2008-12-11  6:47           ` Li Zefan
@ 2008-12-11  6:53           ` KAMEZAWA Hiroyuki
  2008-12-11 17:05             ` Paul Menage
  1 sibling, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  6:53 UTC (permalink / raw)
  To: balbir; +Cc: Paul Menage, containers, linux-kernel, akpm

On Thu, 11 Dec 2008 12:03:07 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-12-11 10:05:01]:
> 
> > On Wed, 10 Dec 2008 16:52:57 -0800
> > Paul Menage <menage@google.com> wrote:
> > 
> > > On Wed, Dec 10, 2008 at 4:49 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > >
> > > >        an operation like rmdir() in somewhere.
> > > >                hierarchy_lock for A (acquired)
> > > >                hierarchy_lock for B (waiting)
> > > >
> > > >        in subsys A.
> > > >                mmap_sem (acquired)
> > > >                hierarchy_lock for A (waiting)
> > > >        in subsys B.
> > > >                hierarchy_lock for B (acquired)
> > > >                mmap_sem             (waiting)
> > > >
> > > 
> > > That's a valid deadlock - you'd need to require the mmap_sem nests
> > > either inside all hierarchy_mutexes or else outside all of them.
> > > 
> > This was a deadlock found between memcg and cpuset.
> > 
> > another one was 
> > 
> > 	an operation like rmdir() in somewhere.
> > 		hierarchy_lock for memcg (acquired)
> > 		hierarchy_lock for B (waiting)
> > 
> > 	in subsys B.
> > 		hierarchy_lock for B (acquired)
> 
> But then the hierarchy_locks acquired will be different right?
> 
yes.

> > 		have to do some memory reclaim -> hierarchy_lock for memcg (waiting)
> > 
> > I have no objections to hierarchy_lock itself but calling context to memcg is very
> > complicated and simple replace of these locks will be just a small help.
> 
> Could you please explain the race better?
> 
I explained above.
One example was mmap_sem; the one above is the recursive lock in cpuset that
you tried to fix. (Fortunately, cpuset's subsys ID is smaller than memcg's,
so you won't see it in a real environment.)
And there are other, as-yet-unknown subsystems planned: bio, checkpoint, etc.

Could you write a document explaining what kinds of lock nesting are allowed,
before merging this?

Thanks,
-Kame



* [RFC][PATCH]example: use css_tryget() in memcg Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-11  0:52   ` KAMEZAWA Hiroyuki
@ 2008-12-11  7:02     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  7:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: menage, containers, linux-kernel, akpm, balbir

An example, with css_is_removed() added (by another patch):

bool	css_is_removed(struct cgroup_subsys_state *css)
{
	return test_bit(CSS_REMOVED, &css->flags);
}
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Use css_tryget() in memcg.

css_tryget() is newly added; it lets us check whether a css is alive and
take a refcount on it in a safe way.
("alive" here means "rmdir/destroy" has not been called.)

This patch replaces css_get() with css_tryget() where I cannot explain
why css_get() is safe, and removes the memcg->obsolete flag.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   98 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 61 insertions(+), 37 deletions(-)

Index: mmotm-2.6.28-Dec10/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Dec10.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Dec10/mm/memcontrol.c
@@ -162,7 +162,6 @@ struct mem_cgroup {
 	 */
 	bool use_hierarchy;
 	unsigned long	last_oom_jiffies;
-	int		obsolete;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -286,6 +285,31 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
+static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem = NULL;
+	/*
+	 * Because we hold no locks, mm->owner may be being moved to another
+	 * cgroup. We use css_tryget() here even though it looks
+	 * pessimistic (rather than adding locks here).
+	 */
+	rcu_read_lock();
+	do {
+		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+		if (unlikely(!mem))
+			break;
+	} while (!css_tryget(&mem->css));
+	rcu_read_unlock();
+	return mem;
+}
+
+static bool mem_cgroup_is_obsolete(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return true;
+	return css_is_removed(&mem->css);
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -592,8 +616,9 @@ mem_cgroup_get_first_node(struct mem_cgr
 {
 	struct cgroup *cgroup;
 	struct mem_cgroup *ret;
-	bool obsolete = (root_mem->last_scanned_child &&
-				root_mem->last_scanned_child->obsolete);
+	bool obsolete;
+
+	obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
 
 	/*
 	 * Scan all children under the mem_cgroup mem
@@ -681,7 +706,7 @@ static int mem_cgroup_hierarchical_recla
 	next_mem = mem_cgroup_get_first_node(root_mem);
 
 	while (next_mem != root_mem) {
-		if (next_mem->obsolete) {
+		if (mem_cgroup_is_obsolete(next_mem)) {
 			mem_cgroup_put(next_mem);
 			next_mem = mem_cgroup_get_first_node(root_mem);
 			continue;
@@ -737,23 +762,17 @@ static int __mem_cgroup_try_charge(struc
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the init_mm (happens for pagecache usage).
 	 */
-	if (likely(!*memcg)) {
-		rcu_read_lock();
-		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-		if (unlikely(!mem)) {
-			rcu_read_unlock();
-			return 0;
-		}
-		/*
-		 * For every charge from the cgroup, increment reference count
-		 */
-		css_get(&mem->css);
+	mem = *memcg;
+	if (likely(!mem)) {
+		mem = try_get_mem_cgroup_from_mm(mm);
 		*memcg = mem;
-		rcu_read_unlock();
 	} else {
-		mem = *memcg;
 		css_get(&mem->css);
 	}
+	if (unlikely(!mem))
+		return 0;
+
+	VM_BUG_ON(mem_cgroup_is_obsolete(mem));
 
 	while (1) {
 		int ret;
@@ -1040,12 +1059,19 @@ int mem_cgroup_cache_charge(struct page 
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
+/*
+ * During swap-in (try_charge -> commit or cancel), the page is locked.
+ * When try_charge() returns successfully, one refcount on the memcg, not
+ * attached to any struct page_cgroup, is acquired. This refcount will be
+ * consumed by "commit()" or dropped by "cancel()".
+ */
 int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 				 struct page *page,
 				 gfp_t mask, struct mem_cgroup **ptr)
 {
 	struct mem_cgroup *mem;
 	swp_entry_t     ent;
+	int ret;
 
 	if (mem_cgroup_disabled())
 		return 0;
@@ -1064,10 +1090,15 @@ int mem_cgroup_try_charge_swapin(struct 
 	ent.val = page_private(page);
 
 	mem = lookup_swap_cgroup(ent);
-	if (!mem || mem->obsolete)
+	if (!mem)
+		goto charge_cur_mm;
+	if (!css_tryget(&mem->css))
 		goto charge_cur_mm;
 	*ptr = mem;
-	return __mem_cgroup_try_charge(NULL, mask, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+	/* drop extra refcnt from tryget */
+	css_put(&mem->css);
+	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
@@ -1098,13 +1129,18 @@ int mem_cgroup_cache_charge_swapin(struc
 		ent.val = page_private(page);
 		if (do_swap_account) {
 			mem = lookup_swap_cgroup(ent);
-			if (mem && mem->obsolete)
-				mem = NULL;
-			if (mem)
-				mm = NULL;
+			if (mem) {
+				if (css_tryget(&mem->css))
+					mm = NULL; /* charge to recorded */
+				else
+					mem = NULL; /* charge to current */
+			}
 		}
 		ret = mem_cgroup_charge_common(page, mm, mask,
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, mem);
+		/* drop extra refcnt from tryget */
+		if (mem)
+			css_put(&mem->css);
 
 		if (!ret && do_swap_account) {
 			/* avoid double counting */
@@ -1145,7 +1181,6 @@ void mem_cgroup_commit_charge_swapin(str
 		struct mem_cgroup *memcg;
 		memcg = swap_cgroup_record(ent, NULL);
 		if (memcg) {
-			/* If memcg is obsolete, memcg can be != ptr */
 			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 			mem_cgroup_put(memcg);
 		}
@@ -1388,14 +1423,9 @@ int mem_cgroup_shrink_usage(struct mm_st
 	if (!mm)
 		return 0;
 
-	rcu_read_lock();
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (unlikely(!mem)) {
-		rcu_read_unlock();
+	mem = try_get_mem_cgroup_from_mm(mm);
+	if (unlikely(!mem))
 		return 0;
-	}
-	css_get(&mem->css);
-	rcu_read_unlock();
 
 	do {
 		progress = mem_cgroup_hierarchical_reclaim(mem, gfp_mask, true);
@@ -2081,9 +2111,6 @@ static struct mem_cgroup *mem_cgroup_all
  * the number of reference from swap_cgroup and free mem_cgroup when
  * it goes down to 0.
  *
- * When mem_cgroup is destroyed, mem->obsolete will be set to 0 and
- * entry which points to this memcg will be ignore at swapin.
- *
  * Removal of cgroup itself succeeds regardless of refs from swap.
  */
 
@@ -2112,8 +2139,6 @@ static void mem_cgroup_get(struct mem_cg
 static void mem_cgroup_put(struct mem_cgroup *mem)
 {
 	if (atomic_dec_and_test(&mem->refcnt)) {
-		if (!mem->obsolete)
-			return;
 		mem_cgroup_free(mem);
 	}
 }
@@ -2179,7 +2204,6 @@ static void mem_cgroup_pre_destroy(struc
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
-	mem->obsolete = 1;
 	mem_cgroup_force_empty(mem, false);
 }
 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-11  5:15   ` KAMEZAWA Hiroyuki
@ 2008-12-11  7:28     ` Paul Menage
  2008-12-11  7:29       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Menage @ 2008-12-11  7:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 9:15 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> Could you add this ?
> ==
> bool    css_is_removed(struct cgroup_subsys_state *css)
> {
>        return test_bit(CSS_REMOVED, &css->flags);
> }
> ==
> Then, I'll use this instead of memcg->obsolete flag.

I actually had a patch almost ready to send out that nuked memcg->obsolete.

A difference in mine is that I did

#define css_is_removed(__css) test_bit(CSS_REMOVED, &(__css)->css.flags)

so that you can use it on any subsystem state object that contains a
"struct cgroup_subsys_state css;" field, which is currently all of
them I think, without having to add the "->css" every time.

Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 3/3] CGroups: Add css_tryget()
  2008-12-11  7:28     ` Paul Menage
@ 2008-12-11  7:29       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-11  7:29 UTC (permalink / raw)
  To: Paul Menage; +Cc: balbir, containers, linux-kernel, akpm

On Wed, 10 Dec 2008 23:28:12 -0800
Paul Menage <menage@google.com> wrote:

> On Wed, Dec 10, 2008 at 9:15 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > Could you add this ?
> > ==
> > bool    css_is_removed(struct cgroup_subsys_state *css)
> > {
> >        return test_bit(CSS_REMOVED, &css->flags);
> > }
> > ==
> > Then, I'll use this instead of memcg->obsolete flag.
> 
> I actually had a patch almost ready to send out the nuked memcg->obsolete.
> 
> A difference in mine is that I did
> 
> #define css_is_removed(__css) test_bit(CSS_REMOVED, &(__css)->css.flags)
> 
> so that you can use it on any subsystem state object that contains a
> "struct cgroup_subsys_state css;" field, which is currently all of
> them I think, without having to add the "->css" every time.
> 
ok, thx. very useful :)

-Kame


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory  controller
  2008-12-11  6:53           ` KAMEZAWA Hiroyuki
@ 2008-12-11 17:05             ` Paul Menage
  2008-12-12  1:12               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: Paul Menage @ 2008-12-11 17:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: balbir, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 10:53 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> Could you write a document to explain what kind of nest of locks are allowed
> before merging this ?

We should probably treat it similarly to the cpuset "callback_mutex".
I.e. memory allocations are most likely forbidden when you're holding
it.

Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-11  3:05   ` Li Zefan
@ 2008-12-11 17:07     ` Paul Menage
  0 siblings, 0 replies; 24+ messages in thread
From: Paul Menage @ 2008-12-11 17:07 UTC (permalink / raw)
  To: Li Zefan; +Cc: kamezawa.hiroyu, balbir, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 7:05 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>
> But this function is used in cgroup.c internally, and always called
> with cgroup_lock held, so it's OK to use for_each_subsys().

This is true, but it doesn't hurt to be future-proof and not require
cgroup_mutex to be held.

Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex
  2008-12-11  6:29   ` Balbir Singh
@ 2008-12-11 17:09     ` Paul Menage
  0 siblings, 0 replies; 24+ messages in thread
From: Paul Menage @ 2008-12-11 17:09 UTC (permalink / raw)
  To: balbir, menage, kamezawa.hiroyu, containers, linux-kernel, akpm

On Wed, Dec 10, 2008 at 10:29 PM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
>
> Seems reasonable, I was wondering if instead of acquiring a mutex per
> subsystem that shares the root, if we can collapse it to a single
> mutex and prevent cgroup from changing binding. Those are
> optimizations that we can think of later

You mean a single hierarchy_mutex per root? But given that one of the
things being protected is which root you're attached to, if you're
trying to prevent hierarchy changes then which root's hierarchy_mutex
would you lock?

Paul

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory  controller
  2008-12-11 17:05             ` Paul Menage
@ 2008-12-12  1:12               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-12  1:12 UTC (permalink / raw)
  To: Paul Menage; +Cc: balbir, containers, linux-kernel, akpm

On Thu, 11 Dec 2008 09:05:47 -0800
Paul Menage <menage@google.com> wrote:

> On Wed, Dec 10, 2008 at 10:53 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > Could you write a document to explain what kind of nest of locks are allowed
> > before merging this ?
> 
> We should probably treat it similarly to the cpuset "callback_mutex".
> I.e. memory allocations are most likely forbidden when you're holding
> it.
> 
Hmm. Anyway, the hierarchy mutex should not be acquired deep in the kernel call
stack. (And my patch will make memcg's hierarchy walk lockless.)

-Kame


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-12-12  1:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-10 23:36 [RFC][PATCH 0/3] CGroups: CGroups: Hierarchy locking/refcount changes menage
2008-12-10 23:36 ` [RFC][PATCH 1/3] CGroups: Add a per-subsystem hierarchy_mutex menage
2008-12-11  0:37   ` KAMEZAWA Hiroyuki
2008-12-11  0:44     ` Paul Menage
2008-12-11  6:30     ` Balbir Singh
2008-12-11  3:05   ` Li Zefan
2008-12-11 17:07     ` Paul Menage
2008-12-11  6:29   ` Balbir Singh
2008-12-11 17:09     ` Paul Menage
2008-12-10 23:36 ` [RFC][PATCH 2/3] CGroups: Use hierarchy_mutex in memory controller menage
2008-12-11  0:49   ` KAMEZAWA Hiroyuki
2008-12-11  0:52     ` Paul Menage
2008-12-11  1:05       ` KAMEZAWA Hiroyuki
2008-12-11  6:33         ` Balbir Singh
2008-12-11  6:47           ` Li Zefan
2008-12-11  6:53           ` KAMEZAWA Hiroyuki
2008-12-11 17:05             ` Paul Menage
2008-12-12  1:12               ` KAMEZAWA Hiroyuki
2008-12-10 23:36 ` [RFC][PATCH 3/3] CGroups: Add css_tryget() menage
2008-12-11  0:52   ` KAMEZAWA Hiroyuki
2008-12-11  7:02     ` [RFC][PATCH]example: use css_tryget() in memcg " KAMEZAWA Hiroyuki
2008-12-11  5:15   ` KAMEZAWA Hiroyuki
2008-12-11  7:28     ` Paul Menage
2008-12-11  7:29       ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox