[RFC][PATCH] cgroup and memcg updates 20090108

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC][PATCH] cgroup and memcg updates 20090108
@ 2009-01-08  9:25 KAMEZAWA Hiroyuki
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-08  9:25 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: linux-kernel@vger.kernel.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com,
	menage@google.com

Request for comments for 4 cgroup/memcg patches.
 
 [1/4] CSS ID support for cgroup
    minor update from the last year.

 [2/4] use CSS ID in memcg
    minor update

 [3/4] fix oom-kill problem of memcg/hierarchy=1
    minor update, but I'm still looking for better fix.

 [4/4] fix frequest -EBUSY at rmdir().
     New one. I need advices.

I'd like to remove RFC of some of them in the next week.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-08  9:25 [RFC][PATCH] cgroup and memcg updates 20090108 KAMEZAWA Hiroyuki
@ 2009-01-08  9:28 ` KAMEZAWA Hiroyuki
  2009-01-09  3:59   ` Li Zefan
                     ` (3 more replies)
  2009-01-08  9:30 ` [RFC][PATCH 2/4] memcg: use CSS ID in memcg KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-08  9:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com, menage@google.com


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.

This patch attaches unique ID to each css and provides following.

 - css_lookup(subsys, id)
   returns pointer to struct cgroup_subysys_state of id.
 - css_get_next(subsys, id, rootid, depth, foundid)
   returns the next css under "root" by scanning

When cgrou_subsys->use_id is set, an id for css is maintained.
The cgroup framework only parepares
	- css_id of root css for subsys
	- id is automatically attached at creation of css.
	- id is *not* freed automatically. Because the cgroup framework
	  don't know lifetime of cgroup_subsys_state.
	  free_css_id() function is provided. This must be called by subsys.

There are several reasons to develop this.
	- Saving space .... For example, memcg's swap_cgroup is array of
	  pointers to cgroup. But it is not necessary to be very fast.
	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
	  reduce much amount of memory usage.

	- Scanning without lock.
	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
	  css under root can be written without locks.
	  ex)
	  do {
		rcu_read_lock();
		next = cgroup_get_next(subsys, id, root, &found);
		/* check sanity of next here */
		css_tryget();
		rcu_read_unlock();
		id = found + 1
	 } while(...)

Characteristics: 
	- Each css has unique ID under subsys.
	- Lifetime of ID is controlled by subsys.
	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
	- Allowed ID is 1-65535, ID 0 is UNUSED ID.

Design Choices:
	- scan-by-ID v.s. scan-by-tree-walk.
	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
	  by following kind of routine.
	  scan -> rest a while(release a lock) -> conitunue from interrupted
	  memcg's hierarchical reclaim does this.

	- When subsys->use_id is set, # of css in the system is limited to
	  65535. 

Changelog: (v6) -> (v7)
	- refcnt for CSS ID is removed. Subsys can do it by own logic.
	- New id allocation is done automatically.
	- fixed typos.
	- fixed limit check of ID.

Changelog: (v5) -> (v6)
 	- max depth is removed.
	- changed arguments to "scan"
Changelog: (v4) -> (v5)
	- Totally re-designed as per-css ID.
Changelog:(v3) -> (v4)
	- updated comments.
	- renamed hierarchy_code[] to stack[]
	- merged prepare_id routines.

Changelog (v2) -> (v3)
	- removed cgroup_id_getref().
	- added cgroup_id_tryget().

Changelog (v1) -> (v2):
	- Design change: show only ID(integer) to outside of cgroup.c
	- moved cgroup ID definition from include/ to kernel/cgroup.c
	- struct cgroup_id is freed by RCU.
	- changed interface from pointer to "int"
	- kill_sb() is handled. 
	- ID 0 as unused ID.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/cgroup.h |   44 +++++++
 include/linux/idr.h    |    1 
 kernel/cgroup.c        |  270 ++++++++++++++++++++++++++++++++++++++++++++++++-
 lib/idr.c              |   46 ++++++++
 4 files changed, 360 insertions(+), 1 deletion(-)

Index: mmotm-2.6.28-Jan7/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Jan7.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Jan7/include/linux/cgroup.h
@@ -15,6 +15,7 @@
 #include <linux/cgroupstats.h>
 #include <linux/prio_heap.h>
 #include <linux/rwsem.h>
+#include <linux/idr.h>
 
 #ifdef CONFIG_CGROUPS
 
@@ -22,6 +23,7 @@ struct cgroupfs_root;
 struct cgroup_subsys;
 struct inode;
 struct cgroup;
+struct css_id;
 
 extern int cgroup_init_early(void);
 extern int cgroup_init(void);
@@ -59,6 +61,8 @@ struct cgroup_subsys_state {
 	atomic_t refcnt;
 
 	unsigned long flags;
+	/* ID for this css, if possible */
+	struct css_id *id;
 };
 
 /* bits in struct cgroup_subsys_state flags field */
@@ -363,6 +367,11 @@ struct cgroup_subsys {
 	int active;
 	int disabled;
 	int early_init;
+	/*
+	 * True if this subsys uses ID. ID is not available before cgroup_init()
+	 * (not available in early_init time.)
+	 */
+	bool use_id;
 #define MAX_CGROUP_TYPE_NAMELEN 32
 	const char *name;
 
@@ -384,6 +393,9 @@ struct cgroup_subsys {
 	 */
 	struct cgroupfs_root *root;
 	struct list_head sibling;
+	/* used when use_id == true */
+	struct idr idr;
+	spinlock_t id_lock;
 };
 
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
@@ -437,6 +449,38 @@ void cgroup_iter_end(struct cgroup *cgrp
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 
+/*
+ * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
+ * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
+ * CSS ID is assigned at cgroup allocation (create) automatically
+ * and removed when subsys calls free_css_id() function. This is because
+ * the lifetime of cgroup_subsys_state is  subsys's matter.
+ *
+ * Looking up and scanning function should be called under rcu_read_lock().
+ * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for all calls.
+ */
+
+/* Typically Called at ->destroy(), or somewhere the subsys frees
+  cgroup_subsys_state. */
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
+
+/* Find a cgroup_subsys_state which has given ID */
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
+/*
+ * Get a cgroup whose id is greater than or equal to id under tree of root.
+ * Returning a cgroup_subsys_state or NULL.
+ */
+struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
+		struct cgroup_subsys_state *root, int *foundid);
+
+/* Returns true if root is ancestor of cg */
+bool css_is_ancestor(struct cgroup_subsys_state *cg,
+		     struct cgroup_subsys_state *root);
+
+/* Get id and depth of css */
+unsigned short css_id(struct cgroup_subsys_state *css);
+unsigned short css_depth(struct cgroup_subsys_state *css);
+
 #else /* !CONFIG_CGROUPS */
 
 static inline int cgroup_init_early(void) { return 0; }
Index: mmotm-2.6.28-Jan7/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Jan7/kernel/cgroup.c
@@ -46,7 +46,6 @@
 #include <linux/cgroupstats.h>
 #include <linux/hash.h>
 #include <linux/namei.h>
-
 #include <asm/atomic.h>
 
 static DEFINE_MUTEX(cgroup_mutex);
@@ -185,6 +184,8 @@ struct cg_cgroup_link {
 static struct css_set init_css_set;
 static struct cg_cgroup_link init_css_set_link;
 
+static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
+
 /* css_set_lock protects the list of css_set objects, and the
  * chain of tasks off each css_set.  Nests outside task->alloc_lock
  * due to cgroup_iter_start() */
@@ -567,6 +568,9 @@ static struct backing_dev_info cgroup_ba
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 
+static int alloc_css_id(struct cgroup_subsys *ss,
+			struct cgroup *parent, struct cgroup *child);
+
 static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
@@ -2335,6 +2339,7 @@ static void init_cgroup_css(struct cgrou
 	css->cgroup = cgrp;
 	atomic_set(&css->refcnt, 1);
 	css->flags = 0;
+	css->id = NULL;
 	if (cgrp == dummytop)
 		set_bit(CSS_ROOT, &css->flags);
 	BUG_ON(cgrp->subsys[ss->subsys_id]);
@@ -2410,6 +2415,10 @@ static long cgroup_create(struct cgroup 
 			goto err_destroy;
 		}
 		init_cgroup_css(css, ss, cgrp);
+		if (ss->use_id)
+			if (alloc_css_id(ss, parent, cgrp))
+				goto err_destroy;
+		/* At error, ->destroy() callback has to free assigned ID. */
 	}
 
 	cgroup_lock_hierarchy(root);
@@ -2699,6 +2708,8 @@ int __init cgroup_init(void)
 		struct cgroup_subsys *ss = subsys[i];
 		if (!ss->early_init)
 			cgroup_init_subsys(ss);
+		if (ss->use_id)
+			cgroup_subsys_init_idr(ss);
 	}
 
 	/* Add init_css_set to the hash table */
@@ -3231,3 +3242,260 @@ static int __init cgroup_disable(char *s
 	return 1;
 }
 __setup("cgroup_disable=", cgroup_disable);
+
+/*
+ * CSS ID -- ID per Subsys's Cgroup Subsys State.
+ */
+struct css_id {
+	/*
+	 * The cgroup to which this ID points. If cgroup is removed, this will
+	 * be NULL. This pointer is expected to be RCU-safe because destroy()
+	 * is called after synchronize_rcu(). But for safe use, css_is_removed()
+	 * css_tryget() should be used for avoiding race.
+	 */
+	struct cgroup_subsys_state *css;
+	/*
+	 * ID of this css.
+	 */
+	unsigned short  id;
+	/*
+	 * Depth in hierarchy which this ID belongs to.
+	 */
+	unsigned short depth;
+	/*
+	 * ID is freed by RCU. (and lookup routine is RCU safe.)
+	 */
+	struct rcu_head rcu_head;
+	/*
+	 * Hierarchy of CSS ID belongs to.
+	 */
+	unsigned short  stack[0]; /* Array of Length (depth+1) */
+};
+#define CSS_ID_MAX	(65535)
+
+/*
+ * To get ID other than 0, this should be called when !cgroup_is_removed().
+ */
+unsigned short css_id(struct cgroup_subsys_state *css)
+{
+	struct css_id *cssid = rcu_dereference(css->id);
+
+	if (cssid)
+		return cssid->id;
+	return 0;
+}
+
+unsigned short css_depth(struct cgroup_subsys_state *css)
+{
+	struct css_id *cssid = rcu_dereference(css->id);
+
+	if (cssid)
+		return cssid->depth;
+	return 0;
+}
+
+bool css_is_ancestor(struct cgroup_subsys_state *child,
+		    struct cgroup_subsys_state *root)
+{
+	struct css_id *child_id = rcu_dereference(child->id);
+	struct css_id *root_id = rcu_dereference(root->id);
+
+	if (!child_id || !root_id || (child_id->depth < root_id->depth))
+		return false;
+	return child_id->stack[root_id->depth] == root_id->id;
+}
+
+static void __free_css_id_cb(struct rcu_head *head)
+{
+	struct css_id *id;
+
+	id = container_of(head, struct css_id, rcu_head);
+	kfree(id);
+}
+
+void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
+{
+	struct css_id *id = css->id;
+
+	BUG_ON(!ss->use_id);
+
+	rcu_assign_pointer(id->css, NULL);
+	rcu_assign_pointer(css->id, NULL);
+	spin_lock(&ss->id_lock);
+	idr_remove(&ss->idr, id->id);
+	spin_unlock(&ss->id_lock);
+	call_rcu(&id->rcu_head, __free_css_id_cb);
+}
+
+/*
+ * This is called by init or create(). Then, calls to this function are
+ * always serialized (By cgroup_mutex() at create()).
+ */
+
+static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
+{
+	struct css_id *newid;
+	int myid, error, size;
+
+	BUG_ON(!ss->use_id);
+
+	size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
+	newid = kzalloc(size, GFP_KERNEL);
+	if (!newid)
+		return ERR_PTR(-ENOMEM);
+	/* get id */
+	if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
+		error = -ENOMEM;
+		goto err_out;
+	}
+	spin_lock(&ss->id_lock);
+	/* Don't use 0. allocates an ID of 1-65535 */
+	error = idr_get_new_above(&ss->idr, newid, 1, &myid);
+	spin_unlock(&ss->id_lock);
+
+	/* Returns error when there are no free spaces for new ID.*/
+	if (error) {
+		error = -ENOSPC;
+		goto err_out;
+	}
+	if (myid > CSS_ID_MAX) {
+		error = -ENOSPC;
+		spin_lock(&ss->id_lock);
+		idr_remove(&ss->idr, myid);
+		spin_unlock(&ss->id_lock);
+		goto err_out;
+	}
+
+	newid->id = myid;
+	newid->depth = depth;
+	return newid;
+err_out:
+	kfree(newid);
+	return ERR_PTR(error);
+
+}
+
+
+static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
+{
+	struct css_id *newid;
+	struct cgroup_subsys_state *rootcss;
+
+	spin_lock_init(&ss->id_lock);
+	idr_init(&ss->idr);
+
+	rootcss = init_css_set.subsys[ss->subsys_id];
+	newid = get_new_cssid(ss, 0);
+	if (IS_ERR(newid))
+		return PTR_ERR(newid);
+
+	newid->stack[0] = newid->id;
+	newid->css = rootcss;
+	rootcss->id = newid;
+	return 0;
+}
+
+static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
+			struct cgroup *child)
+{
+	int subsys_id, i, depth = 0;
+	struct cgroup_subsys_state *parent_css, *child_css;
+	struct css_id *child_id, *parent_id = NULL;
+
+	subsys_id = ss->subsys_id;
+	parent_css = parent->subsys[subsys_id];
+	child_css = child->subsys[subsys_id];
+	depth = css_depth(parent_css) + 1;
+	parent_id = parent_css->id;
+
+	child_id = get_new_cssid(ss, depth);
+	if (IS_ERR(child_id))
+		return PTR_ERR(child_id);
+
+	for (i = 0; i < depth; i++)
+		child_id->stack[i] = parent_id->stack[i];
+	child_id->stack[depth] = child_id->id;
+
+	rcu_assign_pointer(child_id->css, child_css);
+	rcu_assign_pointer(child_css->id, child_id);
+
+	return 0;
+}
+
+/**
+ * css_lookup - lookup css by id
+ * @ss: cgroup subsys to be looked into.
+ * @id: the id
+ *
+ * Returns pointer to cgroup_subsys_state if there is valid one with id.
+ * NULL if not.Should be called under rcu_read_lock()
+ */
+
+struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
+{
+	struct css_id *cssid = NULL;
+
+	BUG_ON(!ss->use_id);
+	cssid = idr_find(&ss->idr, id);
+
+	if (unlikely(!cssid))
+		return NULL;
+
+	return rcu_dereference(cssid->css);
+}
+
+/**
+ * css_get_next - lookup next cgroup under specified hierarchy.
+ * @ss: pointer to subsystem
+ * @id: current position of iteration.
+ * @root: pointer to css. search tree under this.
+ * @foundid: position of found object.
+ *
+ * Search next css under the specified hierarchy of rootid. Calling under
+ * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
+ */
+struct cgroup_subsys_state *
+css_get_next(struct cgroup_subsys *ss, int id,
+	     struct cgroup_subsys_state *root, int *foundid)
+{
+	struct cgroup_subsys_state *ret = NULL;
+	struct css_id *tmp;
+	int tmpid;
+	int rootid = css_id(root);
+	int depth = css_depth(root);
+
+	if (!rootid)
+		return NULL;
+
+	BUG_ON(!ss->use_id);
+	rcu_read_lock();
+	/* fill start point for scan */
+	tmpid = id;
+	while (1) {
+		/*
+		 * scan next entry from bitmap(tree), tmpid is updated after
+		 * idr_get_next().
+		 */
+		spin_lock(&ss->id_lock);
+		tmp = idr_get_next(&ss->idr, &tmpid);
+		spin_unlock(&ss->id_lock);
+
+		if (!tmp) {
+			ret = NULL;
+			break;
+		}
+		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
+			ret = rcu_dereference(tmp->css);
+			if (ret) {
+				*foundid = tmpid;
+				break;
+			}
+		}
+		/* continue to scan from next id */
+		tmpid = tmpid + 1;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
Index: mmotm-2.6.28-Jan7/include/linux/idr.h
===================================================================
--- mmotm-2.6.28-Jan7.orig/include/linux/idr.h
+++ mmotm-2.6.28-Jan7/include/linux/idr.h
@@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
 int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
 int idr_for_each(struct idr *idp,
 		 int (*fn)(int id, void *p, void *data), void *data);
+void *idr_get_next(struct idr *idp, int *nextid);
 void *idr_replace(struct idr *idp, void *ptr, int id);
 void idr_remove(struct idr *idp, int id);
 void idr_remove_all(struct idr *idp);
Index: mmotm-2.6.28-Jan7/lib/idr.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/lib/idr.c
+++ mmotm-2.6.28-Jan7/lib/idr.c
@@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
 EXPORT_SYMBOL(idr_for_each);
 
 /**
+ * idr_get_next - lookup next object of id to given id.
+ * @idp: idr handle
+ * @id:  pointer to lookup key
+ *
+ * Returns pointer to registered object with id, which is next number to
+ * given id.
+ */
+
+void *idr_get_next(struct idr *idp, int *nextidp)
+{
+	struct idr_layer *p, *pa[MAX_LEVEL];
+	struct idr_layer **paa = &pa[0];
+	int id = *nextidp;
+	int n, max;
+
+	/* find first ent */
+	n = idp->layers * IDR_BITS;
+	max = 1 << n;
+	p = rcu_dereference(idp->top);
+	if (!p)
+		return NULL;
+
+	while (id < max) {
+		while (n > 0 && p) {
+			n -= IDR_BITS;
+			*paa++ = p;
+			p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+		}
+
+		if (p) {
+			*nextidp = id;
+			return p;
+		}
+
+		id += 1 << n;
+		while (n < fls(id)) {
+			n += IDR_BITS;
+			p = *--paa;
+		}
+	}
+	return NULL;
+}
+
+
+
+/**
  * idr_replace - replace pointer for given id
  * @idp: idr handle
  * @ptr: pointer you want associated with the id

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
@ 2009-01-09  3:59   ` Li Zefan
  2009-01-09  4:24     ` KAMEZAWA Hiroyuki
  2009-01-10  0:23   ` Paul Menage
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-01-09  3:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	menage@google.com

>  static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
>  {
>  	struct inode *inode = new_inode(sb);
> @@ -2335,6 +2339,7 @@ static void init_cgroup_css(struct cgrou
>  	css->cgroup = cgrp;
>  	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
> +	css->id = NULL;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
>  	BUG_ON(cgrp->subsys[ss->subsys_id]);
> @@ -2410,6 +2415,10 @@ static long cgroup_create(struct cgroup 
>  			goto err_destroy;
>  		}
>  		init_cgroup_css(css, ss, cgrp);
> +		if (ss->use_id)
> +			if (alloc_css_id(ss, parent, cgrp))
> +				goto err_destroy;
> +		/* At error, ->destroy() callback has to free assigned ID. */

A bug here:

if alloc_css_id(ss, parent, cgrp) failed, then ss->destroy() called free_css_id(),
then panic.

maybe check if (css->id == NULL) in free_css_id() ?

>  	}
>  
>  	cgroup_lock_hierarchy(root);
> @@ -2699,6 +2708,8 @@ int __init cgroup_init(void)
>  		struct cgroup_subsys *ss = subsys[i];
>  		if (!ss->early_init)
>  			cgroup_init_subsys(ss);
> +		if (ss->use_id)
> +			cgroup_subsys_init_idr(ss);
>  	}
>  
>  	/* Add init_css_set to the hash table */
> @@ -3231,3 +3242,260 @@ static int __init cgroup_disable(char *s
>  	return 1;
>  }
>  __setup("cgroup_disable=", cgroup_disable);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-09  3:59   ` Li Zefan
@ 2009-01-09  4:24     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-09  4:24 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	menage@google.com

On Fri, 09 Jan 2009 11:59:05 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> >  static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
> >  {
> >  	struct inode *inode = new_inode(sb);
> > @@ -2335,6 +2339,7 @@ static void init_cgroup_css(struct cgrou
> >  	css->cgroup = cgrp;
> >  	atomic_set(&css->refcnt, 1);
> >  	css->flags = 0;
> > +	css->id = NULL;
> >  	if (cgrp == dummytop)
> >  		set_bit(CSS_ROOT, &css->flags);
> >  	BUG_ON(cgrp->subsys[ss->subsys_id]);
> > @@ -2410,6 +2415,10 @@ static long cgroup_create(struct cgroup 
> >  			goto err_destroy;
> >  		}
> >  		init_cgroup_css(css, ss, cgrp);
> > +		if (ss->use_id)
> > +			if (alloc_css_id(ss, parent, cgrp))
> > +				goto err_destroy;
> > +		/* At error, ->destroy() callback has to free assigned ID. */
> 
> A bug here:
> 
> if alloc_css_id(ss, parent, cgrp) failed, then ss->destroy() called free_css_id(),
> then panic.
> 
> maybe check if (css->id == NULL) in free_css_id() ?
> 
Oh, thanks. it will be necessary.
I'll fix it.

But..Hmm...maybe it's useful to add fault injection feature to debug cgroup
for this kind of loop operation.

-Kame

> >  	}
> >  
> >  	cgroup_lock_hierarchy(root);
> > @@ -2699,6 +2708,8 @@ int __init cgroup_init(void)
> >  		struct cgroup_subsys *ss = subsys[i];
> >  		if (!ss->early_init)
> >  			cgroup_init_subsys(ss);
> > +		if (ss->use_id)
> > +			cgroup_subsys_init_idr(ss);
> >  	}
> >  
> >  	/* Add init_css_set to the hash table */
> > @@ -3231,3 +3242,260 @@ static int __init cgroup_disable(char *s
> >  	return 1;
> >  }
> >  __setup("cgroup_disable=", cgroup_disable);
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
  2009-01-09  3:59   ` Li Zefan
@ 2009-01-10  0:23   ` Paul Menage
  2009-01-10  0:49     ` KAMEZAWA Hiroyuki
  2009-01-12  7:21   ` Balbir Singh
  2009-01-13  7:40   ` Li Zefan
  3 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2009-01-10  0:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Thu, Jan 8, 2009 at 1:28 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> + *
> + * Looking up and scanning function should be called under rcu_read_lock().
> + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for all calls.

Can you clarify here - do you mean "not necessary for any calls"
(calls to what?) or "not necessary for some calls"? I presume the
former.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS  ID)
  2009-01-10  0:23   ` Paul Menage
@ 2009-01-10  0:49     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-10  0:49 UTC (permalink / raw)
  To: Paul Menage
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, balbir@linux.vnet.ibm.com,
	nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com

Paul Menage さんは書きました：
> On Thu, Jan 8, 2009 at 1:28 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> + *
>> + * Looking up and scanning function should be called under
>> rcu_read_lock().
>> + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for all
>> calls.
>
> Can you clarify here - do you mean "not necessary for any calls"
> (calls to what?) or "not necessary for some calls"? I presume the
> former.
>
Ah, sorry bad text.

not necessary for any calls related to css_id.

-Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
  2009-01-09  3:59   ` Li Zefan
  2009-01-10  0:23   ` Paul Menage
@ 2009-01-12  7:21   ` Balbir Singh
  2009-01-15  6:12     ` KAMEZAWA Hiroyuki
  2009-01-13  7:40   ` Li Zefan
  3 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2009-01-12  7:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com,
	menage@google.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-01-08 18:28:17]:

> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
> 
> This patch attaches unique ID to each css and provides following.
> 
>  - css_lookup(subsys, id)
>    returns pointer to struct cgroup_subysys_state of id.
>  - css_get_next(subsys, id, rootid, depth, foundid)
>    returns the next css under "root" by scanning
> 
> When cgrou_subsys->use_id is set, an id for css is maintained.
> The cgroup framework only parepares
> 	- css_id of root css for subsys
> 	- id is automatically attached at creation of css.
> 	- id is *not* freed automatically. Because the cgroup framework
> 	  don't know lifetime of cgroup_subsys_state.
> 	  free_css_id() function is provided. This must be called by subsys.
> 
> There are several reasons to develop this.
> 	- Saving space .... For example, memcg's swap_cgroup is array of
> 	  pointers to cgroup. But it is not necessary to be very fast.
> 	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
> 	  reduce much amount of memory usage.

2 bytes per entry means that we restrict the entries to 2^16-1 for
number of cgroups, using a pointer introduces no such restriction.
2^16-1 seems a reasonable number for now.

> 
> 	- Scanning without lock.
> 	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
> 	  css under root can be written without locks.
> 	  ex)
> 	  do {
> 		rcu_read_lock();
> 		next = cgroup_get_next(subsys, id, root, &found);
> 		/* check sanity of next here */
> 		css_tryget();
> 		rcu_read_unlock();
> 		id = found + 1
> 	 } while(...)
> 
> Characteristics: 
> 	- Each css has unique ID under subsys.
> 	- Lifetime of ID is controlled by subsys.
> 	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
> 	- Allowed ID is 1-65535, ID 0 is UNUSED ID.
> 
> Design Choices:
> 	- scan-by-ID v.s. scan-by-tree-walk.
> 	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
> 	  by following kind of routine.
> 	  scan -> rest a while(release a lock) -> conitunue from interrupted
> 	  memcg's hierarchical reclaim does this.
> 
> 	- When subsys->use_id is set, # of css in the system is limited to
> 	  65535. 
> 
> Changelog: (v6) -> (v7)
> 	- refcnt for CSS ID is removed. Subsys can do it by own logic.
> 	- New id allocation is done automatically.
> 	- fixed typos.
> 	- fixed limit check of ID.
> 
> Changelog: (v5) -> (v6)
>  	- max depth is removed.
> 	- changed arguments to "scan"
> Changelog: (v4) -> (v5)
> 	- Totally re-designed as per-css ID.
> Changelog:(v3) -> (v4)
> 	- updated comments.
> 	- renamed hierarchy_code[] to stack[]
> 	- merged prepare_id routines.
> 
> Changelog (v2) -> (v3)
> 	- removed cgroup_id_getref().
> 	- added cgroup_id_tryget().
> 
> Changelog (v1) -> (v2):
> 	- Design change: show only ID(integer) to outside of cgroup.c
> 	- moved cgroup ID definition from include/ to kernel/cgroup.c
> 	- struct cgroup_id is freed by RCU.
> 	- changed interface from pointer to "int"
> 	- kill_sb() is handled. 
> 	- ID 0 as unused ID.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |   44 +++++++
>  include/linux/idr.h    |    1 
>  kernel/cgroup.c        |  270 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  lib/idr.c              |   46 ++++++++
>  4 files changed, 360 insertions(+), 1 deletion(-)
> 
> Index: mmotm-2.6.28-Jan7/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/include/linux/cgroup.h
> +++ mmotm-2.6.28-Jan7/include/linux/cgroup.h
> @@ -15,6 +15,7 @@
>  #include <linux/cgroupstats.h>
>  #include <linux/prio_heap.h>
>  #include <linux/rwsem.h>
> +#include <linux/idr.h>
> 
>  #ifdef CONFIG_CGROUPS
> 
> @@ -22,6 +23,7 @@ struct cgroupfs_root;
>  struct cgroup_subsys;
>  struct inode;
>  struct cgroup;
> +struct css_id;
> 
>  extern int cgroup_init_early(void);
>  extern int cgroup_init(void);
> @@ -59,6 +61,8 @@ struct cgroup_subsys_state {
>  	atomic_t refcnt;
> 
>  	unsigned long flags;
> +	/* ID for this css, if possible */
> +	struct css_id *id;
>  };
> 
>  /* bits in struct cgroup_subsys_state flags field */
> @@ -363,6 +367,11 @@ struct cgroup_subsys {
>  	int active;
>  	int disabled;
>  	int early_init;
> +	/*
> +	 * True if this subsys uses ID. ID is not available before cgroup_init()
> +	 * (not available in early_init time.)
                                            ^ period should come later
> +	 */
> +	bool use_id;
>  #define MAX_CGROUP_TYPE_NAMELEN 32
>  	const char *name;
> 
> @@ -384,6 +393,9 @@ struct cgroup_subsys {
>  	 */
>  	struct cgroupfs_root *root;
>  	struct list_head sibling;
> +	/* used when use_id == true */
> +	struct idr idr;
> +	spinlock_t id_lock;
>  };
> 
>  #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
> @@ -437,6 +449,38 @@ void cgroup_iter_end(struct cgroup *cgrp
>  int cgroup_scan_tasks(struct cgroup_scanner *scan);
>  int cgroup_attach_task(struct cgroup *, struct task_struct *);
> 
> +/*
> + * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
> + * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
> + * CSS ID is assigned at cgroup allocation (create) automatically
> + * and removed when subsys calls free_css_id() function. This is because
> + * the lifetime of cgroup_subsys_state is  subsys's matter.
> + *
> + * Looking up and scanning function should be called under rcu_read_lock().
> + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for all calls.
> + */
> +
> +/* Typically Called at ->destroy(), or somewhere the subsys frees
> +  cgroup_subsys_state. */
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
> +
> +/* Find a cgroup_subsys_state which has given ID */
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
> +/*
> + * Get a cgroup whose id is greater than or equal to id under tree of root.
> + * Returning a cgroup_subsys_state or NULL.
> + */
> +struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
> +		struct cgroup_subsys_state *root, int *foundid);
> +
> +/* Returns true if root is ancestor of cg */
> +bool css_is_ancestor(struct cgroup_subsys_state *cg,
> +		     struct cgroup_subsys_state *root);
> +
> +/* Get id and depth of css */
> +unsigned short css_id(struct cgroup_subsys_state *css);
> +unsigned short css_depth(struct cgroup_subsys_state *css);
> +
>  #else /* !CONFIG_CGROUPS */
> 
>  static inline int cgroup_init_early(void) { return 0; }
> Index: mmotm-2.6.28-Jan7/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/kernel/cgroup.c
> +++ mmotm-2.6.28-Jan7/kernel/cgroup.c
> @@ -46,7 +46,6 @@
>  #include <linux/cgroupstats.h>
>  #include <linux/hash.h>
>  #include <linux/namei.h>
> -
>  #include <asm/atomic.h>
> 
>  static DEFINE_MUTEX(cgroup_mutex);
> @@ -185,6 +184,8 @@ struct cg_cgroup_link {
>  static struct css_set init_css_set;
>  static struct cg_cgroup_link init_css_set_link;
> 
> +static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
> +
>  /* css_set_lock protects the list of css_set objects, and the
>   * chain of tasks off each css_set.  Nests outside task->alloc_lock
>   * due to cgroup_iter_start() */
> @@ -567,6 +568,9 @@ static struct backing_dev_info cgroup_ba
>  	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
>  };
> 
> +static int alloc_css_id(struct cgroup_subsys *ss,
> +			struct cgroup *parent, struct cgroup *child);
> +
>  static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
>  {
>  	struct inode *inode = new_inode(sb);
> @@ -2335,6 +2339,7 @@ static void init_cgroup_css(struct cgrou
>  	css->cgroup = cgrp;
>  	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
> +	css->id = NULL;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
>  	BUG_ON(cgrp->subsys[ss->subsys_id]);
> @@ -2410,6 +2415,10 @@ static long cgroup_create(struct cgroup 
>  			goto err_destroy;
>  		}
>  		init_cgroup_css(css, ss, cgrp);
> +		if (ss->use_id)
> +			if (alloc_css_id(ss, parent, cgrp))
> +				goto err_destroy;
> +		/* At error, ->destroy() callback has to free assigned ID. */
>  	}
> 
>  	cgroup_lock_hierarchy(root);
> @@ -2699,6 +2708,8 @@ int __init cgroup_init(void)
>  		struct cgroup_subsys *ss = subsys[i];
>  		if (!ss->early_init)
>  			cgroup_init_subsys(ss);
> +		if (ss->use_id)
> +			cgroup_subsys_init_idr(ss);
>  	}
> 
>  	/* Add init_css_set to the hash table */
> @@ -3231,3 +3242,260 @@ static int __init cgroup_disable(char *s
>  	return 1;
>  }
>  __setup("cgroup_disable=", cgroup_disable);
> +
> +/*
> + * CSS ID -- ID per Subsys's Cgroup Subsys State.
> + */
> +struct css_id {
> +	/*
> +	 * The cgroup to which this ID points. If cgroup is removed, this will
> +	 * be NULL. This pointer is expected to be RCU-safe because destroy()
> +	 * is called after synchronize_rcu(). But for safe use, css_is_removed()
> +	 * css_tryget() should be used for avoiding race.
> +	 */
> +	struct cgroup_subsys_state *css;
> +	/*
> +	 * ID of this css.
> +	 */
> +	unsigned short  id;
> +	/*
> +	 * Depth in hierarchy which this ID belongs to.
> +	 */
> +	unsigned short depth;
> +	/*
> +	 * ID is freed by RCU. (and lookup routine is RCU safe.)
> +	 */
> +	struct rcu_head rcu_head;
> +	/*
> +	 * Hierarchy of CSS ID belongs to.
> +	 */
> +	unsigned short  stack[0]; /* Array of Length (depth+1) */

By maintaining the path up to the root here, is that how we avoid walking
through cgroups links?

> +};
> +#define CSS_ID_MAX	(65535)
> +
> +/*
> + * To get ID other than 0, this should be called when !cgroup_is_removed().
> + */
> +unsigned short css_id(struct cgroup_subsys_state *css)
> +{
> +	struct css_id *cssid = rcu_dereference(css->id);
> +
> +	if (cssid)
> +		return cssid->id;
> +	return 0;
> +}
> +
> +unsigned short css_depth(struct cgroup_subsys_state *css)
> +{
> +	struct css_id *cssid = rcu_dereference(css->id);
> +
> +	if (cssid)
> +		return cssid->depth;
> +	return 0;
> +}
> +
> +bool css_is_ancestor(struct cgroup_subsys_state *child,
> +		    struct cgroup_subsys_state *root)
> +{
> +	struct css_id *child_id = rcu_dereference(child->id);
> +	struct css_id *root_id = rcu_dereference(root->id);
> +
> +	if (!child_id || !root_id || (child_id->depth < root_id->depth))
> +		return false;
> +	return child_id->stack[root_id->depth] == root_id->id;
> +}
> +
> +static void __free_css_id_cb(struct rcu_head *head)
> +{
> +	struct css_id *id;
> +
> +	id = container_of(head, struct css_id, rcu_head);
> +	kfree(id);
> +}
> +
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
> +{
> +	struct css_id *id = css->id;
> +
> +	BUG_ON(!ss->use_id);
> +
> +	rcu_assign_pointer(id->css, NULL);
> +	rcu_assign_pointer(css->id, NULL);
> +	spin_lock(&ss->id_lock);
> +	idr_remove(&ss->idr, id->id);
> +	spin_unlock(&ss->id_lock);
> +	call_rcu(&id->rcu_head, __free_css_id_cb);
> +}
> +
> +/*
> + * This is called by init or create(). Then, calls to this function are
> + * always serialized (By cgroup_mutex() at create()).
> + */
> +
> +static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> +{
> +	struct css_id *newid;
> +	int myid, error, size;
> +
> +	BUG_ON(!ss->use_id);
> +
> +	size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
> +	newid = kzalloc(size, GFP_KERNEL);
> +	if (!newid)
> +		return ERR_PTR(-ENOMEM);
> +	/* get id */
> +	if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
> +		error = -ENOMEM;
> +		goto err_out;
> +	}
> +	spin_lock(&ss->id_lock);
> +	/* Don't use 0. allocates an ID of 1-65535 */
> +	error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> +	spin_unlock(&ss->id_lock);
> +
> +	/* Returns error when there are no free spaces for new ID.*/
> +	if (error) {
> +		error = -ENOSPC;
> +		goto err_out;
> +	}
> +	if (myid > CSS_ID_MAX) {
> +		error = -ENOSPC;
> +		spin_lock(&ss->id_lock);
> +		idr_remove(&ss->idr, myid);
> +		spin_unlock(&ss->id_lock);
> +		goto err_out;
> +	}
> +
> +	newid->id = myid;
> +	newid->depth = depth;
> +	return newid;
> +err_out:
> +	kfree(newid);
> +	return ERR_PTR(error);
> +
> +}
> +
> +
> +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
> +{
> +	struct css_id *newid;
> +	struct cgroup_subsys_state *rootcss;
> +
> +	spin_lock_init(&ss->id_lock);
> +	idr_init(&ss->idr);
> +
> +	rootcss = init_css_set.subsys[ss->subsys_id];
> +	newid = get_new_cssid(ss, 0);
> +	if (IS_ERR(newid))
> +		return PTR_ERR(newid);
> +
> +	newid->stack[0] = newid->id;
> +	newid->css = rootcss;
> +	rootcss->id = newid;
> +	return 0;
> +}
> +
> +static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
> +			struct cgroup *child)
> +{
> +	int subsys_id, i, depth = 0;
> +	struct cgroup_subsys_state *parent_css, *child_css;
> +	struct css_id *child_id, *parent_id = NULL;
> +
> +	subsys_id = ss->subsys_id;
> +	parent_css = parent->subsys[subsys_id];
> +	child_css = child->subsys[subsys_id];
> +	depth = css_depth(parent_css) + 1;
> +	parent_id = parent_css->id;
> +
> +	child_id = get_new_cssid(ss, depth);
> +	if (IS_ERR(child_id))
> +		return PTR_ERR(child_id);
> +
> +	for (i = 0; i < depth; i++)
> +		child_id->stack[i] = parent_id->stack[i];
> +	child_id->stack[depth] = child_id->id;
> +
> +	rcu_assign_pointer(child_id->css, child_css);
> +	rcu_assign_pointer(child_css->id, child_id);
> +
> +	return 0;
> +}
> +
> +/**
> + * css_lookup - lookup css by id
> + * @ss: cgroup subsys to be looked into.
> + * @id: the id
> + *
> + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> + * NULL if not.Should be called under rcu_read_lock()
> + */
> +
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> +{
> +	struct css_id *cssid = NULL;
> +
> +	BUG_ON(!ss->use_id);
> +	cssid = idr_find(&ss->idr, id);
> +
> +	if (unlikely(!cssid))
> +		return NULL;
> +
> +	return rcu_dereference(cssid->css);
> +}
> +
> +/**
> + * css_get_next - lookup next cgroup under specified hierarchy.
> + * @ss: pointer to subsystem
> + * @id: current position of iteration.
> + * @root: pointer to css. search tree under this.
> + * @foundid: position of found object.
> + *
> + * Search next css under the specified hierarchy of rootid. Calling under
> + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> + */
> +struct cgroup_subsys_state *
> +css_get_next(struct cgroup_subsys *ss, int id,
> +	     struct cgroup_subsys_state *root, int *foundid)
> +{
> +	struct cgroup_subsys_state *ret = NULL;
> +	struct css_id *tmp;
> +	int tmpid;
> +	int rootid = css_id(root);
> +	int depth = css_depth(root);
> +
> +	if (!rootid)
> +		return NULL;
> +
> +	BUG_ON(!ss->use_id);
> +	rcu_read_lock();
> +	/* fill start point for scan */
> +	tmpid = id;
> +	while (1) {
> +		/*
> +		 * scan next entry from bitmap(tree), tmpid is updated after
> +		 * idr_get_next().
> +		 */
> +		spin_lock(&ss->id_lock);
> +		tmp = idr_get_next(&ss->idr, &tmpid);
> +		spin_unlock(&ss->id_lock);
> +
> +		if (!tmp) {
> +			ret = NULL;
> +			break;
> +		}
> +		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> +			ret = rcu_dereference(tmp->css);
> +			if (ret) {
> +				*foundid = tmpid;
> +				break;
> +			}
> +		}
> +		/* continue to scan from next id */
> +		tmpid = tmpid + 1;
> +	}
> +
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> Index: mmotm-2.6.28-Jan7/include/linux/idr.h
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/include/linux/idr.h
> +++ mmotm-2.6.28-Jan7/include/linux/idr.h
> @@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
>  int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
>  int idr_for_each(struct idr *idp,
>  		 int (*fn)(int id, void *p, void *data), void *data);
> +void *idr_get_next(struct idr *idp, int *nextid);
>  void *idr_replace(struct idr *idp, void *ptr, int id);
>  void idr_remove(struct idr *idp, int id);
>  void idr_remove_all(struct idr *idp);
> Index: mmotm-2.6.28-Jan7/lib/idr.c
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/lib/idr.c
> +++ mmotm-2.6.28-Jan7/lib/idr.c
> @@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
>  EXPORT_SYMBOL(idr_for_each);
> 
>  /**
> + * idr_get_next - lookup next object of id to given id.
> + * @idp: idr handle
> + * @id:  pointer to lookup key
> + *
> + * Returns pointer to registered object with id, which is next number to
> + * given id.
> + */
> +
> +void *idr_get_next(struct idr *idp, int *nextidp)
> +{
> +	struct idr_layer *p, *pa[MAX_LEVEL];
> +	struct idr_layer **paa = &pa[0];
> +	int id = *nextidp;
> +	int n, max;
> +
> +	/* find first ent */
> +	n = idp->layers * IDR_BITS;
> +	max = 1 << n;
> +	p = rcu_dereference(idp->top);
> +	if (!p)
> +		return NULL;
> +
> +	while (id < max) {
> +		while (n > 0 && p) {
> +			n -= IDR_BITS;
> +			*paa++ = p;
> +			p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> +		}
> +
> +		if (p) {
> +			*nextidp = id;
> +			return p;
> +		}
> +
> +		id += 1 << n;
> +		while (n < fls(id)) {
> +			n += IDR_BITS;
> +			p = *--paa;
> +		}
> +	}
> +	return NULL;
> +}
> +
> +
> +
> +/**
>   * idr_replace - replace pointer for given id
>   * @idp: idr handle
>   * @ptr: pointer you want associated with the id
>

Overall, I've taken a quick look and the patches seem OK.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-12  7:21   ` Balbir Singh
@ 2009-01-15  6:12     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-15  6:12 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com,
	menage@google.com

Sorry for delayed reply.


On Mon, 12 Jan 2009 12:51:48 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-01-08 18:28:17]:
> > There are several reasons to develop this.
> > 	- Saving space .... For example, memcg's swap_cgroup is array of
> > 	  pointers to cgroup. But it is not necessary to be very fast.
> > 	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
> > 	  reduce much amount of memory usage.
> 
> 2 bytes per entry means that we restrict the entries to 2^16-1 for
> number of cgroups, using a pointer introduces no such restriction.
> 2^16-1 seems a reasonable number for now.
> 
yes.


> >  /* bits in struct cgroup_subsys_state flags field */
> > @@ -363,6 +367,11 @@ struct cgroup_subsys {
> >  	int active;
> >  	int disabled;
> >  	int early_init;
> > +	/*
> > +	 * True if this subsys uses ID. ID is not available before cgroup_init()
> > +	 * (not available in early_init time.)
>                                             ^ period should come later
sure.

;
> > +	/*
> > +	 * Hierarchy of CSS ID belongs to.
> > +	 */
> > +	unsigned short  stack[0]; /* Array of Length (depth+1) */
> 
> By maintaining the path up to the root here, is that how we avoid walking
> through cgroups links?
> 
yes. we can check this css is under hierarchy by

   css->stack[root->depth] == root->id.


> > +/**
> >   * idr_replace - replace pointer for given id
> >   * @idp: idr handle
> >   * @ptr: pointer you want associated with the id
> >
> 
> Overall, I've taken a quick look and the patches seem OK.
> 
thx,

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
                     ` (2 preceding siblings ...)
  2009-01-12  7:21   ` Balbir Singh
@ 2009-01-13  7:40   ` Li Zefan
  2009-01-13  9:22     ` KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-01-13  7:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	menage@google.com

KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
> 

Except the one in cgroup_create(), I don't find any other bugs.

Some minor comments below. :)

> This patch attaches unique ID to each css and provides following.
> 
>  - css_lookup(subsys, id)
>    returns pointer to struct cgroup_subysys_state of id.
>  - css_get_next(subsys, id, rootid, depth, foundid)
>    returns the next css under "root" by scanning
> 
> When cgrou_subsys->use_id is set, an id for css is maintained.
> The cgroup framework only parepares
> 	- css_id of root css for subsys
> 	- id is automatically attached at creation of css.
> 	- id is *not* freed automatically. Because the cgroup framework
> 	  don't know lifetime of cgroup_subsys_state.
> 	  free_css_id() function is provided. This must be called by subsys.
> 
> There are several reasons to develop this.
> 	- Saving space .... For example, memcg's swap_cgroup is array of
> 	  pointers to cgroup. But it is not necessary to be very fast.
> 	  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
> 	  reduce much amount of memory usage.
> 
> 	- Scanning without lock.
> 	  CSS_ID provides "scan id under this ROOT" function. By this, scanning
> 	  css under root can be written without locks.
> 	  ex)
> 	  do {
> 		rcu_read_lock();
> 		next = cgroup_get_next(subsys, id, root, &found);
> 		/* check sanity of next here */
> 		css_tryget();
> 		rcu_read_unlock();
> 		id = found + 1
> 	 } while(...)
> 
> Characteristics: 
> 	- Each css has unique ID under subsys.
> 	- Lifetime of ID is controlled by subsys.
> 	- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
> 	- Allowed ID is 1-65535, ID 0 is UNUSED ID.
> 
> Design Choices:
> 	- scan-by-ID v.s. scan-by-tree-walk.
> 	  As /proc's pid scan does, scan-by-ID is robust when scanning is done
> 	  by following kind of routine.
> 	  scan -> rest a while(release a lock) -> conitunue from interrupted
> 	  memcg's hierarchical reclaim does this.
> 
> 	- When subsys->use_id is set, # of css in the system is limited to
> 	  65535. 
> 
> Changelog: (v6) -> (v7)
> 	- refcnt for CSS ID is removed. Subsys can do it by own logic.
> 	- New id allocation is done automatically.
> 	- fixed typos.
> 	- fixed limit check of ID.
> 
> Changelog: (v5) -> (v6)
>  	- max depth is removed.
> 	- changed arguments to "scan"
> Changelog: (v4) -> (v5)
> 	- Totally re-designed as per-css ID.
> Changelog:(v3) -> (v4)
> 	- updated comments.
> 	- renamed hierarchy_code[] to stack[]
> 	- merged prepare_id routines.
> 
> Changelog (v2) -> (v3)
> 	- removed cgroup_id_getref().
> 	- added cgroup_id_tryget().
> 
> Changelog (v1) -> (v2):
> 	- Design change: show only ID(integer) to outside of cgroup.c
> 	- moved cgroup ID definition from include/ to kernel/cgroup.c
> 	- struct cgroup_id is freed by RCU.
> 	- changed interface from pointer to "int"
> 	- kill_sb() is handled. 
> 	- ID 0 as unused ID.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/cgroup.h |   44 +++++++
>  include/linux/idr.h    |    1 
>  kernel/cgroup.c        |  270 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  lib/idr.c              |   46 ++++++++
>  4 files changed, 360 insertions(+), 1 deletion(-)
> 
> Index: mmotm-2.6.28-Jan7/include/linux/cgroup.h
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/include/linux/cgroup.h
> +++ mmotm-2.6.28-Jan7/include/linux/cgroup.h
> @@ -15,6 +15,7 @@
>  #include <linux/cgroupstats.h>
>  #include <linux/prio_heap.h>
>  #include <linux/rwsem.h>
> +#include <linux/idr.h>
>  
>  #ifdef CONFIG_CGROUPS
>  
> @@ -22,6 +23,7 @@ struct cgroupfs_root;
>  struct cgroup_subsys;
>  struct inode;
>  struct cgroup;
> +struct css_id;
>  
>  extern int cgroup_init_early(void);
>  extern int cgroup_init(void);
> @@ -59,6 +61,8 @@ struct cgroup_subsys_state {
>  	atomic_t refcnt;
>  
>  	unsigned long flags;
> +	/* ID for this css, if possible */
> +	struct css_id *id;
>  };
>  
>  /* bits in struct cgroup_subsys_state flags field */
> @@ -363,6 +367,11 @@ struct cgroup_subsys {
>  	int active;
>  	int disabled;
>  	int early_init;
> +	/*
> +	 * True if this subsys uses ID. ID is not available before cgroup_init()
> +	 * (not available in early_init time.)
> +	 */
> +	bool use_id;
>  #define MAX_CGROUP_TYPE_NAMELEN 32
>  	const char *name;
>  
> @@ -384,6 +393,9 @@ struct cgroup_subsys {
>  	 */
>  	struct cgroupfs_root *root;
>  	struct list_head sibling;
> +	/* used when use_id == true */
> +	struct idr idr;
> +	spinlock_t id_lock;
>  };
>  
>  #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
> @@ -437,6 +449,38 @@ void cgroup_iter_end(struct cgroup *cgrp
>  int cgroup_scan_tasks(struct cgroup_scanner *scan);
>  int cgroup_attach_task(struct cgroup *, struct task_struct *);
>  
> +/*
> + * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
> + * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
> + * CSS ID is assigned at cgroup allocation (create) automatically
> + * and removed when subsys calls free_css_id() function. This is because
> + * the lifetime of cgroup_subsys_state is  subsys's matter.

                                   2 spaces ^^

> + *
> + * Looking up and scanning function should be called under rcu_read_lock().
> + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for all calls.
> + */
> +
> +/* Typically Called at ->destroy(), or somewhere the subsys frees
> +  cgroup_subsys_state. */

use
/*
 * xxx
 */
for consistentcy.

> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
> +
> +/* Find a cgroup_subsys_state which has given ID */
> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);

add a blank line here

> +/*
> + * Get a cgroup whose id is greater than or equal to id under tree of root.

s/a cgroup/a css/

> + * Returning a cgroup_subsys_state or NULL.
> + */
> +struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
> +		struct cgroup_subsys_state *root, int *foundid);
> +
> +/* Returns true if root is ancestor of cg */
> +bool css_is_ancestor(struct cgroup_subsys_state *cg,
> +		     struct cgroup_subsys_state *root);
> +
> +/* Get id and depth of css */
> +unsigned short css_id(struct cgroup_subsys_state *css);
> +unsigned short css_depth(struct cgroup_subsys_state *css);
> +
>  #else /* !CONFIG_CGROUPS */
>  
>  static inline int cgroup_init_early(void) { return 0; }
> Index: mmotm-2.6.28-Jan7/kernel/cgroup.c
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/kernel/cgroup.c
> +++ mmotm-2.6.28-Jan7/kernel/cgroup.c
> @@ -46,7 +46,6 @@
>  #include <linux/cgroupstats.h>
>  #include <linux/hash.h>
>  #include <linux/namei.h>
> -

it's common to add a blank line between <linux/*> and <asm/*>

>  #include <asm/atomic.h>
>  
>  static DEFINE_MUTEX(cgroup_mutex);
> @@ -185,6 +184,8 @@ struct cg_cgroup_link {
>  static struct css_set init_css_set;
>  static struct cg_cgroup_link init_css_set_link;
>  
> +static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
> +
>  /* css_set_lock protects the list of css_set objects, and the
>   * chain of tasks off each css_set.  Nests outside task->alloc_lock
>   * due to cgroup_iter_start() */
> @@ -567,6 +568,9 @@ static struct backing_dev_info cgroup_ba
>  	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
>  };
>  
> +static int alloc_css_id(struct cgroup_subsys *ss,
> +			struct cgroup *parent, struct cgroup *child);
> +
>  static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
>  {
>  	struct inode *inode = new_inode(sb);
> @@ -2335,6 +2339,7 @@ static void init_cgroup_css(struct cgrou
>  	css->cgroup = cgrp;
>  	atomic_set(&css->refcnt, 1);
>  	css->flags = 0;
> +	css->id = NULL;
>  	if (cgrp == dummytop)
>  		set_bit(CSS_ROOT, &css->flags);
>  	BUG_ON(cgrp->subsys[ss->subsys_id]);
> @@ -2410,6 +2415,10 @@ static long cgroup_create(struct cgroup 
>  			goto err_destroy;
>  		}
>  		init_cgroup_css(css, ss, cgrp);
> +		if (ss->use_id)
> +			if (alloc_css_id(ss, parent, cgrp))
> +				goto err_destroy;
> +		/* At error, ->destroy() callback has to free assigned ID. */
>  	}
>  
>  	cgroup_lock_hierarchy(root);
> @@ -2699,6 +2708,8 @@ int __init cgroup_init(void)
>  		struct cgroup_subsys *ss = subsys[i];
>  		if (!ss->early_init)
>  			cgroup_init_subsys(ss);
> +		if (ss->use_id)
> +			cgroup_subsys_init_idr(ss);
>  	}
>  
>  	/* Add init_css_set to the hash table */
> @@ -3231,3 +3242,260 @@ static int __init cgroup_disable(char *s
>  	return 1;
>  }
>  __setup("cgroup_disable=", cgroup_disable);
> +
> +/*
> + * CSS ID -- ID per Subsys's Cgroup Subsys State.
> + */
> +struct css_id {
> +	/*
> +	 * The cgroup to which this ID points. If cgroup is removed, this will

s/The cgroup/css/ ?

> +	 * be NULL. This pointer is expected to be RCU-safe because destroy()
> +	 * is called after synchronize_rcu(). But for safe use, css_is_removed()
> +	 * css_tryget() should be used for avoiding race.
> +	 */
> +	struct cgroup_subsys_state *css;
> +	/*
> +	 * ID of this css.
> +	 */
> +	unsigned short  id;

                      ^^

> +	/*
> +	 * Depth in hierarchy which this ID belongs to.
> +	 */
> +	unsigned short depth;
> +	/*
> +	 * ID is freed by RCU. (and lookup routine is RCU safe.)
> +	 */
> +	struct rcu_head rcu_head;
> +	/*
> +	 * Hierarchy of CSS ID belongs to.
> +	 */
> +	unsigned short  stack[0]; /* Array of Length (depth+1) */

                      ^^

> +};
> +#define CSS_ID_MAX	(65535)
> +
> +/*
> + * To get ID other than 0, this should be called when !cgroup_is_removed().
> + */
> +unsigned short css_id(struct cgroup_subsys_state *css)
> +{
> +	struct css_id *cssid = rcu_dereference(css->id);
> +
> +	if (cssid)
> +		return cssid->id;
> +	return 0;
> +}
> +
> +unsigned short css_depth(struct cgroup_subsys_state *css)
> +{
> +	struct css_id *cssid = rcu_dereference(css->id);
> +
> +	if (cssid)
> +		return cssid->depth;
> +	return 0;
> +}
> +
> +bool css_is_ancestor(struct cgroup_subsys_state *child,
> +		    struct cgroup_subsys_state *root)
> +{
> +	struct css_id *child_id = rcu_dereference(child->id);
> +	struct css_id *root_id = rcu_dereference(root->id);
> +
> +	if (!child_id || !root_id || (child_id->depth < root_id->depth))
> +		return false;
> +	return child_id->stack[root_id->depth] == root_id->id;
> +}
> +
> +static void __free_css_id_cb(struct rcu_head *head)
> +{
> +	struct css_id *id;
> +
> +	id = container_of(head, struct css_id, rcu_head);
> +	kfree(id);
> +}
> +
> +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
> +{
> +	struct css_id *id = css->id;
> +
> +	BUG_ON(!ss->use_id);
> +
> +	rcu_assign_pointer(id->css, NULL);
> +	rcu_assign_pointer(css->id, NULL);
> +	spin_lock(&ss->id_lock);
> +	idr_remove(&ss->idr, id->id);
> +	spin_unlock(&ss->id_lock);
> +	call_rcu(&id->rcu_head, __free_css_id_cb);
> +}
> +
> +/*
> + * This is called by init or create(). Then, calls to this function are
> + * always serialized (By cgroup_mutex() at create()).
> + */
> +
> +static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> +{
> +	struct css_id *newid;
> +	int myid, error, size;
> +
> +	BUG_ON(!ss->use_id);
> +
> +	size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
> +	newid = kzalloc(size, GFP_KERNEL);
> +	if (!newid)
> +		return ERR_PTR(-ENOMEM);
> +	/* get id */
> +	if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
> +		error = -ENOMEM;
> +		goto err_out;
> +	}
> +	spin_lock(&ss->id_lock);
> +	/* Don't use 0. allocates an ID of 1-65535 */
> +	error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> +	spin_unlock(&ss->id_lock);
> +
> +	/* Returns error when there are no free spaces for new ID.*/
> +	if (error) {
> +		error = -ENOSPC;
> +		goto err_out;
> +	}
> +	if (myid > CSS_ID_MAX) {
> +		error = -ENOSPC;
> +		spin_lock(&ss->id_lock);
> +		idr_remove(&ss->idr, myid);
> +		spin_unlock(&ss->id_lock);
> +		goto err_out;

I'd rather "goto remove_idr", this seperates normal routine and error routine.

> +	}
> +
> +	newid->id = myid;
> +	newid->depth = depth;
> +	return newid;
> +err_out:
> +	kfree(newid);
> +	return ERR_PTR(error);
> +
> +}
> +
> +

2 blanks lines here.

> +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
> +{
> +	struct css_id *newid;
> +	struct cgroup_subsys_state *rootcss;
> +
> +	spin_lock_init(&ss->id_lock);
> +	idr_init(&ss->idr);
> +
> +	rootcss = init_css_set.subsys[ss->subsys_id];
> +	newid = get_new_cssid(ss, 0);
> +	if (IS_ERR(newid))
> +		return PTR_ERR(newid);
> +
> +	newid->stack[0] = newid->id;
> +	newid->css = rootcss;
> +	rootcss->id = newid;
> +	return 0;
> +}
> +
> +static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
> +			struct cgroup *child)
> +{
> +	int subsys_id, i, depth = 0;
> +	struct cgroup_subsys_state *parent_css, *child_css;
> +	struct css_id *child_id, *parent_id = NULL;
> +
> +	subsys_id = ss->subsys_id;
> +	parent_css = parent->subsys[subsys_id];
> +	child_css = child->subsys[subsys_id];
> +	depth = css_depth(parent_css) + 1;
> +	parent_id = parent_css->id;
> +
> +	child_id = get_new_cssid(ss, depth);
> +	if (IS_ERR(child_id))
> +		return PTR_ERR(child_id);
> +
> +	for (i = 0; i < depth; i++)
> +		child_id->stack[i] = parent_id->stack[i];
> +	child_id->stack[depth] = child_id->id;
> +
> +	rcu_assign_pointer(child_id->css, child_css);
> +	rcu_assign_pointer(child_css->id, child_id);
> +
> +	return 0;
> +}
> +
> +/**
> + * css_lookup - lookup css by id
> + * @ss: cgroup subsys to be looked into.
> + * @id: the id
> + *
> + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> + * NULL if not.Should be called under rcu_read_lock()

s/not./not. /

> + */
> +

remove this line ?

> +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> +{
> +	struct css_id *cssid = NULL;
> +
> +	BUG_ON(!ss->use_id);
> +	cssid = idr_find(&ss->idr, id);
> +
> +	if (unlikely(!cssid))
> +		return NULL;
> +
> +	return rcu_dereference(cssid->css);
> +}
> +
> +/**
> + * css_get_next - lookup next cgroup under specified hierarchy.
> + * @ss: pointer to subsystem
> + * @id: current position of iteration.
> + * @root: pointer to css. search tree under this.
> + * @foundid: position of found object.
> + *
> + * Search next css under the specified hierarchy of rootid. Calling under
> + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> + */
> +struct cgroup_subsys_state *
> +css_get_next(struct cgroup_subsys *ss, int id,
> +	     struct cgroup_subsys_state *root, int *foundid)
> +{
> +	struct cgroup_subsys_state *ret = NULL;
> +	struct css_id *tmp;
> +	int tmpid;
> +	int rootid = css_id(root);
> +	int depth = css_depth(root);
> +
> +	if (!rootid)
> +		return NULL;
> +
> +	BUG_ON(!ss->use_id);
> +	rcu_read_lock();
> +	/* fill start point for scan */
> +	tmpid = id;
> +	while (1) {
> +		/*
> +		 * scan next entry from bitmap(tree), tmpid is updated after
> +		 * idr_get_next().
> +		 */
> +		spin_lock(&ss->id_lock);
> +		tmp = idr_get_next(&ss->idr, &tmpid);
> +		spin_unlock(&ss->id_lock);
> +
> +		if (!tmp) {
> +			ret = NULL;

"ret = NULL" is unnecessary.

> +			break;
> +		}
> +		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> +			ret = rcu_dereference(tmp->css);
> +			if (ret) {
> +				*foundid = tmpid;
> +				break;
> +			}
> +		}
> +		/* continue to scan from next id */
> +		tmpid = tmpid + 1;
> +	}
> +
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> Index: mmotm-2.6.28-Jan7/include/linux/idr.h
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/include/linux/idr.h
> +++ mmotm-2.6.28-Jan7/include/linux/idr.h
> @@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
>  int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
>  int idr_for_each(struct idr *idp,
>  		 int (*fn)(int id, void *p, void *data), void *data);
> +void *idr_get_next(struct idr *idp, int *nextid);
>  void *idr_replace(struct idr *idp, void *ptr, int id);
>  void idr_remove(struct idr *idp, int id);
>  void idr_remove_all(struct idr *idp);
> Index: mmotm-2.6.28-Jan7/lib/idr.c
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/lib/idr.c
> +++ mmotm-2.6.28-Jan7/lib/idr.c
> @@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
>  EXPORT_SYMBOL(idr_for_each);
>  
>  /**
> + * idr_get_next - lookup next object of id to given id.
> + * @idp: idr handle
> + * @id:  pointer to lookup key
> + *
> + * Returns pointer to registered object with id, which is next number to
> + * given id.
> + */
> +
> +void *idr_get_next(struct idr *idp, int *nextidp)
> +{
> +	struct idr_layer *p, *pa[MAX_LEVEL];
> +	struct idr_layer **paa = &pa[0];
> +	int id = *nextidp;
> +	int n, max;
> +
> +	/* find first ent */
> +	n = idp->layers * IDR_BITS;
> +	max = 1 << n;
> +	p = rcu_dereference(idp->top);
> +	if (!p)
> +		return NULL;
> +
> +	while (id < max) {
> +		while (n > 0 && p) {
> +			n -= IDR_BITS;
> +			*paa++ = p;
> +			p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> +		}
> +
> +		if (p) {
> +			*nextidp = id;
> +			return p;
> +		}
> +
> +		id += 1 << n;
> +		while (n < fls(id)) {
> +			n += IDR_BITS;
> +			p = *--paa;
> +		}
> +	}
> +	return NULL;
> +}
> +
> +
> +
> +/**
>   * idr_replace - replace pointer for given id
>   * @idp: idr handle
>   * @ptr: pointer you want associated with the id
> 
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID)
  2009-01-13  7:40   ` Li Zefan
@ 2009-01-13  9:22     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-13  9:22 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	menage@google.com

On Tue, 13 Jan 2009 15:40:29 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
> > 
> 
> Except the one in cgroup_create(), I don't find any other bugs.
> 
Thank you!!

> Some minor comments below. :)
> 
<snip>

> > + * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
> > + * if cgroup_subsys.use_id == true. It can be used for looking up and scanning.
> > + * CSS ID is assigned at cgroup allocation (create) automatically
> > + * and removed when subsys calls free_css_id() function. This is because
> > + * the lifetime of cgroup_subsys_state is  subsys's matter.
> 
>                                    2 spaces ^^
> 

will fix.

> > + *
> > + * Looking up and scanning function should be called under rcu_read_lock().
> > + * Taking cgroup_mutex()/hierarchy_mutex() is not necessary for all calls.
> > + */
> > +
> > +/* Typically Called at ->destroy(), or somewhere the subsys frees
> > +  cgroup_subsys_state. */
> 
> use
> /*
>  * xxx
>  */
> for consistentcy.
> 

ok.


> > +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
> > +
> > +/* Find a cgroup_subsys_state which has given ID */
> > +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
> 
> add a blank line here
> 
ok

> > +/*
> > + * Get a cgroup whose id is greater than or equal to id under tree of root.
> 
> s/a cgroup/a css/
> 
sure,


> > + * Returning a cgroup_subsys_state or NULL.
> > + */
> > +struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
> > +		struct cgroup_subsys_state *root, int *foundid);
> > +
> > +/* Returns true if root is ancestor of cg */
> > +bool css_is_ancestor(struct cgroup_subsys_state *cg,
> > +		     struct cgroup_subsys_state *root);
> > +
> > +/* Get id and depth of css */
> > +unsigned short css_id(struct cgroup_subsys_state *css);
> > +unsigned short css_depth(struct cgroup_subsys_state *css);
> > +
> >  #else /* !CONFIG_CGROUPS */
> >  
> >  static inline int cgroup_init_early(void) { return 0; }
> > Index: mmotm-2.6.28-Jan7/kernel/cgroup.c
> > ===================================================================
> > --- mmotm-2.6.28-Jan7.orig/kernel/cgroup.c
> > +++ mmotm-2.6.28-Jan7/kernel/cgroup.c
> > @@ -46,7 +46,6 @@
> >  #include <linux/cgroupstats.h>
> >  #include <linux/hash.h>
> >  #include <linux/namei.h>
> > -
> 
> it's common to add a blank line between <linux/*> and <asm/*>
> 
ok,



> >  #include <asm/atomic.h>
> >  
> >  static DEFINE_MUTEX(cgroup_mutex);
> > @@ -185,6 +184,8 @@ struct cg_cgroup_link {
> >  static struct css_set init_css_set;
> >  static struct cg_cgroup_link init_css_set_link;
> >  
> > +static int cgroup_subsys_init_idr(struct cgroup_subsys *ss);
> > +
> >  /* css_set_lock protects the list of css_set objects, and the
> >   * chain of tasks off each css_set.  Nests outside task->alloc_lock
> >   * due to cgroup_iter_start() */
> > @@ -567,6 +568,9 @@ static struct backing_dev_info cgroup_ba
> >  	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
> >  };
> >  
> > +static int alloc_css_id(struct cgroup_subsys *ss,
> > +			struct cgroup *parent, struct cgroup *child);
> > +
> >  static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
> >  {
> >  	struct inode *inode = new_inode(sb);
> > @@ -2335,6 +2339,7 @@ static void init_cgroup_css(struct cgrou
> >  	css->cgroup = cgrp;
> >  	atomic_set(&css->refcnt, 1);
> >  	css->flags = 0;
> > +	css->id = NULL;
> >  	if (cgrp == dummytop)
> >  		set_bit(CSS_ROOT, &css->flags);
> >  	BUG_ON(cgrp->subsys[ss->subsys_id]);
> > @@ -2410,6 +2415,10 @@ static long cgroup_create(struct cgroup 
> >  			goto err_destroy;
> >  		}
> >  		init_cgroup_css(css, ss, cgrp);
> > +		if (ss->use_id)
> > +			if (alloc_css_id(ss, parent, cgrp))
> > +				goto err_destroy;
> > +		/* At error, ->destroy() callback has to free assigned ID. */
> >  	}
> >  
> >  	cgroup_lock_hierarchy(root);
> > @@ -2699,6 +2708,8 @@ int __init cgroup_init(void)
> >  		struct cgroup_subsys *ss = subsys[i];
> >  		if (!ss->early_init)
> >  			cgroup_init_subsys(ss);
> > +		if (ss->use_id)
> > +			cgroup_subsys_init_idr(ss);
> >  	}
> >  
> >  	/* Add init_css_set to the hash table */
> > @@ -3231,3 +3242,260 @@ static int __init cgroup_disable(char *s
> >  	return 1;
> >  }
> >  __setup("cgroup_disable=", cgroup_disable);
> > +
> > +/*
> > + * CSS ID -- ID per Subsys's Cgroup Subsys State.
> > + */
> > +struct css_id {
> > +	/*
> > +	 * The cgroup to which this ID points. If cgroup is removed, this will
> 
> s/The cgroup/css/ ?
> 
Ah, yes.


> > +	 * be NULL. This pointer is expected to be RCU-safe because destroy()
> > +	 * is called after synchronize_rcu(). But for safe use, css_is_removed()
> > +	 * css_tryget() should be used for avoiding race.
> > +	 */
> > +	struct cgroup_subsys_state *css;
> > +	/*
> > +	 * ID of this css.
> > +	 */
> > +	unsigned short  id;
> 
>                       ^^
> 
> > +	/*
> > +	 * Depth in hierarchy which this ID belongs to.
> > +	 */
> > +	unsigned short depth;
> > +	/*
> > +	 * ID is freed by RCU. (and lookup routine is RCU safe.)
> > +	 */
> > +	struct rcu_head rcu_head;
> > +	/*
> > +	 * Hierarchy of CSS ID belongs to.
> > +	 */
> > +	unsigned short  stack[0]; /* Array of Length (depth+1) */
> 
>                       ^^
> 

ok, I'll check double space for all my patches ...


> > +};
> > +#define CSS_ID_MAX	(65535)
> > +
> > +/*
> > + * To get ID other than 0, this should be called when !cgroup_is_removed().
> > + */
> > +unsigned short css_id(struct cgroup_subsys_state *css)
> > +{
> > +	struct css_id *cssid = rcu_dereference(css->id);
> > +
> > +	if (cssid)
> > +		return cssid->id;
> > +	return 0;
> > +}
> > +
> > +unsigned short css_depth(struct cgroup_subsys_state *css)
> > +{
> > +	struct css_id *cssid = rcu_dereference(css->id);
> > +
> > +	if (cssid)
> > +		return cssid->depth;
> > +	return 0;
> > +}
> > +
> > +bool css_is_ancestor(struct cgroup_subsys_state *child,
> > +		    struct cgroup_subsys_state *root)
> > +{
> > +	struct css_id *child_id = rcu_dereference(child->id);
> > +	struct css_id *root_id = rcu_dereference(root->id);
> > +
> > +	if (!child_id || !root_id || (child_id->depth < root_id->depth))
> > +		return false;
> > +	return child_id->stack[root_id->depth] == root_id->id;
> > +}
> > +
> > +static void __free_css_id_cb(struct rcu_head *head)
> > +{
> > +	struct css_id *id;
> > +
> > +	id = container_of(head, struct css_id, rcu_head);
> > +	kfree(id);
> > +}
> > +
> > +void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css)
> > +{
> > +	struct css_id *id = css->id;
> > +
> > +	BUG_ON(!ss->use_id);
> > +
> > +	rcu_assign_pointer(id->css, NULL);
> > +	rcu_assign_pointer(css->id, NULL);
> > +	spin_lock(&ss->id_lock);
> > +	idr_remove(&ss->idr, id->id);
> > +	spin_unlock(&ss->id_lock);
> > +	call_rcu(&id->rcu_head, __free_css_id_cb);
> > +}
> > +
> > +/*
> > + * This is called by init or create(). Then, calls to this function are
> > + * always serialized (By cgroup_mutex() at create()).
> > + */
> > +
> > +static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
> > +{
> > +	struct css_id *newid;
> > +	int myid, error, size;
> > +
> > +	BUG_ON(!ss->use_id);
> > +
> > +	size = sizeof(*newid) + sizeof(unsigned short) * (depth + 1);
> > +	newid = kzalloc(size, GFP_KERNEL);
> > +	if (!newid)
> > +		return ERR_PTR(-ENOMEM);
> > +	/* get id */
> > +	if (unlikely(!idr_pre_get(&ss->idr, GFP_KERNEL))) {
> > +		error = -ENOMEM;
> > +		goto err_out;
> > +	}
> > +	spin_lock(&ss->id_lock);
> > +	/* Don't use 0. allocates an ID of 1-65535 */
> > +	error = idr_get_new_above(&ss->idr, newid, 1, &myid);
> > +	spin_unlock(&ss->id_lock);
> > +
> > +	/* Returns error when there are no free spaces for new ID.*/
> > +	if (error) {
> > +		error = -ENOSPC;
> > +		goto err_out;
> > +	}
> > +	if (myid > CSS_ID_MAX) {
> > +		error = -ENOSPC;
> > +		spin_lock(&ss->id_lock);
> > +		idr_remove(&ss->idr, myid);
> > +		spin_unlock(&ss->id_lock);
> > +		goto err_out;
> 
> I'd rather "goto remove_idr", this seperates normal routine and error routine.
> 

ok.

> > +	}
> > +
> > +	newid->id = myid;
> > +	newid->depth = depth;
> > +	return newid;
> > +err_out:
> > +	kfree(newid);
> > +	return ERR_PTR(error);
> > +
> > +}
> > +
> > +
> 
> 2 blanks lines here.
> 
> > +static int __init cgroup_subsys_init_idr(struct cgroup_subsys *ss)
> > +{
> > +	struct css_id *newid;
> > +	struct cgroup_subsys_state *rootcss;
> > +
> > +	spin_lock_init(&ss->id_lock);
> > +	idr_init(&ss->idr);
> > +
> > +	rootcss = init_css_set.subsys[ss->subsys_id];
> > +	newid = get_new_cssid(ss, 0);
> > +	if (IS_ERR(newid))
> > +		return PTR_ERR(newid);
> > +
> > +	newid->stack[0] = newid->id;
> > +	newid->css = rootcss;
> > +	rootcss->id = newid;
> > +	return 0;
> > +}
> > +
> > +static int alloc_css_id(struct cgroup_subsys *ss, struct cgroup *parent,
> > +			struct cgroup *child)
> > +{
> > +	int subsys_id, i, depth = 0;
> > +	struct cgroup_subsys_state *parent_css, *child_css;
> > +	struct css_id *child_id, *parent_id = NULL;
> > +
> > +	subsys_id = ss->subsys_id;
> > +	parent_css = parent->subsys[subsys_id];
> > +	child_css = child->subsys[subsys_id];
> > +	depth = css_depth(parent_css) + 1;
> > +	parent_id = parent_css->id;
> > +
> > +	child_id = get_new_cssid(ss, depth);
> > +	if (IS_ERR(child_id))
> > +		return PTR_ERR(child_id);
> > +
> > +	for (i = 0; i < depth; i++)
> > +		child_id->stack[i] = parent_id->stack[i];
> > +	child_id->stack[depth] = child_id->id;
> > +
> > +	rcu_assign_pointer(child_id->css, child_css);
> > +	rcu_assign_pointer(child_css->id, child_id);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * css_lookup - lookup css by id
> > + * @ss: cgroup subsys to be looked into.
> > + * @id: the id
> > + *
> > + * Returns pointer to cgroup_subsys_state if there is valid one with id.
> > + * NULL if not.Should be called under rcu_read_lock()
> 
> s/not./not. /
> 
yes..


> > + */
> > +
> 
> remove this line ?
> 

ok.

Thank you very much.

-Kame

> > +struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
> > +{
> > +	struct css_id *cssid = NULL;
> > +
> > +	BUG_ON(!ss->use_id);
> > +	cssid = idr_find(&ss->idr, id);
> > +
> > +	if (unlikely(!cssid))
> > +		return NULL;
> > +
> > +	return rcu_dereference(cssid->css);
> > +}
> > +
> > +/**
> > + * css_get_next - lookup next cgroup under specified hierarchy.
> > + * @ss: pointer to subsystem
> > + * @id: current position of iteration.
> > + * @root: pointer to css. search tree under this.
> > + * @foundid: position of found object.
> > + *
> > + * Search next css under the specified hierarchy of rootid. Calling under
> > + * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
> > + */
> > +struct cgroup_subsys_state *
> > +css_get_next(struct cgroup_subsys *ss, int id,
> > +	     struct cgroup_subsys_state *root, int *foundid)
> > +{
> > +	struct cgroup_subsys_state *ret = NULL;
> > +	struct css_id *tmp;
> > +	int tmpid;
> > +	int rootid = css_id(root);
> > +	int depth = css_depth(root);
> > +
> > +	if (!rootid)
> > +		return NULL;
> > +
> > +	BUG_ON(!ss->use_id);
> > +	rcu_read_lock();
> > +	/* fill start point for scan */
> > +	tmpid = id;
> > +	while (1) {
> > +		/*
> > +		 * scan next entry from bitmap(tree), tmpid is updated after
> > +		 * idr_get_next().
> > +		 */
> > +		spin_lock(&ss->id_lock);
> > +		tmp = idr_get_next(&ss->idr, &tmpid);
> > +		spin_unlock(&ss->id_lock);
> > +
> > +		if (!tmp) {
> > +			ret = NULL;
> 
> "ret = NULL" is unnecessary.
> 
will check again.


> > +			break;
> > +		}
> > +		if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
> > +			ret = rcu_dereference(tmp->css);
> > +			if (ret) {
> > +				*foundid = tmpid;
> > +				break;
> > +			}
> > +		}
> > +		/* continue to scan from next id */
> > +		tmpid = tmpid + 1;
> > +	}
> > +
> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> > +
> > Index: mmotm-2.6.28-Jan7/include/linux/idr.h
> > ===================================================================
> > --- mmotm-2.6.28-Jan7.orig/include/linux/idr.h
> > +++ mmotm-2.6.28-Jan7/include/linux/idr.h
> > @@ -106,6 +106,7 @@ int idr_get_new(struct idr *idp, void *p
> >  int idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
> >  int idr_for_each(struct idr *idp,
> >  		 int (*fn)(int id, void *p, void *data), void *data);
> > +void *idr_get_next(struct idr *idp, int *nextid);
> >  void *idr_replace(struct idr *idp, void *ptr, int id);
> >  void idr_remove(struct idr *idp, int id);
> >  void idr_remove_all(struct idr *idp);
> > Index: mmotm-2.6.28-Jan7/lib/idr.c
> > ===================================================================
> > --- mmotm-2.6.28-Jan7.orig/lib/idr.c
> > +++ mmotm-2.6.28-Jan7/lib/idr.c
> > @@ -579,6 +579,52 @@ int idr_for_each(struct idr *idp,
> >  EXPORT_SYMBOL(idr_for_each);
> >  
> >  /**
> > + * idr_get_next - lookup next object of id to given id.
> > + * @idp: idr handle
> > + * @id:  pointer to lookup key
> > + *
> > + * Returns pointer to registered object with id, which is next number to
> > + * given id.
> > + */
> > +
> > +void *idr_get_next(struct idr *idp, int *nextidp)
> > +{
> > +	struct idr_layer *p, *pa[MAX_LEVEL];
> > +	struct idr_layer **paa = &pa[0];
> > +	int id = *nextidp;
> > +	int n, max;
> > +
> > +	/* find first ent */
> > +	n = idp->layers * IDR_BITS;
> > +	max = 1 << n;
> > +	p = rcu_dereference(idp->top);
> > +	if (!p)
> > +		return NULL;
> > +
> > +	while (id < max) {
> > +		while (n > 0 && p) {
> > +			n -= IDR_BITS;
> > +			*paa++ = p;
> > +			p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
> > +		}
> > +
> > +		if (p) {
> > +			*nextidp = id;
> > +			return p;
> > +		}
> > +
> > +		id += 1 << n;
> > +		while (n < fls(id)) {
> > +			n += IDR_BITS;
> > +			p = *--paa;
> > +		}
> > +	}
> > +	return NULL;
> > +}
> > +
> > +
> > +
> > +/**
> >   * idr_replace - replace pointer for given id
> >   * @idp: idr handle
> >   * @ptr: pointer you want associated with the id
> > 
> > 
> > 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 2/4] memcg: use CSS ID in memcg
  2009-01-08  9:25 [RFC][PATCH] cgroup and memcg updates 20090108 KAMEZAWA Hiroyuki
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
@ 2009-01-08  9:30 ` KAMEZAWA Hiroyuki
  2009-01-12 12:14   ` Balbir Singh
  2009-01-08  9:32 ` [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy KAMEZAWA Hiroyuki
  2009-01-08  9:35 ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-08  9:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com, menage@google.com


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Use css ID in memcg.

Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy.

	Assume folloing tree.

	group_A (ID=3)
		/01 (ID=4)
		   /0A (ID=7)
		/02 (ID=10)
	group_B (ID=5)
	and task in group_A/01/0A hits limit at group_A.

	reclaim will be done in following order (round-robin).
	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
	-> group_A -> .....

	Round robin by ID. The last visited cgroup is recorded and restart
	from it when it start reclaim again.
	(More smart algorithm can be implemented..)

	No cgroup_mutex or hierarchy_mutex is required.

Changelog (v1) -> (v2)
  - Updated texts.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |  219 ++++++++++++++++++++------------------------------------
 1 file changed, 81 insertions(+), 138 deletions(-)

Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Jan7/mm/memcontrol.c
@@ -154,9 +154,10 @@ struct mem_cgroup {
 
 	/*
 	 * While reclaiming in a hiearchy, we cache the last child we
-	 * reclaimed from. Protected by hierarchy_mutex
+	 * reclaimed from.
 	 */
-	struct mem_cgroup *last_scanned_child;
+	int last_scanned_child;
+	unsigned long scan_age;
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
@@ -613,103 +614,6 @@ unsigned long mem_cgroup_isolate_pages(u
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
-/*
- * This routine finds the DFS walk successor. This routine should be
- * called with hierarchy_mutex held
- */
-static struct mem_cgroup *
-mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
-{
-	struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
-
-	curr_cgroup = curr->css.cgroup;
-	root_cgroup = root_mem->css.cgroup;
-
-	if (!list_empty(&curr_cgroup->children)) {
-		/*
-		 * Walk down to children
-		 */
-		mem_cgroup_put(curr);
-		cgroup = list_entry(curr_cgroup->children.next,
-						struct cgroup, sibling);
-		curr = mem_cgroup_from_cont(cgroup);
-		mem_cgroup_get(curr);
-		goto done;
-	}
-
-visit_parent:
-	if (curr_cgroup == root_cgroup) {
-		mem_cgroup_put(curr);
-		curr = root_mem;
-		mem_cgroup_get(curr);
-		goto done;
-	}
-
-	/*
-	 * Goto next sibling
-	 */
-	if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
-		mem_cgroup_put(curr);
-		cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
-						sibling);
-		curr = mem_cgroup_from_cont(cgroup);
-		mem_cgroup_get(curr);
-		goto done;
-	}
-
-	/*
-	 * Go up to next parent and next parent's sibling if need be
-	 */
-	curr_cgroup = curr_cgroup->parent;
-	goto visit_parent;
-
-done:
-	root_mem->last_scanned_child = curr;
-	return curr;
-}
-
-/*
- * Visit the first child (need not be the first child as per the ordering
- * of the cgroup list, since we track last_scanned_child) of @mem and use
- * that to reclaim free pages from.
- */
-static struct mem_cgroup *
-mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
-{
-	struct cgroup *cgroup;
-	struct mem_cgroup *ret;
-	bool obsolete;
-
-	obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
-
-	/*
-	 * Scan all children under the mem_cgroup mem
-	 */
-	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
-	if (list_empty(&root_mem->css.cgroup->children)) {
-		ret = root_mem;
-		goto done;
-	}
-
-	if (!root_mem->last_scanned_child || obsolete) {
-
-		if (obsolete && root_mem->last_scanned_child)
-			mem_cgroup_put(root_mem->last_scanned_child);
-
-		cgroup = list_first_entry(&root_mem->css.cgroup->children,
-				struct cgroup, sibling);
-		ret = mem_cgroup_from_cont(cgroup);
-		mem_cgroup_get(ret);
-	} else
-		ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
-						root_mem);
-
-done:
-	root_mem->last_scanned_child = ret;
-	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
-	return ret;
-}
-
 static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
 {
 	if (do_swap_account) {
@@ -739,49 +643,84 @@ static unsigned int get_swappiness(struc
 }
 
 /*
- * Dance down the hierarchy if needed to reclaim memory. We remember the
- * last child we reclaimed from, so that we don't end up penalizing
- * one child extensively based on its position in the children list.
+ * Visit the first child (need not be the first child as per the ordering
+ * of the cgroup list, since we track last_scanned_child) of @mem and use
+ * that to reclaim free pages from.
+ */
+static struct mem_cgroup *
+mem_cgroup_select_victim(struct mem_cgroup *root_mem)
+{
+	struct mem_cgroup *ret = NULL;
+	struct cgroup_subsys_state *css;
+	int nextid, found;
+
+	if (!root_mem->use_hierarchy) {
+		spin_lock(&root_mem->reclaim_param_lock);
+		root_mem->scan_age++;
+		spin_unlock(&root_mem->reclaim_param_lock);
+		css_get(&root_mem->css);
+		ret = root_mem;
+	}
+
+	while (!ret) {
+		rcu_read_lock();
+		nextid = root_mem->last_scanned_child + 1;
+		css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
+				   &found);
+		if (css && css_tryget(css))
+			ret = container_of(css, struct mem_cgroup, css);
+
+		rcu_read_unlock();
+		/* Updates scanning parameter */
+		spin_lock(&root_mem->reclaim_param_lock);
+		if (!css) {
+			/* this means start scan from ID:1 */
+			root_mem->last_scanned_child = 0;
+			root_mem->scan_age++;
+		} else
+			root_mem->last_scanned_child = found;
+		spin_unlock(&root_mem->reclaim_param_lock);
+	}
+
+	return ret;
+}
+
+/*
+ * Scan the hierarchy if needed to reclaim memory. We remember the last child
+ * we reclaimed from, so that we don't end up penalizing one child extensively
+ * based on its position in the children list.
  *
  * root_mem is the original ancestor that we've been reclaim from.
+ *
+ * scan_age is updated every time when select_victim returns "root" and
+ * it's shared under system (per hierarchy root).
+ *
+ * We give up and return to the caller when scan_age is increased by 2. This
+ * means try_to_free_mem_cgroup_pages() is called against all children cgroup,
+ * at least once. The caller itself will do further retry if necessary.
  */
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						gfp_t gfp_mask, bool noswap)
 {
-	struct mem_cgroup *next_mem;
-	int ret = 0;
-
-	/*
-	 * Reclaim unconditionally and don't check for return value.
-	 * We need to reclaim in the current group and down the tree.
-	 * One might think about checking for children before reclaiming,
-	 * but there might be left over accounting, even after children
-	 * have left.
-	 */
-	ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
-					   get_swappiness(root_mem));
-	if (mem_cgroup_check_under_limit(root_mem))
-		return 0;
-	if (!root_mem->use_hierarchy)
-		return ret;
-
-	next_mem = mem_cgroup_get_first_node(root_mem);
-
-	while (next_mem != root_mem) {
-		if (mem_cgroup_is_obsolete(next_mem)) {
-			mem_cgroup_put(next_mem);
-			next_mem = mem_cgroup_get_first_node(root_mem);
-			continue;
-		}
-		ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
-						   get_swappiness(next_mem));
+	struct mem_cgroup *victim;
+	unsigned long start_age;
+	int ret, total = 0;
+	/*
+	 * Reclaim memory from cgroups under root_mem in round robin.
+	 */
+	start_age = root_mem->scan_age;
+
+	while (time_after((start_age + 2UL), root_mem->scan_age)) {
+		victim = mem_cgroup_select_victim(root_mem);
+		/* we use swappiness of local cgroup */
+		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
+						   get_swappiness(victim));
+		css_put(&victim->css);
+		total += ret;
 		if (mem_cgroup_check_under_limit(root_mem))
-			return 0;
-		mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
-		next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
-		mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
+			return 1 + total;
 	}
-	return ret;
+	return total;
 }
 
 bool mem_cgroup_oom_called(struct task_struct *task)
@@ -1298,7 +1237,6 @@ __mem_cgroup_uncharge_common(struct page
 	default:
 		break;
 	}
-
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
 		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
@@ -2148,6 +2086,8 @@ static void __mem_cgroup_free(struct mem
 {
 	int node;
 
+	free_css_id(&mem_cgroup_subsys, &mem->css);
+
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);
 
@@ -2185,11 +2125,12 @@ static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	struct mem_cgroup *mem, *parent;
+	long error = -ENOMEM;
 	int node;
 
 	mem = mem_cgroup_alloc();
 	if (!mem)
-		return ERR_PTR(-ENOMEM);
+		return ERR_PTR(error);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -2210,7 +2151,8 @@ mem_cgroup_create(struct cgroup_subsys *
 		res_counter_init(&mem->res, NULL);
 		res_counter_init(&mem->memsw, NULL);
 	}
-	mem->last_scanned_child = NULL;
+	mem->last_scanned_child = 0;
+	mem->scan_age = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)
@@ -2219,7 +2161,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	return &mem->css;
 free_out:
 	__mem_cgroup_free(mem);
-	return ERR_PTR(-ENOMEM);
+	return ERR_PTR(error);
 }
 
 static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
@@ -2270,6 +2212,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
+	.use_id = 1,
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 2/4] memcg: use CSS ID in memcg
  2009-01-08  9:30 ` [RFC][PATCH 2/4] memcg: use CSS ID in memcg KAMEZAWA Hiroyuki
@ 2009-01-12 12:14   ` Balbir Singh
  2009-01-15  6:19     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2009-01-12 12:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com,
	menage@google.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-01-08 18:30:03]:

> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Use css ID in memcg.
> 
> Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy.
> 
> 	Assume folloing tree.
> 
> 	group_A (ID=3)
> 		/01 (ID=4)
> 		   /0A (ID=7)
> 		/02 (ID=10)
> 	group_B (ID=5)
> 	and task in group_A/01/0A hits limit at group_A.
> 
> 	reclaim will be done in following order (round-robin).
> 	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
> 	-> group_A -> .....
> 
> 	Round robin by ID. The last visited cgroup is recorded and restart
> 	from it when it start reclaim again.
> 	(More smart algorithm can be implemented..)
> 
> 	No cgroup_mutex or hierarchy_mutex is required.
> 
> Changelog (v1) -> (v2)
>   - Updated texts.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> ---
>  mm/memcontrol.c |  219 ++++++++++++++++++++------------------------------------
>  1 file changed, 81 insertions(+), 138 deletions(-)
> 
> Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
> +++ mmotm-2.6.28-Jan7/mm/memcontrol.c
> @@ -154,9 +154,10 @@ struct mem_cgroup {
> 
>  	/*
>  	 * While reclaiming in a hiearchy, we cache the last child we
> -	 * reclaimed from. Protected by hierarchy_mutex
> +	 * reclaimed from.
>  	 */
> -	struct mem_cgroup *last_scanned_child;
> +	int last_scanned_child;
> +	unsigned long scan_age;

A comment describing what scan_age represents and how it impacts
reclaim would be nice to have

>  	/*
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
> @@ -613,103 +614,6 @@ unsigned long mem_cgroup_isolate_pages(u
>  #define mem_cgroup_from_res_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
> 
> -/*
> - * This routine finds the DFS walk successor. This routine should be
> - * called with hierarchy_mutex held
> - */
> -static struct mem_cgroup *
> -mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> -{
> -	struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
> -
> -	curr_cgroup = curr->css.cgroup;
> -	root_cgroup = root_mem->css.cgroup;
> -
> -	if (!list_empty(&curr_cgroup->children)) {
> -		/*
> -		 * Walk down to children
> -		 */
> -		mem_cgroup_put(curr);
> -		cgroup = list_entry(curr_cgroup->children.next,
> -						struct cgroup, sibling);
> -		curr = mem_cgroup_from_cont(cgroup);
> -		mem_cgroup_get(curr);
> -		goto done;
> -	}
> -
> -visit_parent:
> -	if (curr_cgroup == root_cgroup) {
> -		mem_cgroup_put(curr);
> -		curr = root_mem;
> -		mem_cgroup_get(curr);
> -		goto done;
> -	}
> -
> -	/*
> -	 * Goto next sibling
> -	 */
> -	if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
> -		mem_cgroup_put(curr);
> -		cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
> -						sibling);
> -		curr = mem_cgroup_from_cont(cgroup);
> -		mem_cgroup_get(curr);
> -		goto done;
> -	}
> -
> -	/*
> -	 * Go up to next parent and next parent's sibling if need be
> -	 */
> -	curr_cgroup = curr_cgroup->parent;
> -	goto visit_parent;
> -
> -done:
> -	root_mem->last_scanned_child = curr;
> -	return curr;
> -}
> -
> -/*
> - * Visit the first child (need not be the first child as per the ordering
> - * of the cgroup list, since we track last_scanned_child) of @mem and use
> - * that to reclaim free pages from.
> - */
> -static struct mem_cgroup *
> -mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> -{
> -	struct cgroup *cgroup;
> -	struct mem_cgroup *ret;
> -	bool obsolete;
> -
> -	obsolete = mem_cgroup_is_obsolete(root_mem->last_scanned_child);
> -
> -	/*
> -	 * Scan all children under the mem_cgroup mem
> -	 */
> -	mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> -	if (list_empty(&root_mem->css.cgroup->children)) {
> -		ret = root_mem;
> -		goto done;
> -	}
> -
> -	if (!root_mem->last_scanned_child || obsolete) {
> -
> -		if (obsolete && root_mem->last_scanned_child)
> -			mem_cgroup_put(root_mem->last_scanned_child);
> -
> -		cgroup = list_first_entry(&root_mem->css.cgroup->children,
> -				struct cgroup, sibling);
> -		ret = mem_cgroup_from_cont(cgroup);
> -		mem_cgroup_get(ret);
> -	} else
> -		ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> -						root_mem);
> -
> -done:
> -	root_mem->last_scanned_child = ret;
> -	mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> -	return ret;
> -}
> -
>  static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
>  {
>  	if (do_swap_account) {
> @@ -739,49 +643,84 @@ static unsigned int get_swappiness(struc
>  }
> 
>  /*
> - * Dance down the hierarchy if needed to reclaim memory. We remember the
> - * last child we reclaimed from, so that we don't end up penalizing
> - * one child extensively based on its position in the children list.
> + * Visit the first child (need not be the first child as per the ordering
> + * of the cgroup list, since we track last_scanned_child) of @mem and use
> + * that to reclaim free pages from.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_select_victim(struct mem_cgroup *root_mem)
> +{
> +	struct mem_cgroup *ret = NULL;
> +	struct cgroup_subsys_state *css;
> +	int nextid, found;
> +
> +	if (!root_mem->use_hierarchy) {
> +		spin_lock(&root_mem->reclaim_param_lock);
> +		root_mem->scan_age++;
> +		spin_unlock(&root_mem->reclaim_param_lock);
> +		css_get(&root_mem->css);
> +		ret = root_mem;
> +	}
> +
> +	while (!ret) {
> +		rcu_read_lock();
> +		nextid = root_mem->last_scanned_child + 1;
> +		css = css_get_next(&mem_cgroup_subsys, nextid, &root_mem->css,
> +				   &found);
> +		if (css && css_tryget(css))
> +			ret = container_of(css, struct mem_cgroup, css);
> +
> +		rcu_read_unlock();
> +		/* Updates scanning parameter */
> +		spin_lock(&root_mem->reclaim_param_lock);
> +		if (!css) {
> +			/* this means start scan from ID:1 */
> +			root_mem->last_scanned_child = 0;
> +			root_mem->scan_age++;
> +		} else
> +			root_mem->last_scanned_child = found;
> +		spin_unlock(&root_mem->reclaim_param_lock);
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Scan the hierarchy if needed to reclaim memory. We remember the last child
> + * we reclaimed from, so that we don't end up penalizing one child extensively
> + * based on its position in the children list.
>   *
>   * root_mem is the original ancestor that we've been reclaim from.
> + *
> + * scan_age is updated every time when select_victim returns "root" and
> + * it's shared under system (per hierarchy root).
> + *
> + * We give up and return to the caller when scan_age is increased by 2. This
> + * means try_to_free_mem_cgroup_pages() is called against all children cgroup,
> + * at least once. The caller itself will do further retry if necessary.
>   */
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  						gfp_t gfp_mask, bool noswap)
>  {
> -	struct mem_cgroup *next_mem;
> -	int ret = 0;
> -
> -	/*
> -	 * Reclaim unconditionally and don't check for return value.
> -	 * We need to reclaim in the current group and down the tree.
> -	 * One might think about checking for children before reclaiming,
> -	 * but there might be left over accounting, even after children
> -	 * have left.
> -	 */
> -	ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap,
> -					   get_swappiness(root_mem));
> -	if (mem_cgroup_check_under_limit(root_mem))
> -		return 0;
> -	if (!root_mem->use_hierarchy)
> -		return ret;
> -
> -	next_mem = mem_cgroup_get_first_node(root_mem);
> -
> -	while (next_mem != root_mem) {
> -		if (mem_cgroup_is_obsolete(next_mem)) {
> -			mem_cgroup_put(next_mem);
> -			next_mem = mem_cgroup_get_first_node(root_mem);
> -			continue;
> -		}
> -		ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap,
> -						   get_swappiness(next_mem));
> +	struct mem_cgroup *victim;
> +	unsigned long start_age;
> +	int ret, total = 0;
> +	/*
> +	 * Reclaim memory from cgroups under root_mem in round robin.
> +	 */
> +	start_age = root_mem->scan_age;
> +
> +	while (time_after((start_age + 2UL), root_mem->scan_age)) {

This is confusing, why do we use time_after with scan_age. scan_age
seems to be incremented every time we scan and has no relationship
with time. The second thing is what happens if time_after() always
returns 0, if we've been aggressively scanning? The logic needs some
commenting, why the magic number 2?

> +		victim = mem_cgroup_select_victim(root_mem);
> +		/* we use swappiness of local cgroup */
> +		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
> +						   get_swappiness(victim));
> +		css_put(&victim->css);
> +		total += ret;
>  		if (mem_cgroup_check_under_limit(root_mem))
> -			return 0;
> -		mutex_lock(&mem_cgroup_subsys.hierarchy_mutex);
> -		next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> -		mutex_unlock(&mem_cgroup_subsys.hierarchy_mutex);
> +			return 1 + total;
>  	}
> -	return ret;
> +	return total;
>  }
> 
>  bool mem_cgroup_oom_called(struct task_struct *task)
> @@ -1298,7 +1237,6 @@ __mem_cgroup_uncharge_common(struct page
>  	default:
>  		break;
>  	}
> -
>  	res_counter_uncharge(&mem->res, PAGE_SIZE);
>  	if (do_swap_account && (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
>  		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> @@ -2148,6 +2086,8 @@ static void __mem_cgroup_free(struct mem
>  {
>  	int node;
> 
> +	free_css_id(&mem_cgroup_subsys, &mem->css);
> +
>  	for_each_node_state(node, N_POSSIBLE)
>  		free_mem_cgroup_per_zone_info(mem, node);
> 
> @@ -2185,11 +2125,12 @@ static struct cgroup_subsys_state *
>  mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  {
>  	struct mem_cgroup *mem, *parent;
> +	long error = -ENOMEM;
>  	int node;
> 
>  	mem = mem_cgroup_alloc();
>  	if (!mem)
> -		return ERR_PTR(-ENOMEM);
> +		return ERR_PTR(error);
> 
>  	for_each_node_state(node, N_POSSIBLE)
>  		if (alloc_mem_cgroup_per_zone_info(mem, node))
> @@ -2210,7 +2151,8 @@ mem_cgroup_create(struct cgroup_subsys *
>  		res_counter_init(&mem->res, NULL);
>  		res_counter_init(&mem->memsw, NULL);
>  	}
> -	mem->last_scanned_child = NULL;
> +	mem->last_scanned_child = 0;
> +	mem->scan_age = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
> 
>  	if (parent)
> @@ -2219,7 +2161,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	return &mem->css;
>  free_out:
>  	__mem_cgroup_free(mem);
> -	return ERR_PTR(-ENOMEM);
> +	return ERR_PTR(error);
>  }
> 
>  static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
> @@ -2270,6 +2212,7 @@ struct cgroup_subsys mem_cgroup_subsys =
>  	.populate = mem_cgroup_populate,
>  	.attach = mem_cgroup_move_task,
>  	.early_init = 0,
> +	.use_id = 1,
>  };
> 
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> 
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 2/4] memcg: use CSS ID in memcg
  2009-01-12 12:14   ` Balbir Singh
@ 2009-01-15  6:19     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-15  6:19 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, lizf@cn.fujitsu.com,
	menage@google.com

On Mon, 12 Jan 2009 17:44:24 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
					   get_swappiness(next_mem));
> > +	struct mem_cgroup *victim;
> > +	unsigned long start_age;
> > +	int ret, total = 0;
> > +	/*
> > +	 * Reclaim memory from cgroups under root_mem in round robin.
> > +	 */
> > +	start_age = root_mem->scan_age;
> > +
> > +	while (time_after((start_age + 2UL), root_mem->scan_age)) {
> 
> This is confusing, why do we use time_after with scan_age. scan_age
> seems to be incremented every time we scan and has no relationship
> with time. 

time_after() is useful macro for checking counter which can go MAX-1->MAX->0->1>...


> The second thing is what happens if time_after() always
> returns 0, if we've been aggressively scanning? 
That never happens.

> The logic needs some  commenting, why the magic number 2?
> 
memcg->scan_age is update when 
 - the memcg is root of hierarchy.
 - we reclaim memory from memcg.

So, memcg->scan_age is update by 2 means, all memcg under hierarchy is accessed
by reclaim routine.

example) Consider hierarhy like this.
 
          xxx(ID=8)
             /yyy (ID=4)
             /zzz (ID=9)
             /www (ID=3)

In this case, scan will be done in following order

  .....->3->4->8->9->3->4->8->9->...
(start point is determined by last_scanned_child)

everytime we visit "8", 8's scan_age is updated.

So, if we see "8"  2 times, all other groups 4,9,3 is all accessed for freeing
memory. (by me or other threads.)


-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy
  2009-01-08  9:25 [RFC][PATCH] cgroup and memcg updates 20090108 KAMEZAWA Hiroyuki
  2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
  2009-01-08  9:30 ` [RFC][PATCH 2/4] memcg: use CSS ID in memcg KAMEZAWA Hiroyuki
@ 2009-01-08  9:32 ` KAMEZAWA Hiroyuki
  2009-01-13  8:33   ` Li Zefan
  2009-01-08  9:35 ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-08  9:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com, menage@google.com


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Current memcg's oom-killer has 2 problems when hierarchy is used.

Assume following tree,
	Group_A/     use_hierarchy = 1, limit=1G
	        01/  nolimit
		02/  nolimit
		03/  nolimit
In this case, sum of memory usage from 01,02,03 is limited to 1G (of Group_A).

Assume a task in Group_A/01 causes OOM by limit of Group_A, in this case,
bad_process() will select a process in Group_A, not in 01,02,03.

This patch fixes the behavior and all processes under Group_A's hierarchy
01,02,03 will be OOM kill candidates.

And now, to avoid calling oom_kill twice, mem_cgroup_oom_called() hook is
used in pagefault_out_of_memory(). This check the timestamp of the most
recent OOM in memcg. This timestamp should be updated per hierarchy.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


---
 Documentation/controllers/memcg_test.txt |   15 ++++++++
 include/linux/memcontrol.h               |    4 +-
 mm/memcontrol.c                          |   55 +++++++++++++++++++++++++++++--
 mm/oom_kill.c                            |    4 +-
 4 files changed, 72 insertions(+), 6 deletions(-)

Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Jan7/mm/memcontrol.c
@@ -431,12 +431,31 @@ void mem_cgroup_move_lists(struct page *
 	mem_cgroup_add_lru_list(page, to);
 }
 
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
+static int
+mm_match_cgroup_hierarchy(struct mm_struct *mm, struct mem_cgroup *mem)
+{
+	struct mem_cgroup *curr;
+	int ret;
+
+	if (!mm)
+		return 0;
+	rcu_read_lock();
+	curr = mem_cgroup_from_task(mm->owner);
+	if (mem->use_hierarchy)
+		ret = css_is_ancestor(&curr->css, &mem->css);
+	else
+		ret = (curr == mem);
+	rcu_read_unlock();
+	return ret;
+}
+
+int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+				 struct mem_cgroup *mem)
 {
 	int ret;
 
 	task_lock(task);
-	ret = task->mm && mm_match_cgroup(task->mm, mem);
+	ret = mm_match_cgroup_hierarchy(task->mm, mem);
 	task_unlock(task);
 	return ret;
 }
@@ -723,6 +742,36 @@ static int mem_cgroup_hierarchical_recla
 	return total;
 }
 
+/*
+ *  Update last_oom_jiffies of hierarchy.
+ */
+void mem_cgroup_update_oom_jiffies(struct mem_cgroup *mem)
+{
+	struct mem_cgroup *cur;
+	struct cgroup_subsys_state *css;
+	int id, found;
+
+	if (!mem->use_hierarchy) {
+		mem->last_oom_jiffies = jiffies;
+		return;
+	}
+
+	id = 0;
+	rcu_read_lock();
+	while (1) {
+		css = css_get_next(&mem_cgroup_subsys, id, &mem->css, &found);
+		if (!css)
+			break;
+		if (css_tryget(css)) {
+			cur = container_of(css, struct mem_cgroup, css);
+			cur->last_oom_jiffies = jiffies;
+			css_put(css);
+		}
+		id = found + 1;
+	}
+	rcu_read_unlock();
+	return;
+}
 bool mem_cgroup_oom_called(struct task_struct *task)
 {
 	bool ret = false;
@@ -819,7 +868,7 @@ static int __mem_cgroup_try_charge(struc
 				mutex_lock(&memcg_tasklist);
 				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
 				mutex_unlock(&memcg_tasklist);
-				mem_over_limit->last_oom_jiffies = jiffies;
+				mem_cgroup_update_oom_jiffies(mem_over_limit);
 			}
 			goto nomem;
 		}
Index: mmotm-2.6.28-Jan7/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.28-Jan7.orig/include/linux/memcontrol.h
+++ mmotm-2.6.28-Jan7/include/linux/memcontrol.h
@@ -66,7 +66,9 @@ extern unsigned long mem_cgroup_isolate_
 					struct mem_cgroup *mem_cont,
 					int active, int file);
 extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
+
+int task_in_mem_cgroup_hierarchy(struct task_struct *task,
+			struct mem_cgroup *mem);
 
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
Index: mmotm-2.6.28-Jan7/mm/oom_kill.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/mm/oom_kill.c
+++ mmotm-2.6.28-Jan7/mm/oom_kill.c
@@ -220,7 +220,7 @@ static struct task_struct *select_bad_pr
 		/* skip the init task */
 		if (is_global_init(p))
 			continue;
-		if (mem && !task_in_mem_cgroup(p, mem))
+		if (mem && !task_in_mem_cgroup_hierarchy(p, mem))
 			continue;
 
 		/*
@@ -292,7 +292,7 @@ static void dump_tasks(const struct mem_
 		 */
 		if (!p->mm)
 			continue;
-		if (mem && !task_in_mem_cgroup(p, mem))
+		if (mem && !task_in_mem_cgroup_hierarchy(p, mem))
 			continue;
 		if (!thread_group_leader(p))
 			continue;
Index: mmotm-2.6.28-Jan7/Documentation/controllers/memcg_test.txt
===================================================================
--- mmotm-2.6.28-Jan7.orig/Documentation/controllers/memcg_test.txt
+++ mmotm-2.6.28-Jan7/Documentation/controllers/memcg_test.txt
@@ -340,3 +340,18 @@ Under below explanation, we assume CONFI
 	# mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices
 
 	and do task move, mkdir, rmdir etc...under this.
+
+ 9.7 OOM-KILL
+	If memcg finds out-of-memory, OOM Kill should kill a task in memcg.
+	The select_bad_process() should take hierarchy into account and
+	OOM-KILL itself shoudn't call panic_on_oom or some hooks for generic
+	OOM.
+
+	It's not difficult to cause OOM under memcg by setting memsw.limit
+	as following.
+	# echo 50M > memory.limit_in_bytes
+	# echo 50M > memory.memsw.limit_in_bytes
+	and run malloc(51M) program.
+	(Alternative is do swapoff/mlock and malloc())
+
+

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy
  2009-01-08  9:32 ` [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy KAMEZAWA Hiroyuki
@ 2009-01-13  8:33   ` Li Zefan
  2009-01-13  9:25     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Li Zefan @ 2009-01-13  8:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	menage@google.com

> -int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
> +static int
> +mm_match_cgroup_hierarchy(struct mm_struct *mm, struct mem_cgroup *mem)
> +{
> +	struct mem_cgroup *curr;
> +	int ret;
> +
> +	if (!mm)
> +		return 0;
> +	rcu_read_lock();
> +	curr = mem_cgroup_from_task(mm->owner);

curr can be NULL ?

> +	if (mem->use_hierarchy)
> +		ret = css_is_ancestor(&curr->css, &mem->css);
> +	else
> +		ret = (curr == mem);
> +	rcu_read_unlock();
> +	return ret;
> +}
> +

...

> +void mem_cgroup_update_oom_jiffies(struct mem_cgroup *mem)
> +{
> +	struct mem_cgroup *cur;
> +	struct cgroup_subsys_state *css;
> +	int id, found;
> +
> +	if (!mem->use_hierarchy) {
> +		mem->last_oom_jiffies = jiffies;
> +		return;
> +	}
> +
> +	id = 0;
> +	rcu_read_lock();
> +	while (1) {
> +		css = css_get_next(&mem_cgroup_subsys, id, &mem->css, &found);
> +		if (!css)
> +			break;
> +		if (css_tryget(css)) {
> +			cur = container_of(css, struct mem_cgroup, css);
> +			cur->last_oom_jiffies = jiffies;
> +			css_put(css);
> +		}
> +		id = found + 1;
> +	}
> +	rcu_read_unlock();
> +	return;

redundant "return"

> +}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy
  2009-01-13  8:33   ` Li Zefan
@ 2009-01-13  9:25     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-13  9:25 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	menage@google.com

On Tue, 13 Jan 2009 16:33:12 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> > -int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
> > +static int
> > +mm_match_cgroup_hierarchy(struct mm_struct *mm, struct mem_cgroup *mem)
> > +{
> > +	struct mem_cgroup *curr;
> > +	int ret;
> > +
> > +	if (!mm)
> > +		return 0;
> > +	rcu_read_lock();
> > +	curr = mem_cgroup_from_task(mm->owner);
> 
> curr can be NULL ?
> 
Good Catch! You're right.

> > +	if (mem->use_hierarchy)
> > +		ret = css_is_ancestor(&curr->css, &mem->css);
> > +	else
> > +		ret = (curr == mem);
> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> > +
> 
> ...
> 
> > +void mem_cgroup_update_oom_jiffies(struct mem_cgroup *mem)
> > +{
> > +	struct mem_cgroup *cur;
> > +	struct cgroup_subsys_state *css;
> > +	int id, found;
> > +
> > +	if (!mem->use_hierarchy) {
> > +		mem->last_oom_jiffies = jiffies;
> > +		return;
> > +	}
> > +
> > +	id = 0;
> > +	rcu_read_lock();
> > +	while (1) {
> > +		css = css_get_next(&mem_cgroup_subsys, id, &mem->css, &found);
> > +		if (!css)
> > +			break;
> > +		if (css_tryget(css)) {
> > +			cur = container_of(css, struct mem_cgroup, css);
> > +			cur->last_oom_jiffies = jiffies;
> > +			css_put(css);
> > +		}
> > +		id = found + 1;
> > +	}
> > +	rcu_read_unlock();
> > +	return;
> 
> redundant "return"
> 
ok, will remove it.

Thank you.
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir
  2009-01-08  9:25 [RFC][PATCH] cgroup and memcg updates 20090108 KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2009-01-08  9:32 ` [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy KAMEZAWA Hiroyuki
@ 2009-01-08  9:35 ` KAMEZAWA Hiroyuki
  2009-01-14  2:48   ` Paul Menage
  3 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-08  9:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com, menage@google.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Experimental. you may think of better fix.

When trying following test under memcg.

	create memcg under
		/cgroup/A/  use_hierarchy=1, limit=20M
			 /B some tasks
			 /C empty
			 /D empty

And run make kernel under /B (for example). This will hit limit of 20M and
hierarchical memory reclaim will scan A->B->C->D.
(C,D have to be scanned because it may have some page caches.)

	Here, run following scipt.

	while true; do
		rmdir /cgroup/A/C
		mkdir /cgroup/A/C
	done

You'll see -EBUSY at rmdir very often. This is because of temporal refcnt of
memory reclaim for safe scanning under hierarchy.

In usual, considering shell script,
	"please try again if you see -EBUSY at rmdir.
	 You may see -EBUSY at rmdir even if it seems you can do it."
is very unuseful.

This patch tries to fix -EBUSY behavior. memcg's pre_destroy() works pretty
fine and retrying rmdir() is O.K.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 include/linux/cgroup.h |    5 +++++
 kernel/cgroup.c        |   32 +++++++++++++++++++++++++-------
 mm/memcontrol.c        |    9 ++++++---
 3 files changed, 36 insertions(+), 10 deletions(-)

Index: mmotm-2.6.28-Jan7/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.28-Jan7.orig/include/linux/cgroup.h
+++ mmotm-2.6.28-Jan7/include/linux/cgroup.h
@@ -368,6 +368,11 @@ struct cgroup_subsys {
 	int disabled;
 	int early_init;
 	/*
+	 * set if subsys may retrun EBUSY while there are no tasks and
+	 * subsys knows it's very temporal reference.
+ 	 */
+	int retry_at_rmdir_failure;
+	/*
 	 * True if this subsys uses ID. ID is not available before cgroup_init()
 	 * (not available in early_init time.)
 	 */
Index: mmotm-2.6.28-Jan7/kernel/cgroup.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/kernel/cgroup.c
+++ mmotm-2.6.28-Jan7/kernel/cgroup.c
@@ -2503,15 +2503,17 @@ static int cgroup_has_css_refs(struct cg
 
 /*
  * Atomically mark all (or else none) of the cgroup's CSS objects as
- * CSS_REMOVED. Return true on success, or false if the cgroup has
+ * CSS_REMOVED. Return 0 on success, or false error code if the cgroup has
  * busy subsystems. Call with cgroup_mutex held
  */
 
 static int cgroup_clear_css_refs(struct cgroup *cgrp)
 {
-	struct cgroup_subsys *ss;
+	struct cgroup_subsys *ss, *failedhere;
 	unsigned long flags;
 	bool failed = false;
+	int err = -EBUSY;
+
 	local_irq_save(flags);
 	for_each_subsys(cgrp->root, ss) {
 		struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -2521,6 +2523,7 @@ static int cgroup_clear_css_refs(struct 
 			refcnt = atomic_read(&css->refcnt);
 			if (refcnt > 1) {
 				failed = true;
+				failedhere = ss;
 				goto done;
 			}
 			BUG_ON(!refcnt);
@@ -2548,7 +2551,13 @@ static int cgroup_clear_css_refs(struct 
 		}
 	}
 	local_irq_restore(flags);
-	return !failed;
+
+	if (failed) {
+		if (failedhere->retry_at_rmdir_failure)
+			err = -EAGAIN;
+	} else
+		err = 0;
+	return err;
 }
 
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
@@ -2556,9 +2565,10 @@ static int cgroup_rmdir(struct inode *un
 	struct cgroup *cgrp = dentry->d_fsdata;
 	struct dentry *d;
 	struct cgroup *parent;
+	int ret;
 
 	/* the vfs holds both inode->i_mutex already */
-
+retry:
 	mutex_lock(&cgroup_mutex);
 	if (atomic_read(&cgrp->count) != 0) {
 		mutex_unlock(&cgroup_mutex);
@@ -2579,12 +2589,20 @@ static int cgroup_rmdir(struct inode *un
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
 
-	if (atomic_read(&cgrp->count)
-	    || !list_empty(&cgrp->children)
-	    || !cgroup_clear_css_refs(cgrp)) {
+	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
+	ret = cgroup_clear_css_refs(cgrp);
+	if (ret == -EBUSY) { /* really busy */
+		mutex_unlock(&cgroup_mutex);
+		return ret;
+	}
+	if (ret == -EAGAIN) { /* subsys asks us to retry later */
+		mutex_unlock(&cgroup_mutex);
+		cond_resched();
+		goto retry;
+	}
 
 	spin_lock(&release_list_lock);
 	set_bit(CGRP_REMOVED, &cgrp->flags);
Index: mmotm-2.6.28-Jan7/mm/memcontrol.c
===================================================================
--- mmotm-2.6.28-Jan7.orig/mm/memcontrol.c
+++ mmotm-2.6.28-Jan7/mm/memcontrol.c
@@ -723,7 +723,8 @@ static int mem_cgroup_hierarchical_recla
 {
 	struct mem_cgroup *victim;
 	unsigned long start_age;
-	int ret, total = 0;
+	int ret = 0;
+	int total = 0;
 	/*
 	 * Reclaim memory from cgroups under root_mem in round robin.
 	 */
@@ -732,8 +733,9 @@ static int mem_cgroup_hierarchical_recla
 	while (time_after((start_age + 2UL), root_mem->scan_age)) {
 		victim = mem_cgroup_select_victim(root_mem);
 		/* we use swappiness of local cgroup */
-		ret = try_to_free_mem_cgroup_pages(victim, gfp_mask, noswap,
-						   get_swappiness(victim));
+		if (victim->res.usage)
+			ret = try_to_free_mem_cgroup_pages(victim, gfp_mask,
+					noswap, get_swappiness(victim));
 		css_put(&victim->css);
 		total += ret;
 		if (mem_cgroup_check_under_limit(root_mem))
@@ -2261,6 +2263,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
+	.retry_at_rmdir_failure = 1,
 	.use_id = 1,
 };
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir
  2009-01-08  9:35 ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir KAMEZAWA Hiroyuki
@ 2009-01-14  2:48   ` Paul Menage
  2009-01-14  3:00     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2009-01-14  2:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Thu, Jan 8, 2009 at 1:35 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> +       if (ret == -EAGAIN) { /* subsys asks us to retry later */
> +               mutex_unlock(&cgroup_mutex);
> +               cond_resched();
> +               goto retry;
> +       }

This spinning worries me a bit. It might be better to do an
interruptible sleep until the relevant CSS's refcount goes down to
zero. And is there no way that the memory controller can hang on to a
reference indefinitely, if the cgroup still has some pages charged to
it?

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir
  2009-01-14  2:48   ` Paul Menage
@ 2009-01-14  3:00     ` KAMEZAWA Hiroyuki
  2009-01-14  3:05       ` Paul Menage
  0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-14  3:00 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Tue, 13 Jan 2009 18:48:43 -0800
Paul Menage <menage@google.com> wrote:

> On Thu, Jan 8, 2009 at 1:35 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > +       if (ret == -EAGAIN) { /* subsys asks us to retry later */
> > +               mutex_unlock(&cgroup_mutex);
> > +               cond_resched();
> > +               goto retry;
> > +       }
> 
> This spinning worries me a bit. It might be better to do an
> interruptible sleep until the relevant CSS's refcount goes down to
> zero. 

Hmm, add wait_queue to css and wake it up at css_put() ?

like this ?
==
__css_put()
{
	if (atomi_dec_return(&css->refcnt) == 1) {
		if (notify_on_release(cgrp) {
			.....
		}
		if (someone_waiting_rmdir(css)) {
			wake_up_him().
		}
	}
}
==

> And is there no way that the memory controller can hang on to a
> reference indefinitely, if the cgroup still has some pages charged to
> it?
> 
pre_destroy() is for that.  Now, If there are still references from "page"
after pre_destroy(), it's bug.
swap-in after pre_destory() may add new refs from pages.
(I implemented reference from "swap" to be memcg internal refcnt not to css.)

Allowing Ctrl-C/alarm() here by signal_pending() will be better, anyway.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir
  2009-01-14  3:00     ` KAMEZAWA Hiroyuki
@ 2009-01-14  3:05       ` Paul Menage
  2009-01-14  3:12         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2009-01-14  3:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Tue, Jan 13, 2009 at 7:00 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> Hmm, add wait_queue to css and wake it up at css_put() ?
>
> like this ?
> ==
> __css_put()
> {
>        if (atomi_dec_return(&css->refcnt) == 1) {
>                if (notify_on_release(cgrp) {
>                        .....
>                }
>                if (someone_waiting_rmdir(css)) {
>                        wake_up_him().
>                }
>        }
> }

Yes, something like that. A system-wide wake queue is probably fine though.

> pre_destroy() is for that.  Now, If there are still references from "page"
> after pre_destroy(), it's bug.

OK.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir
  2009-01-14  3:05       ` Paul Menage
@ 2009-01-14  3:12         ` KAMEZAWA Hiroyuki
  2009-01-20 10:47           ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2 KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-14  3:12 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Tue, 13 Jan 2009 19:05:35 -0800
Paul Menage <menage@google.com> wrote:

> On Tue, Jan 13, 2009 at 7:00 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > Hmm, add wait_queue to css and wake it up at css_put() ?
> >
> > like this ?
> > ==
> > __css_put()
> > {
> >        if (atomi_dec_return(&css->refcnt) == 1) {
> >                if (notify_on_release(cgrp) {
> >                        .....
> >                }
> >                if (someone_waiting_rmdir(css)) {
> >                        wake_up_him().
> >                }
> >        }
> > }
> 
> Yes, something like that. A system-wide wake queue is probably fine though.
> 
Ok, I'll try that.

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2
  2009-01-14  3:12         ` KAMEZAWA Hiroyuki
@ 2009-01-20 10:47           ` KAMEZAWA Hiroyuki
  2009-01-21 10:00             ` Paul Menage
  0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-20 10:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Paul Menage, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Wed, 14 Jan 2009 12:12:05 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 13 Jan 2009 19:05:35 -0800
> Paul Menage <menage@google.com> wrote:
> 
> > On Tue, Jan 13, 2009 at 7:00 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > Hmm, add wait_queue to css and wake it up at css_put() ?
> > >
> > > like this ?
> > > ==
> > > __css_put()
> > > {
> > >        if (atomi_dec_return(&css->refcnt) == 1) {
> > >                if (notify_on_release(cgrp) {
> > >                        .....
> > >                }
> > >                if (someone_waiting_rmdir(css)) {
> > >                        wake_up_him().
> > >                }
> > >        }
> > > }
> > 
> > Yes, something like that. A system-wide wake queue is probably fine though.
> > 
> Ok, I'll try that.
> 

I'm not testing this. any concerns ?
==
In following situation, with memory subsystem,

	/groupA use_hierarchy==1
		/01 some tasks
		/02 some tasks
		/03 some tasks
		/04 empty

When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
routine is triggered and the kernel walks tree under groupA.
Then, rmdir /groupA/04 fails with -EBUSY frequently because of temporal
refcnt from internal kernel.

In general. cgroup can be rmdir'd if there are no children groups and
no tasks. Frequent fails of rmdir() is not useful to users.
(And the reason for -EBUSY is unknown to users.....in most cases)

This patch tries to modify above behavior, by
	- retries if css_refcnt is got by someone.
	- add "return value" to pre_destroy() and allows subsystem to
	  say "we're really busy!"

Changelog: v1 -> v2.
	- added return value to pre_destroy().
	- removed modification to cgroup_subsys.
	- added signal_pending() check.
	- added waitqueue and avoid busy spin loop.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
Index: mmotm-2.6.29-Jan16/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.29-Jan16.orig/include/linux/cgroup.h
+++ mmotm-2.6.29-Jan16/include/linux/cgroup.h
@@ -128,6 +128,8 @@ enum {
 	CGRP_RELEASABLE,
 	/* Control Group requires release notifications to userspace */
 	CGRP_NOTIFY_ON_RELEASE,
+	/* Someone calls rmdir() and is wating for this cgroup is released */
+	CGRP_WAIT_ON_RMDIR,
 };
 
 struct cgroup {
@@ -350,7 +352,7 @@ int cgroup_is_descendant(const struct cg
 struct cgroup_subsys {
 	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
 						  struct cgroup *cgrp);
-	void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
+	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss,
 			  struct cgroup *cgrp, struct task_struct *tsk);
Index: mmotm-2.6.29-Jan16/kernel/cgroup.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/kernel/cgroup.c
+++ mmotm-2.6.29-Jan16/kernel/cgroup.c
@@ -141,6 +141,11 @@ static int notify_on_release(const struc
 	return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 }
 
+static int wakeup_on_rmdir(const struct cgroup *cgrp)
+{
+	return test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+}
+
 /*
  * for_each_subsys() allows you to iterate on each subsystem attached to
  * an active hierarchy
@@ -572,6 +577,7 @@ static struct backing_dev_info cgroup_ba
 static void populate_css_id(struct cgroup_subsys_state *id);
 static int alloc_css_id(struct cgroup_subsys *ss,
 			struct cgroup *parent, struct cgroup *child);
+static void cgroup_rmdir_wakeup_waiters(void);
 
 static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 {
@@ -591,13 +597,18 @@ static struct inode *cgroup_new_inode(mo
  * Call subsys's pre_destroy handler.
  * This is called before css refcnt check.
  */
-static void cgroup_call_pre_destroy(struct cgroup *cgrp)
+static int cgroup_call_pre_destroy(struct cgroup *cgrp)
 {
 	struct cgroup_subsys *ss;
+	int ret = 0;
+
 	for_each_subsys(cgrp->root, ss)
-		if (ss->pre_destroy)
-			ss->pre_destroy(ss, cgrp);
-	return;
+		if (ss->pre_destroy) {
+			ret = ss->pre_destroy(ss, cgrp);
+			if (ret)
+				break;
+		}
+	return ret;
 }
 
 static void free_cgroup_rcu(struct rcu_head *obj)
@@ -1283,6 +1294,10 @@ int cgroup_attach_task(struct cgroup *cg
 	set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
 	synchronize_rcu();
 	put_css_set(cg);
+
+	/* wake up rmdir() waiter....it should fail.*/
+	if (wakeup_on_rmdir(cgrp))
+		cgroup_rmdir_wakeup_waiters();
 	return 0;
 }
 
@@ -2446,6 +2461,8 @@ static long cgroup_create(struct cgroup 
 
 	mutex_unlock(&cgroup_mutex);
 	mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
+	if (wakeup_on_rmdir(parent))
+		cgroup_rmdir_wakeup_waiters();
 
 	return 0;
 
@@ -2561,14 +2578,23 @@ static int cgroup_clear_css_refs(struct 
 	return !failed;
 }
 
+DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
+
+static void cgroup_rmdir_wakeup_waiters(void)
+{
+	wake_up_all(&cgroup_rmdir_waitq);
+}
+
 static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cgroup *cgrp = dentry->d_fsdata;
 	struct dentry *d;
 	struct cgroup *parent;
+	DEFINE_WAIT(wait);
+	int ret;
 
 	/* the vfs holds both inode->i_mutex already */
-
+again:
 	mutex_lock(&cgroup_mutex);
 	if (atomic_read(&cgrp->count) != 0) {
 		mutex_unlock(&cgroup_mutex);
@@ -2580,21 +2606,42 @@ static int cgroup_rmdir(struct inode *un
 	}
 	mutex_unlock(&cgroup_mutex);
 
+	if (signal_pending(current))
+		return -EINTR;
 	/*
 	 * Call pre_destroy handlers of subsys. Notify subsystems
 	 * that rmdir() request comes.
 	 */
-	cgroup_call_pre_destroy(cgrp);
-
+	ret = cgroup_call_pre_destroy(cgrp);
+	if (ret == -EBUSY)
+		return -EBUSY;
 	mutex_lock(&cgroup_mutex);
 	parent = cgrp->parent;
-
-	if (atomic_read(&cgrp->count)
-	    || !list_empty(&cgrp->children)
-	    || !cgroup_clear_css_refs(cgrp)) {
+	if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
 		mutex_unlock(&cgroup_mutex);
 		return -EBUSY;
 	}
+	/*
+	 * css_put/get is provided for subsys to grab refcnt to css. In typical
+	 * case, subsystem has no referenece after pre_destroy(). But,
+	 * considering hierarchy management, some *temporal* refcnt can be hold.
+	 * To avoid returning -EBUSY, cgroup_rmdir_waitq is used. If subsys
+	 * is really busy, it should return -EBUSY at pre_destroy(). wake_up
+	 * is called when css_put() is called and it seems ready to rmdir().
+	 */
+	set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+	prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
+
+	if (!cgroup_clear_css_refs(cgrp)) {
+		mutex_unlock(&cgroup_mutex);
+		schedule();
+		finish_wait(&cgroup_rmdir_waitq, &wait);
+		clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
+		goto again;
+	}
+	/* NO css_tryget() can success after here. */
+	finish_wait(&cgroup_rmdir_waitq, &wait);
+	clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
 
 	spin_lock(&release_list_lock);
 	set_bit(CGRP_REMOVED, &cgrp->flags);
@@ -3150,10 +3197,13 @@ void __css_put(struct cgroup_subsys_stat
 {
 	struct cgroup *cgrp = css->cgroup;
 	rcu_read_lock();
-	if ((atomic_dec_return(&css->refcnt) == 1) &&
-	    notify_on_release(cgrp)) {
-		set_bit(CGRP_RELEASABLE, &cgrp->flags);
-		check_for_release(cgrp);
+	if (atomic_dec_return(&css->refcnt) == 1) {
+		if (notify_on_release(cgrp)) {
+			set_bit(CGRP_RELEASABLE, &cgrp->flags);
+			check_for_release(cgrp);
+		}
+		if (wakeup_on_rmdir(cgrp))
+			cgroup_rmdir_wakeup_waiters();
 	}
 	rcu_read_unlock();
 }
Index: mmotm-2.6.29-Jan16/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Jan16.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Jan16/mm/memcontrol.c
@@ -2381,11 +2381,13 @@ free_out:
 	return ERR_PTR(error);
 }
 
-static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
+static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
 					struct cgroup *cont)
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+
 	mem_cgroup_force_empty(mem, false);
+	return 0;
 }
 
 static void mem_cgroup_destroy(struct cgroup_subsys *ss,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2
  2009-01-20 10:47           ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2 KAMEZAWA Hiroyuki
@ 2009-01-21 10:00             ` Paul Menage
  2009-01-21 10:32               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2009-01-21 10:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Tue, Jan 20, 2009 at 2:47 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>        CGRP_NOTIFY_ON_RELEASE,
> +       /* Someone calls rmdir() and is wating for this cgroup is released */

/* A thread is in rmdir() waiting to destroy this cgroup */

Also document that it can only be set/cleared when you're holding the
inode_sem for the cgroup directory. And we should probably move this
enum inside cgroup.c, since nothing in the header file uses it.

> +       CGRP_WAIT_ON_RMDIR,
>  };

>
>  struct cgroup {
> @@ -350,7 +352,7 @@ int cgroup_is_descendant(const struct cg
>  struct cgroup_subsys {
>        struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
>                                                  struct cgroup *cgrp);
> -       void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> +       int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);

Can you update the documentation to indicate what an error result from
pre_destroy indicates? Can pre_destroy() be called multiple times for
the same subsystem/cgroup?

> +
> +       /* wake up rmdir() waiter....it should fail.*/

/* Wake up rmdir() waiter - the rmdir should fail since the cgroup is
no longer empty */

But is this safe? If we do a pre-destroy, is it OK to let new tasks
into the cgroup?

> @@ -2446,6 +2461,8 @@ static long cgroup_create(struct cgroup
>
>        mutex_unlock(&cgroup_mutex);
>        mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
> +       if (wakeup_on_rmdir(parent))
> +               cgroup_rmdir_wakeup_waiters();

I don't think that there can be a waiter, since rmdir() would hold the
parent's inode semaphore, which would block this thread before it gets
to cgroup_create()

> +DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> +
> +static void cgroup_rmdir_wakeup_waiters(void)
> +{
> +       wake_up_all(&cgroup_rmdir_waitq);
> +}
> +

I think you can merge wakeup_on_rmdir() and
cgroup_rmdir_wakeup_waiters() into a single function,
cgroup_wakeup_rmdir(struct cgroup *)


>
> +       if (signal_pending(current))
> +               return -EINTR;

I think it would be better to move this check to after we've already
failed on cgroup_clear_css_refs(). That way we can't fail with an
EINTR just because we raced with a signal on the way into rmdir() - we
have to actually hit the EBUSY and try to sleep.
> +       ret = cgroup_call_pre_destroy(cgrp);
> +       if (ret == -EBUSY)
> +               return -EBUSY;

What about other potential error codes? If the subsystem's only
allowed to return 0 or EBUSY, then we should check for that.

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2
  2009-01-21 10:00             ` Paul Menage
@ 2009-01-21 10:32               ` KAMEZAWA Hiroyuki
  2009-01-21 10:43                 ` Paul Menage
  0 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-21 10:32 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Wed, 21 Jan 2009 02:00:56 -0800
Paul Menage <menage@google.com> wrote:

> On Tue, Jan 20, 2009 at 2:47 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >        CGRP_NOTIFY_ON_RELEASE,
> > +       /* Someone calls rmdir() and is wating for this cgroup is released */
> 
> /* A thread is in rmdir() waiting to destroy this cgroup */
> 
> Also document that it can only be set/cleared when you're holding the
> inode_sem for the cgroup directory. And we should probably move this
> enum inside cgroup.c, since nothing in the header file uses it.
> 
> > +       CGRP_WAIT_ON_RMDIR,
> >  };

Hmm, ok. move this all enum to cgroup.c ?


> 
> >
> >  struct cgroup {
> > @@ -350,7 +352,7 @@ int cgroup_is_descendant(const struct cg
> >  struct cgroup_subsys {
> >        struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
> >                                                  struct cgroup *cgrp);
> > -       void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> > +       int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
> 
> Can you update the documentation to indicate what an error result from
> pre_destroy indicates? Can pre_destroy() be called multiple times for
> the same subsystem/cgroup?
> 

yes, after this, memcg will return -EBUSY in some special cases.
(patches are on my stack.)
We'll have -EBUSY situation especially on swap-less system.



> > +
> > +       /* wake up rmdir() waiter....it should fail.*/
> 
> /* Wake up rmdir() waiter - the rmdir should fail since the cgroup is
> no longer empty */
> 
> But is this safe? If we do a pre-destroy, is it OK to let new tasks
> into the cgroup?
> 
Current memcg allows it. (so, I removed "obsolete" flag in memcg and asked
you to add css_tryget().)



> > @@ -2446,6 +2461,8 @@ static long cgroup_create(struct cgroup
> >
> >        mutex_unlock(&cgroup_mutex);
> >        mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
> > +       if (wakeup_on_rmdir(parent))
> > +               cgroup_rmdir_wakeup_waiters();
> 
> I don't think that there can be a waiter, since rmdir() would hold the
> parent's inode semaphore, which would block this thread before it gets
> to cgroup_create()
> 
Oh, I see. I missed that. I'll remove this.


> > +DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> > +
> > +static void cgroup_rmdir_wakeup_waiters(void)
> > +{
> > +       wake_up_all(&cgroup_rmdir_waitq);
> > +}
> > +
> 
> I think you can merge wakeup_on_rmdir() and
> cgroup_rmdir_wakeup_waiters() into a single function,
> cgroup_wakeup_rmdir(struct cgroup *)
> 
will try.


> 
> >
> > +       if (signal_pending(current))
> > +               return -EINTR;
> 
> I think it would be better to move this check to after we've already
> failed on cgroup_clear_css_refs(). That way we can't fail with an
> EINTR just because we raced with a signal on the way into rmdir() - we
> have to actually hit the EBUSY and try to sleep.

Ok, will move.


> > +       ret = cgroup_call_pre_destroy(cgrp);
> > +       if (ret == -EBUSY)
> > +               return -EBUSY;
> 
> What about other potential error codes? If the subsystem's only
> allowed to return 0 or EBUSY, then we should check for that.
> 

Hmm, subsystem may return -EPERM or some..
I'll change this to

 if (!ret)
    return ret;

Thank you for review. very helpful.
I'll consider more.

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2
  2009-01-21 10:32               ` KAMEZAWA Hiroyuki
@ 2009-01-21 10:43                 ` Paul Menage
  2009-01-21 10:45                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2009-01-21 10:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Wed, Jan 21, 2009 at 2:32 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> Hmm, subsystem may return -EPERM or some..
> I'll change this to
>
>  if (!ret)
>    return ret;

You mean

if (ret)
  return ret;

Paul

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2
  2009-01-21 10:43                 ` Paul Menage
@ 2009-01-21 10:45                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-01-21 10:45 UTC (permalink / raw)
  To: Paul Menage
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
	lizf@cn.fujitsu.com

On Wed, 21 Jan 2009 02:43:44 -0800
Paul Menage <menage@google.com> wrote:

> On Wed, Jan 21, 2009 at 2:32 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > Hmm, subsystem may return -EPERM or some..
> > I'll change this to
> >
> >  if (!ret)
> >    return ret;
> 
> You mean
> 
> if (ret)
>   return ret;
> 
Yes. (>_<!

-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2009-01-21 10:46 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-01-08  9:25 [RFC][PATCH] cgroup and memcg updates 20090108 KAMEZAWA Hiroyuki
2009-01-08  9:28 ` [RFC][PATCH 1/4] cgroup: support per cgroup subsys state ID (CSS ID) KAMEZAWA Hiroyuki
2009-01-09  3:59   ` Li Zefan
2009-01-09  4:24     ` KAMEZAWA Hiroyuki
2009-01-10  0:23   ` Paul Menage
2009-01-10  0:49     ` KAMEZAWA Hiroyuki
2009-01-12  7:21   ` Balbir Singh
2009-01-15  6:12     ` KAMEZAWA Hiroyuki
2009-01-13  7:40   ` Li Zefan
2009-01-13  9:22     ` KAMEZAWA Hiroyuki
2009-01-08  9:30 ` [RFC][PATCH 2/4] memcg: use CSS ID in memcg KAMEZAWA Hiroyuki
2009-01-12 12:14   ` Balbir Singh
2009-01-15  6:19     ` KAMEZAWA Hiroyuki
2009-01-08  9:32 ` [RFC][PATCH 3/4] memcg: fix OOM KILL under hierarchy KAMEZAWA Hiroyuki
2009-01-13  8:33   ` Li Zefan
2009-01-13  9:25     ` KAMEZAWA Hiroyuki
2009-01-08  9:35 ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir KAMEZAWA Hiroyuki
2009-01-14  2:48   ` Paul Menage
2009-01-14  3:00     ` KAMEZAWA Hiroyuki
2009-01-14  3:05       ` Paul Menage
2009-01-14  3:12         ` KAMEZAWA Hiroyuki
2009-01-20 10:47           ` [RFC][PATCH 4/4] cgroup-memcg fix frequent EBUSY at rmdir v2 KAMEZAWA Hiroyuki
2009-01-21 10:00             ` Paul Menage
2009-01-21 10:32               ` KAMEZAWA Hiroyuki
2009-01-21 10:43                 ` Paul Menage
2009-01-21 10:45                   ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).