[RFC][PATCH 0/2] memcg: simple hierarchy (v2)

* [RFC][PATCH 0/2] memcg: simple hierarchy (v2)
@ 2008-05-30  1:43 KAMEZAWA Hiroyuki
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  1:43 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: LKML, balbir@linux.vnet.ibm.com, xemul@openvz.org,
	menage@google.com, yamamoto@valinux.co.jp, lizf@cn.fujitsu.com

This is a rewritten version of the memcg hierarchy handling.
...and I'm sorry for the tons of typos in v1.

Changelog:
  - fixed typo.
  - removed meaningless params (borrow)
  - renamed structure members.

Not for testing, just for discussion. (I'll rewrite it once our direction is settled.)

Implemented Policy:
  - parent overcommits all children
     parent->usage = resource used by itself + resource moved to children.
     Of course, parent->limit > parent->usage.
  - when a child's limit is set, the resource moves.
  - no automatic resource moving between parent <-> child

Example)
  1) Assume a cgroup with a 1GB limit (and no tasks belong to it for now).
     - group_A limit=1G,usage=0M.

  2) create group B, C under A.
     - group A limit=1G, usage=0M
          - group B limit=0M, usage=0M.
          - group C limit=0M, usage=0M.

  3) increase group B's limit to 300M.
     - group A limit=1G, usage=300M.
          - group B limit=300M, usage=0M.
          - group C limit=0M, usage=0M.

  4) increase group C's limit to 500M
     - group A limit=1G, usage=800M.
          - group B limit=300M, usage=0M.
          - group C limit=500M, usage=0M.

  5) reduce group B's limit to 100M
     - group A limit=1G, usage=600M.
          - group B limit=100M, usage=0M.
          - group C limit=500M, usage=0M.
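
The five steps above can be modeled in user space. The following is a minimal
sketch, not the kernel code: `struct counter` and the `model_*` functions are
hypothetical names invented for this illustration, values are in MB, and
locking, callbacks and retries are left out. It uses the same strict comparison
as res_counter_borrow_resource() in patch 1/2.

```c
#include <assert.h>

/* Hypothetical user-space model; not the kernel's struct res_counter. */
struct counter {
	unsigned long long limit;        /* MB in this example */
	unsigned long long usage;        /* own usage + amount moved to children */
	unsigned long long for_children; /* amount moved to children */
};

/* Move 'val' from parent to child. Returns 0 on success, 1 on failure. */
static int model_borrow(struct counter *child, struct counter *parent,
			unsigned long long val)
{
	/* Strict comparison: the parent must stay below its limit. */
	if (parent->usage + val >= parent->limit)
		return 1;
	parent->usage += val;
	parent->for_children += val;
	child->limit += val;
	return 0;
}

/* Return 'val' from child to parent. Returns 0 on success, 1 on failure. */
static int model_repay(struct counter *child, struct counter *parent,
		       unsigned long long val)
{
	/* The child may only give back what it is not using. */
	if (child->usage + val > child->limit)
		return 1;
	child->limit -= val;
	/* Stand-in for the BUG_ON() checks in the patch. */
	assert(parent->for_children >= val && parent->usage >= val);
	parent->for_children -= val;
	parent->usage -= val;
	return 0;
}

/* Set a child's limit by borrowing or repaying the difference. */
static int model_set_limit(struct counter *child, struct counter *parent,
			   unsigned long long new_limit)
{
	if (new_limit > child->limit)
		return model_borrow(child, parent, new_limit - child->limit);
	return model_repay(child, parent, child->limit - new_limit);
}
```

Starting from A with limit 1024 and empty B and C, calling model_set_limit()
with B=300, then C=500, then B=100 leaves A's usage at 300, 800 and 600, which
matches steps 3) to 5).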

Why is this enough?
  - Middleware can do various kinds of resource balancing just by resetting "limit"
    from userland.


TODO(maybe)
  - rewrite force_empty to move the resource to the parent.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-30  1:43 [RFC][PATCH 0/2] memcg: simple hierarchy (v2) KAMEZAWA Hiroyuki
@ 2008-05-30  1:45 ` KAMEZAWA Hiroyuki
  2008-05-30 22:20   ` Balbir Singh
                     ` (3 more replies)
  2008-05-30  1:46 ` [RFC][PATCH 2/2] memcg: memcg hierarchy KAMEZAWA Hiroyuki
  2008-05-30  1:46 ` [RFC][PATCH 0/2] memcg: simple hierarchy (v2) Rik van Riel
  2 siblings, 4 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  1:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, LKML, balbir@linux.vnet.ibm.com,
	xemul@openvz.org, menage@google.com, yamamoto@valinux.co.jp,
	lizf@cn.fujitsu.com

This patch tries to implement a _simple_ 'hierarchy policy' in res_counter.

While several hierarchy policies can be considered, this patch implements
a simple one:
   - the parent includes, over-commits the child
   - there are no shared resources
   - dynamic hierarchy resource usage management in the kernel is not necessary

It works as follows.

 1. create a child. set default child limits to be 0.
 2. set limit to child.
    2-a. before setting limit to child, prepare enough room in parent.
    2-b. increase 'usage' of parent by child's limit.
 3. the child sets its limit to the value moved from the parent.
    the parent remembers the amount of resource moved to its children.

 The above means that
	- a directory's usage implies the sum of all subdirectories' limits +
          its own usage.
	- there are no shared resources between parent <-> child.

 Pros.
  - simple and easy policy.
  - no hierarchy overhead.
  - no resource sharing between child <-> parent. very suitable for multilevel
    resource isolation.
 Cons.
  - not suitable for implementing some kind of _intelligent_ hierarchy
    balancing in the _kernel_.
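
Step 2-a (preparing room in the parent) amounts to a bounded retry loop around
a shrink callback. Below is a minimal user-space sketch of that idea. The names
`struct pool`, `reserve_with_retry` and `shrink_100` are hypothetical, and the
locking and yield() that the kernel version needs are omitted.

```c
#include <stddef.h>

/* Hypothetical model type; illustrative only. */
struct pool {
	unsigned long long limit;
	unsigned long long usage;
};

typedef int (*shrink_fn)(struct pool *, unsigned long long);

/*
 * Try to reserve 'val' in 'p'. Between attempts, call 'shrink' to free room.
 * 'retry' bounds the number of shrink calls; a negative value means no bound.
 * Returns 0 on success, 1 on failure.
 */
static int reserve_with_retry(struct pool *p, unsigned long long val,
			      shrink_fn shrink, int retry)
{
	for (;;) {
		if (p->usage + val < p->limit) {
			p->usage += val;
			return 0;
		}
		if (!retry || !shrink)
			return 1;
		if (retry > 0)
			--retry;
		shrink(p, val);
	}
}

/* Example callback: frees at most 100 units per call, ignoring 'val'. */
static int shrink_100(struct pool *p, unsigned long long val)
{
	unsigned long long freed = p->usage < 100 ? p->usage : 100;

	(void)val;
	p->usage -= freed;
	return freed == 0; /* !0 means nothing could be freed */
}
```

With a NULL callback the function fails immediately when there is no room,
which mirrors the "If no callback(NULL), no retry" rule in the header comment
of patch 1/2.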

Changelog:
  -removed borrow.
  -fixed tons of typos.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/controllers/resource_counter.txt |   28 +++++
 include/linux/res_counter.h                    |   72 +++++++++++++
 kernel/res_counter.c                           |  130 +++++++++++++++++++++++--
 3 files changed, 222 insertions(+), 8 deletions(-)

Index: hie-2.6.26-rc2-mm1/include/linux/res_counter.h
===================================================================
--- hie-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
+++ hie-2.6.26-rc2-mm1/include/linux/res_counter.h
@@ -39,6 +39,11 @@ struct res_counter {
 	 */
 	unsigned long long failcnt;
 	/*
+	 * the sum of all resource which is assigned to children.
+	 */
+	unsigned long long for_children;
+
+	/*
 	 * the lock to protect all of the above.
 	 * the routines below consider this to be IRQ-safe
 	 */
@@ -57,6 +62,7 @@ struct res_counter {
  * @nbytes:  its size...
  * @pos:     and the offset.
  */
+typedef int (*res_resize_callback_t)(struct res_counter *, unsigned long long);
 
 u64 res_counter_read_u64(struct res_counter *counter, int member);
 
@@ -65,8 +71,55 @@ ssize_t res_counter_read(struct res_coun
 		int (*read_strategy)(unsigned long long val, char *s));
 ssize_t res_counter_write(struct res_counter *counter, int member,
 		const char __user *buf, size_t nbytes, loff_t *pos,
-		int (*write_strategy)(char *buf, unsigned long long *val));
+		int (*write_strategy)(char *buf, unsigned long long *val),
+		res_resize_callback_t callback);
+
+/**
+ * Move resource from parent to child and set child's limit.
+ * By this,
+ *          res->usage of the parent increased by 'val'
+ *          res->for_children of the parent increased by 'val'
+ *          res->limit of the child increased by 'val'
+ *
+ * @child:    an entity to set res->limit.
+ * @parent:   parent of child and source of resource.
+ * @val:      How much does child want to move from parent ?
+ * @callback: A callback for making resource to allow this moving, called
+ *            against parent. The callback should return 0 on success,
+ *	      !0 on failure. _No_ lock is held while the callback is
+ *	      called. If no callback(NULL), no retry.
+ * @retry:    # of retries at calling callback for making resource.
+ *            -1 means infinite loop. At each retry, yield() is called.
+ *
+ * Returns 0 if success. !0 at failure.
+ *
+ */
+typedef int (*res_shrink_callback_t)(struct res_counter*, unsigned long long);
 
+int res_counter_borrow_resource(struct res_counter *child,
+				struct res_counter *parent,
+				unsigned long long val,
+				res_shrink_callback_t callback, int retry);
+/**
+ * Return resource to its parent.
+ * By this,
+ *           res->usage of the parent is decreased by 'val'
+ *           res->for_children of the parent is decreased by 'val'
+ *           res->limit of the child is decreased by 'val'
+ *
+ * @child:   entry to resize.
+ * @parent:  resource will be moved back to this.
+ * @val  :   How much does child repay to parent ? -1 means 'all'
+ * @callback: A callback for decreasing resource usage of child before
+ *            moving. If NULL, just decreases child's limit.
+ * @retry:   # of retries at calling callback for freeing resource.
+ *            -1 means infinite loop. At each retry, yield() is called.
+ * Returns 0 at success.
+ */
+int res_counter_repay_resource(struct res_counter *child,
+				struct res_counter *parent,
+				unsigned long long val,
+				res_shrink_callback_t callback, int retry);
 /*
  * the field descriptors. one for each member of res_counter
  */
@@ -76,6 +129,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_FOR_CHILDREN,
 };
 
 /*
@@ -104,6 +158,11 @@ enum charge_code __must_check res_counte
 
 enum charge_code __must_check res_counter_charge(struct res_counter *counter,
 		unsigned long val);
+/*
+ * Compare usage with ->borrow member instead of ->limit.
+ */
+int __must_check res_counter_charge_borrow(struct res_counter *counter,
+		unsigned long val);
 
 /*
  * uncharge - tell that some portion of the resource is released
@@ -158,4 +217,15 @@ static inline void res_counter_reset_fai
 	cnt->failcnt = 0;
 	spin_unlock_irqrestore(&cnt->lock, flags);
 }
+
+/*
+ * should be called only after cgroup creation.
+ */
+static inline void res_counter_zero_limit(struct res_counter *cnt)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->limit = 0;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+}
 #endif
Index: hie-2.6.26-rc2-mm1/kernel/res_counter.c
===================================================================
--- hie-2.6.26-rc2-mm1.orig/kernel/res_counter.c
+++ hie-2.6.26-rc2-mm1/kernel/res_counter.c
@@ -76,6 +76,8 @@ res_counter_member(struct res_counter *c
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_FOR_CHILDREN:
+		return &counter->for_children;
 	};
 
 	BUG();
@@ -106,7 +108,8 @@ u64 res_counter_read_u64(struct res_coun
 
 ssize_t res_counter_write(struct res_counter *counter, int member,
 		const char __user *userbuf, size_t nbytes, loff_t *pos,
-		int (*write_strategy)(char *st_buf, unsigned long long *val))
+		int (*write_strategy)(char *st_buf, unsigned long long *val),
+		res_resize_callback_t callback)
 {
 	int ret;
 	char *buf, *end;
@@ -135,13 +138,118 @@ ssize_t res_counter_write(struct res_cou
 		if (*end != '\0')
 			goto out_free;
 	}
-	spin_lock_irqsave(&counter->lock, flags);
-	val = res_counter_member(counter, member);
-	*val = tmp;
-	spin_unlock_irqrestore(&counter->lock, flags);
-	ret = nbytes;
+	if (member != RES_LIMIT || !callback) {
+		spin_lock_irqsave(&counter->lock, flags);
+		val = res_counter_member(counter, member);
+		*val = tmp;
+		spin_unlock_irqrestore(&counter->lock, flags);
+		ret = nbytes;
+	} else {
+		/* call a callback for hierarchy management */
+		ret = callback(counter, tmp);
+		if (!ret)
+			ret = nbytes;
+	}
+
 out_free:
 	kfree(buf);
 out:
 	return ret;
 }
+
+/*
+ * This tries to move 'val' resource from parent. At success,
+ * child->limit += val.
+ * parent->for_children += val.
+ * parent->usage += val.
+ */
+
+int res_counter_borrow_resource(struct res_counter *child,
+				struct res_counter *parent,
+				unsigned long long val,
+				res_shrink_callback_t callback, int retry)
+{
+	int success = 0;
+	unsigned long flags;
+
+	/* Borrow resource from parent */
+	success = 0;
+	while (1) {
+		/* res_counter_charge just handles 'long' value...*/
+		spin_lock_irqsave(&parent->lock, flags);
+		if (parent->usage + val < parent->limit) {
+			parent->usage += val;
+			parent->for_children += val;
+			success = 1;
+		}
+		spin_unlock_irqrestore(&parent->lock, flags);
+		if (success)
+			break;
+		if (!retry || !callback)
+			goto fail;
+		if (retry > 0)
+			--retry;
+		yield();
+		callback(parent, val);
+	}
+	/* ok, we successfully got enough resource. */
+	spin_lock_irqsave(&child->lock, flags);
+	child->limit += val;
+	spin_unlock_irqrestore(&child->lock, flags);
+	return 0;
+fail:
+	return 1;
+}
+
+/*
+ * Move resource to its parent.
+ *   child->limit -= val.
+ *   parent->usage -= val.
+ *   parent->for_children -= val.
+ */
+
+int res_counter_repay_resource(struct res_counter *child,
+				struct res_counter *parent,
+				unsigned long long val,
+				res_shrink_callback_t callback, int retry)
+{
+	unsigned long flags;
+	int done = 0;
+	/* Enough resources ? */
+	while (1) {
+		spin_lock_irqsave(&child->lock, flags);
+
+		if (val == (unsigned long long)-1) {
+			val = child->limit;
+			child->limit = 0;
+			done = 1;
+		} else if (child->usage + val <= child->limit) {
+			child->limit -= val;
+			done = 1;
+		}
+		spin_unlock_irqrestore(&child->lock, flags);
+		if (done)
+			break;
+		if (!retry-- || !callback)
+			goto fail;
+		/*
+		 * we want to rest somewhere but right after callback is
+		 * not good place. So rest here.
+		 */
+		yield();
+		/* reduce resource usage */
+		callback(child, val);
+	}
+
+	/* ok, the child has released enough resource. */
+	spin_lock_irqsave(&parent->lock, flags);
+	BUG_ON(parent->for_children < val);
+	BUG_ON(parent->usage < val);
+	parent->for_children -= val;
+	parent->usage -= val;
+	spin_unlock_irqrestore(&parent->lock, flags);
+
+	return 0;
+fail:
+	return 1;
+}
Index: hie-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
===================================================================
--- hie-2.6.26-rc2-mm1.orig/Documentation/controllers/resource_counter.txt
+++ hie-2.6.26-rc2-mm1/Documentation/controllers/resource_counter.txt
@@ -39,10 +39,13 @@ to work with it.
  	The failcnt stands for "failures counter". This is the number of
 	resource allocation attempts that failed.
 
- c. spinlock_t lock
+ e. spinlock_t lock
 
  	Protects changes of the above values.
 
+ f. for_children
+        The amount of resource moved to the children.
+
 
 
 2. Basic accounting routines
@@ -179,3 +182,24 @@ counter fields. They are recommended to 
     still can help with it).
 
  c. Compile and run :)
+
+6. Hierarchy Model
+   1) simple isolation hierarchy.
+   res_counter supports a simple hierarchy model in which the child's resource
+   is moved from its parent.
+
+   When a limit is set on a child, its parent's usage increases by the
+   amount of the limit, i.e. the child borrows resource from its parent when
+   its limit is set.
+
+   Hierarchical cgroups are very useful when you implement hierarchical
+   resource isolation. For example,
+   A) admin - user
+   - system admin layer ....the first level
+	- user layer    ....the second level for user A, B, C
+   B) application/service layer.
+   -  application layer ... the first level
+        - service layer ... the second level for service Gold, Silver,...
+
+  see res_counter_borrow_resource() and res_counter_repay_resource().
+


* [RFC][PATCH 2/2] memcg: memcg hierarchy
  2008-05-30  1:43 [RFC][PATCH 0/2] memcg: simple hierarchy (v2) KAMEZAWA Hiroyuki
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
@ 2008-05-30  1:46 ` KAMEZAWA Hiroyuki
  2008-05-30  1:46 ` [RFC][PATCH 0/2] memcg: simple hierarchy (v2) Rik van Riel
  2 siblings, 0 replies; 14+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-05-30  1:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, LKML, balbir@linux.vnet.ibm.com,
	xemul@openvz.org, menage@google.com, yamamoto@valinux.co.jp,
	lizf@cn.fujitsu.com

hierarchy support for memcg.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 mm/memcontrol.c |   90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 89 insertions(+), 1 deletion(-)

Index: hie-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- hie-2.6.26-rc2-mm1.orig/mm/memcontrol.c
+++ hie-2.6.26-rc2-mm1/mm/memcontrol.c
@@ -792,6 +792,78 @@ int mem_cgroup_shrink_usage(struct mm_st
 }
 
 /*
+ * Memory Controller hierarchy support.
+ */
+
+int memcg_shrink_callback(struct res_counter *cnt, unsigned long long val)
+{
+	struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
+	unsigned long flags;
+	int ret = 1;
+	int progress = 1;
+
+retry:
+	spin_lock_irqsave(&cnt->lock, flags);
+	/* Need to shrink ? */
+	if (cnt->usage + val <= cnt->limit)
+		ret = 0;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+
+	if (!ret)
+		return 0;
+
+	if (!progress)
+		return 1;
+	progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
+
+	goto retry;
+}
+
+
+int mem_cgroup_resize_callback(struct res_counter *cnt, unsigned long long val)
+{
+	struct mem_cgroup *child = container_of(cnt, struct mem_cgroup, res);
+	struct mem_cgroup *parent;
+	struct cgroup *my_cg;
+	unsigned long flags, borrow;
+	unsigned long long diffs;
+	int ret = 0;
+
+	my_cg = child->css.cgroup;
+	/* Is this root group ? */
+	if (!my_cg->parent) {
+		spin_lock_irqsave(&cnt->lock, flags);
+		cnt->limit = val;
+		spin_unlock_irqrestore(&cnt->lock, flags);
+		return 0;
+	}
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (val > cnt->limit) {
+		diffs = val - cnt->limit;
+		borrow = 1;
+	} else {
+		diffs = cnt->limit - val;
+		borrow = 0;
+	}
+	spin_unlock_irqrestore(&cnt->lock, flags);
+
+	parent = mem_cgroup_from_cont(my_cg->parent);
+	/* When we increase resource, call borrow. When decrease, call repay*/
+	if (borrow)
+		ret = res_counter_borrow_resource(cnt, &parent->res, diffs,
+					memcg_shrink_callback, 5);
+	else
+		ret = res_counter_repay_resource(cnt, &parent->res, diffs,
+					memcg_shrink_callback, 5);
+	return ret;
+}
+
+
+
+
+
+
+/*
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
@@ -898,7 +970,8 @@ static ssize_t mem_cgroup_write(struct c
 {
 	return res_counter_write(&mem_cgroup_from_cont(cont)->res,
 				cft->private, userbuf, nbytes, ppos,
-				mem_cgroup_write_strategy);
+				mem_cgroup_write_strategy,
+				mem_cgroup_resize_callback);
 }
 
 static int mem_cgroup_reset(struct cgroup *cont, unsigned int event)
@@ -992,6 +1065,11 @@ static struct cftype mem_cgroup_files[] 
 		.name = "stat",
 		.read_map = mem_control_stat_show,
 	},
+	{
+		.name = "assigned_to_child",
+		.private = RES_FOR_CHILDREN,
+		.read_u64 = mem_cgroup_read,
+	},
 };
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -1069,6 +1147,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	}
 
 	res_counter_init(&mem->res);
+	if (cont->parent)
+		res_counter_zero_limit(&mem->res);
 
 	for_each_node_state(node, N_POSSIBLE)
 		if (alloc_mem_cgroup_per_zone_info(mem, node))
@@ -1095,6 +1175,14 @@ static void mem_cgroup_destroy(struct cg
 {
 	int node;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+	struct mem_cgroup *parent;
+
+	if (cont->parent) {
+		parent = mem_cgroup_from_cont(cont->parent);
+		/* we did what we could...just return what we borrowed */
+		res_counter_repay_resource(&mem->res,
+				&parent->res, -1, NULL, 0);
+	}
 
 	for_each_node_state(node, N_POSSIBLE)
 		free_mem_cgroup_per_zone_info(mem, node);


* Re: [RFC][PATCH 0/2] memcg: simple hierarchy (v2)
  2008-05-30  1:43 [RFC][PATCH 0/2] memcg: simple hierarchy (v2) KAMEZAWA Hiroyuki
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
  2008-05-30  1:46 ` [RFC][PATCH 2/2] memcg: memcg hierarchy KAMEZAWA Hiroyuki
@ 2008-05-30  1:46 ` Rik van Riel
  2 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2008-05-30  1:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, LKML, balbir@linux.vnet.ibm.com,
	xemul@openvz.org, menage@google.com, yamamoto@valinux.co.jp,
	lizf@cn.fujitsu.com

On Fri, 30 May 2008 10:43:12 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Implemented Policy:
>   - parent overcommits all children
>      parent->usage = resource used by itself + resource moved to children.
>      Of course, parent->limit > parent->usage.
>   - when a child's limit is set, the resource moves.
>   - no automatic resource moving between parent <-> child

> Why is this enough?
>   - Middleware can do various kinds of resource balancing just by resetting "limit"
>     from userland.

I like this idea.  The alternative could mean having a page live
on multiple cgroup LRU lists, not just the zone LRU and the one
cgroup LRU, and drastically increasing run time overhead.

Swapping memory in and out is horrendously slow anyway, so the
idea of having a daemon adjust the limits on the fly should work
just fine.

-- 
All rights reversed.


* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
@ 2008-05-30 22:20   ` Balbir Singh
  2008-05-31  1:59   ` kamezawa.hiroyu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Balbir Singh @ 2008-05-30 22:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, LKML, xemul@openvz.org, menage@google.com,
	yamamoto@valinux.co.jp, lizf@cn.fujitsu.com

KAMEZAWA Hiroyuki wrote:
> This patch tries to implement a _simple_ 'hierarchy policy' in res_counter.
> 
> While several hierarchy policies can be considered, this patch implements
> a simple one:
>    - the parent includes, over-commits the child
>    - there are no shared resources

I am not sure if this is desirable. The concept of a hierarchy applies really
well when there are shared resources.

>    - dynamic hierarchy resource usage management in the kernel is not necessary
> 

Could you please elaborate as to why? I am not sure I understand your point

> works as following.
> 
>  1. create a child. set default child limits to be 0.
>  2. set limit to child.
>     2-a. before setting limit to child, prepare enough room in parent.
>     2-b. increase 'usage' of parent by child's limit.

The problem with this is that you are forcing the parent to run into a reclaim
loop even if the child is not using the limit assigned to it.

>  3. the child sets its limit to the value moved from the parent.
>     the parent remembers the amount of resource moved to its children.
> 

All of this needs to be dynamic

>  The above means that
> 	- a directory's usage implies the sum of all subdirectories' limits +
>           its own usage.
> 	- there are no shared resources between parent <-> child.
> 
>  Pros.
>   - simple and easy policy.
>   - no hierarchy overhead.
>   - no resource sharing between child <-> parent. very suitable for multilevel
>     resource isolation.

Sharing is an important aspect of hierarchies. I am not convinced of this
approach. Did you look at the patches I sent out? Was there something
fundamentally broken in them?

[snip]

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
  2008-05-30 22:20   ` Balbir Singh
@ 2008-05-31  1:59   ` kamezawa.hiroyu
  2008-05-31 11:20     ` Balbir Singh
  2008-05-31 14:47     ` kamezawa.hiroyu
  2008-06-02  2:15   ` YAMAMOTO Takashi
  2008-06-02  9:52   ` kamezawa.hiroyu
  3 siblings, 2 replies; 14+ messages in thread
From: kamezawa.hiroyu @ 2008-05-31  1:59 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-mm, LKML, xemul, menage, yamamoto, lizf

>KAMEZAWA Hiroyuki wrote:
>> This patch tries to implement a _simple_ 'hierarchy policy' in res_counter.
>> 
>> While several hierarchy policies can be considered, this patch implements
>> a simple one:
>>    - the parent includes, over-commits the child
>>    - there are no shared resources
>
>I am not sure if this is desirable. The concept of a hierarchy applies really
>well when there are shared resources.
>
>>    - dynamic hierarchy resource usage management in the kernel is not necessary
>> 
>
>Could you please elaborate as to why? I am not sure I understand your point
>

OK, let's consider a _middleware_ which has the following parameters.

A parameter exported to the user:
   - user_memory_limit
A parameter for co-operation with the kernel:
   - kernel_memory_limit

And here,
   user_memory_limit >= kernel_memory_limit == cgroup's memory.limit_in_bytes

When a user asks the middleware to set the limit to 1Gbytes:
   user_memory_limit = 1G
   kernel_memory_limit = 0-1G.
It moves kernel_memory_limit dynamically between 0 and 1Gbytes and resets
memory.limit_in_bytes dynamically while checking the memory cgroup's statistics.
Of course, we can add some kind of interface, such as the following:
  - failure_notifier - triggered at failcnt increment.
  - threshold_notifier - triggered when usage > threshold.
>> works as following.
>> 
>>  1. create a child. set default child limits to be 0.
>>  2. set limit to child.
>>     2-a. before setting limit to child, prepare enough room in parent.
>>     2-b. increase 'usage' of parent by child's limit.
>
>The problem with this is that you are forcing the parent to run into a reclaim
>loop even if the child is not using the limit assigned to it.
>
That's not a problem because it's avoidable by users.
But it's OK to limit the sum of the children's limits to below XX% of the parent.

>>  3. the child sets its limit to the value moved from the parent.
>>     the parent remembers the amount of resource moved to its children.
>> 
>
>All of this needs to be dynamic
>
As explained, this can be dynamic by middleware.

>>  The above means that
>> 	- a directory's usage implies the sum of all subdirectories' limits +
>>           its own usage.
>> 	- there are no shared resources between parent <-> child.
>> 
>>  Pros.
>>   - simple and easy policy.
>>   - no hierarchy overhead.
>>   - no resource sharing between child <-> parent. very suitable for multilevel
>>     resource isolation.
>
>Sharing is an important aspect of hierarchies. I am not convinced of this
>approach. Did you look at the patches I sent out? Was there something
>fundamentally broken in them?
>

Yes, I read them. I tried to make it faster and found it would be complicated.
One problem is the overhead of the counter itself.
Another problem is the overhead of shrinking a multi-level LRU with feedback.
One more problem is that it's hard to implement various kinds of hierarchy
policy. I believe there are hierarchy policies other than the one OpenVZ
wants to use. Kicking out functions to middleware as much as possible (AMAP)
is what I'm thinking now.

Thanks,
-Kame




* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-31  1:59   ` kamezawa.hiroyu
@ 2008-05-31 11:20     ` Balbir Singh
  2008-05-31 14:47     ` kamezawa.hiroyu
  1 sibling, 0 replies; 14+ messages in thread
From: Balbir Singh @ 2008-05-31 11:20 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, LKML, xemul, menage, yamamoto, lizf

kamezawa.hiroyu@jp.fujitsu.com wrote:
>> KAMEZAWA Hiroyuki wrote:
>>> This patch tries to implement a _simple_ 'hierarchy policy' in res_counter.
>>>
>>> While several hierarchy policies can be considered, this patch implements
>>> a simple one:
>>>    - the parent includes, over-commits the child
>>>    - there are no shared resources
>> I am not sure if this is desirable. The concept of a hierarchy applies really
>> well when there are shared resources.
>>
>>>    - dynamic hierarchy resource usage management in the kernel is not necessary
>> Could you please elaborate as to why? I am not sure I understand your point
>>
> 
> OK, let's consider a _middleware_ which has the following parameters.
> 
> A parameter exported to the user:
>    - user_memory_limit
> A parameter for co-operation with the kernel:
>    - kernel_memory_limit
> 
> And here,
>    user_memory_limit >= kernel_memory_limit == cgroup's memory.limit_in_bytes
> 
> When a user asks the middleware to set the limit to 1Gbytes:
>    user_memory_limit = 1G
>    kernel_memory_limit = 0-1G.
> It moves kernel_memory_limit dynamically between 0 and 1Gbytes and resets
> memory.limit_in_bytes dynamically while checking the memory cgroup's statistics.
> Of course, we can add some kind of interface, such as the following:
>   - failure_notifier - triggered at failcnt increment.
>   - threshold_notifier - triggered when usage > threshold.
> 
>>> works as following.
>>>
>>>  1. create a child. set default child limits to be 0.
>>>  2. set limit to child.
>>>     2-a. before setting limit to child, prepare enough room in parent.
>>>     2-b. increase 'usage' of parent by child's limit.
>> The problem with this is that you are forcing the parent to run into a reclaim
>> loop even if the child is not using the limit assigned to it.
>>
> That's not a problem because it's avoidable by users.
> But it's OK to limit the sum of the children's limits to below XX% of the parent.
> 
>>>  3. the child sets its limit to the value moved from the parent.
>>>     the parent remembers the amount of resource moved to its children.
>>>
>> All of this needs to be dynamic
>>
> As explained, this can be dynamic by middleware.
> 
>>>  The above means that
>>> 	- a directory's usage implies the sum of all subdirectories' limits +
>>>           its own usage.
>>> 	- there are no shared resources between parent <-> child.
>>>
>>>  Pros.
>>>   - simple and easy policy.
>>>   - no hierarchy overhead.
>>>   - no resource sharing between child <-> parent. very suitable for multilevel
>>>     resource isolation.
>> Sharing is an important aspect of hierarchies. I am not convinced of this
>> approach. Did you look at the patches I sent out? Was there something
>> fundamentally broken in them?
>>
> 
> Yes, I read them. I tried to make it faster and found it would be complicated.
> One problem is the overhead of the counter itself.
> Another problem is the overhead of shrinking a multi-level LRU with feedback.
> One more problem is that it's hard to implement various kinds of hierarchy
> policy. I believe there are hierarchy policies other than the one OpenVZ
> wants to use. Kicking out functions to middleware as much as possible (AMAP)
> is what I'm thinking now.

One way to manage hierarchies other than via limits is to use shares (please see
the shares used by the cpu controller). Basically, what you've done with limits
is done with shares.

If a parent has 100 shares, then it can decide how many to pass on to its children
based on the shares of the child and your logic would work well. I propose
assigning top level (high resolution) shares to the root of the cgroup and in a
hierarchy passing them down to children and sharing it with them. Based on the
shares, deduce the limit of each node in the hierarchy.
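
As a rough illustration of deducing limits from shares, the sketch below splits
a parent's limit among its children in proportion to their shares. The function
name `limits_from_shares` and the plain proportional rule are assumptions for
this example; how the cpu controller actually handles shares is not modeled
here, and the multiplication is assumed not to overflow.

```c
#include <stddef.h>

/*
 * Hypothetical helper: split 'parent_limit' among 'n' children in
 * proportion to shares[]. Results are written to limits[].
 * Integer division rounds down, so the sum may be slightly below
 * parent_limit.
 */
static void limits_from_shares(unsigned long long parent_limit,
			       const unsigned int *shares,
			       unsigned long long *limits, size_t n)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += shares[i];

	for (i = 0; i < n; i++)
		limits[i] = total ? parent_limit * shares[i] / total : 0;
}
```

For example, shares of 30, 15 and 5 under a parent limit of 1000 (in MB) give
limits of 600, 300 and 100.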

What do you think?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-31  1:59   ` kamezawa.hiroyu
  2008-05-31 11:20     ` Balbir Singh
@ 2008-05-31 14:47     ` kamezawa.hiroyu
  2008-05-31 17:18       ` Balbir Singh
  2008-06-01  0:35       ` kamezawa.hiroyu
  1 sibling, 2 replies; 14+ messages in thread
From: kamezawa.hiroyu @ 2008-05-31 14:47 UTC (permalink / raw)
  To: balbir; +Cc: kamezawa.hiroyu, linux-mm, LKML, xemul, menage, yamamoto, lizf

----- Original Message -----

>> One more problem is that it's hard to implement various kinds of hierarchy
>> policy. I believe there are hierarchy policies other than the one OpenVZ
>> wants to use. Kicking out functions to middleware as much as possible (AMAP)
>> is what I'm thinking now.
>
>One way to manage hierarchies other than via limits is to use shares (please 
see
>the shares used by the cpu controller). Basically, what you've done with limi
ts
>is done with shares
>
Yes, I like _share_ rather than limits.

>If a parent has 100 shares, then it can decide how many to pass on to its children
>based on the shares of the child and your logic would work well. I propose
>assigning top level (high resolution) shares to the root of the cgroup and in a
>hierarchy passing them down to children and sharing it with them. Based on the
>shares, deduce the limit of each node in the hierarchy.
>
>What do you think?
>
As you wrote, middleware can do share-based control by using limits.
And it seems much easier to implement that in userland than in the kernel.

Here is an example. (just an example...)
Please point out if I'm misunderstanding "share".

root_level/                   = limit 1G.
          /child_A = share=30
          /child_B = share=15
          /child_C = share=5
(and assume there is no process under root_level, to make the explanation easy.)

0. At first, before starting to use memory, set all kernel_memory_limit.
root_level.limit = 1G
  child_A.limit=64M,usage=0
  child_B.limit=64M,usage=0
  child_C.limit=64M,usage=0
  free_resource=808M 

1. Next, a process in child_C starts to run and uses 600M of memory.
root_level.limit = 1G
  child_A.limit=64M
  child_B.limit=64M
  child_C.limit=600M,usage=600M
  free_resource=272M

2. Now, a process in child_A starts to run and uses 800M of memory.
  child_A.limit=800M,usage=800M
  child_B.limit=64M,usage=0M
  child_C.limit=136M,usage=136M
  free_resource=0,A:C=6:1

3. Finally, a process in child_B starts and tries to use 500M of memory.
  child_A.limit=600M,usage=600M
  child_B.limit=300M,usage=300M
  child_C.limit=100M,usage=100M
  free_resource=0, A:B:C=6:3:1

4. One more: a process in A exits.
  child_A.limit=64M, usage=0M
  child_B.limit=500M, usage=500M
  child_C.limit=436M, usage=436M
  free_resource=0, B:C=3:1 (but B just wants to use 500M)

This is only an example; the middleware can do more precise "limit"
control by checking statistics of the memory controller hierarchy, based on
its own policy.
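
The numbers in steps 0 to 4 above can be reproduced by a small weighted
"water-filling" calculation. The following is only a sketch of what such
middleware might compute, not code from the patch set. The function name
share_limits, the demand table, and the use of 64M as a floor for idle groups
are assumptions made for illustration; units are M, with 1G taken as 1000M as
in the example's arithmetic.

```python
def share_limits(total, shares, demand, floor):
    """Split `total` among groups by share weight.

    A group never receives more than max(demand, floor); whatever it
    leaves unused is redistributed to the remaining groups by share.
    Assumes the floors of all groups together fit inside `total`.
    """
    cap = {g: max(demand.get(g, 0), floor) for g in shares}
    limits = {}
    active = set(shares)
    remaining = total
    while active:
        weight = sum(shares[g] for g in active)
        # A group is satisfied if its cap fits inside its proportional slice
        # (compared without division to stay in integers).
        satisfied = [g for g in active
                     if cap[g] * weight <= remaining * shares[g]]
        if not satisfied:
            # Every remaining group wants more than its slice: split by share.
            for g in active:
                limits[g] = remaining * shares[g] // weight
            break
        for g in satisfied:
            limits[g] = cap[g]
            remaining -= cap[g]
            active.discard(g)
    return limits


SHARES = {"A": 30, "B": 15, "C": 5}
# Step 3 of the example: every group wants more than its 6:3:1 slice.
step3 = share_limits(1000, SHARES, {"A": 800, "B": 500, "C": 600}, 64)
# Step 4 of the example: A is idle, B only wants 500M, C takes the rest.
step4 = share_limits(1000, SHARES, {"A": 0, "B": 500, "C": 600}, 64)
```

With these inputs, step3 comes out as 600/300/100 and step4 as 64/500/436,
matching the limits listed in the example.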

What I'm thinking about now is what kind of statistics/notifiers/controls are
necessary to implement shares in middleware. How precisely/quickly the
middleware can work depends on the interfaces.
Maybe the middleware should know "how fast the application runs now" by
some kind of check or cooperative interface with the application.
But I'm not sure how the kernel can help with it.

Thanks,
-Kame


* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-31 14:47     ` kamezawa.hiroyu
@ 2008-05-31 17:18       ` Balbir Singh
  2008-06-01  0:35       ` kamezawa.hiroyu
  1 sibling, 0 replies; 14+ messages in thread
From: Balbir Singh @ 2008-05-31 17:18 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, LKML, xemul, menage, yamamoto, lizf

kamezawa.hiroyu@jp.fujitsu.com wrote:
> ----- Original Message -----
> 
>>> One more problem is that it's hard to implement various kinds of hierarchy
>>> policy. I believe there are hierarchy policies other than the one OpenVZ
>>> wants to use. Kicking out functions to middleware AMAP is what I'm thinking
>>> now.
>> One way to manage hierarchies other than via limits is to use shares (please see
>> the shares used by the cpu controller). Basically, what you've done with limits
>> is done with shares.
>>
> Yes, I like _share_ rather than limits.
> 
>> If a parent has 100 shares, then it can decide how many to pass on to its children
>> based on the shares of the child and your logic would work well. I propose
>> assigning top level (high resolution) shares to the root of the cgroup and in a
>> hierarchy passing them down to children and sharing it with them. Based on the
>> shares, deduce the limit of each node in the hierarchy.
>>
>> What do you think?
>>
> As you wrote, middleware can do share-based control by using limits.
> And it seems much easier to implement that in userland than in the kernel.

The good thing about user space is that it moves unnecessary code outside the
kernel, but the hard thing is standardization. If every middleware is going to
implement what you say, imagine the code duplication, unless we standardize this
into a library component. More comments below. I am not sure about the
difference between user_memory_limit and kernel_memory_limit. Could you please
elaborate?

> 
> Here is an example. (just an example...)
> Please point out if I'm misunderstanding "share".
> 
> root_level/                   = limit 1G.
>           /child_A = share=30
>           /child_B = share=15
>           /child_C = share=5
> (and assume there is no process under root_level, to make the explanation easy.)
> 
> 0. At first, before starting to use memory, set all kernel_memory_limit.
> root_level.limit = 1G
>   child_A.limit=64M,usage=0
>   child_B.limit=64M,usage=0
>   child_C.limit=64M,usage=0
>   free_resource=808M 
> 

This sounds incorrect, since the limits should be proportional to shares. If the
maximum shares in the root were 100 (ideally we want higher resolution than that),
then

child_A.limit = .3 * 1G
child_B.limit = .15 * 1G

and so on
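
To make the proportional rule concrete, here is a minimal sketch that deduces a
limit for every node from its shares. It is an illustration only: the function
name deduce_limits, the dictionary layout of the tree, and the idea of applying
the same max_shares=100 scale at every level are assumptions, and the nested
group A1 is hypothetical. Units are M, with 1G taken as 1000M.

```python
def deduce_limits(node, limit, max_shares=100, path="root", out=None):
    """Walk the hierarchy and give each child limit * shares / max_shares."""
    out = {} if out is None else out
    out[path] = limit
    for name, child in node.get("children", {}).items():
        child_limit = limit * child["shares"] // max_shares
        deduce_limits(child, child_limit, max_shares, f"{path}/{name}", out)
    return out


tree = {
    "children": {
        "A": {"shares": 30, "children": {"A1": {"shares": 50}}},
        "B": {"shares": 15},
        "C": {"shares": 5},
    }
}
limits = deduce_limits(tree, 1000)
```

For the 30/15/5 shares used in the thread this gives 300M, 150M and 50M, and
the hypothetical A1 receives half of A's 300M.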


> 1. Next, a process in child_C starts to run and uses 600M of memory.
> root_level.limit = 1G
>   child_A.limit=64M
>   child_B.limit=64M
>   child_C.limit=600M,usage=600M
>   free_resource=272M
> 

How is that feasible? Its limit was 64M; how did it bump up to 600M? If you
want something like that, child_C should have no limits.

> 2. Now, a process in child_A starts to run and uses 800M of memory.
>   child_A.limit=800M,usage=800M
>   child_B.limit=64M,usage=0M
>   child_C.limit=136M,usage=136M
>   free_resource=0,A:C=6:1
> 

Not sure I understand this step

> 3. Finally, a process in child_B starts and tries to use 500M of memory.
>   child_A.limit=600M,usage=600M
>   child_B.limit=300M,usage=300M
>   child_C.limit=100M,usage=100M
>   free_resource=0, A:B:C=6:3:1
> 

Not sure I understand this step

> 4. One more: a process in A exits.
>   child_A.limit=64M, usage=0M
>   child_B.limit=500M, usage=500M
>   child_C.limit=436M, usage=436M
>   free_resource=0, B:C=3:1 (but B just wants to use 500M)
> 

Not sure I understand this step

> This is only an example; the middleware can do more precise "limit"
> control by checking statistics of the memory controller hierarchy, based on
> its own policy.
> 
> What I'm thinking about now is what kind of statistics/notifiers/controls are
> necessary to implement shares in middleware. How precisely/quickly the
> middleware can work depends on the interfaces.
> Maybe the middleware should know "how fast the application runs now" by
> some kind of check or cooperative interface with the application.
> But I'm not sure how the kernel can help with it.

I am not sure if I understand your proposal at this point.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-31 14:47     ` kamezawa.hiroyu
  2008-05-31 17:18       ` Balbir Singh
@ 2008-06-01  0:35       ` kamezawa.hiroyu
  2008-06-02  6:16         ` Balbir Singh
  2008-06-02  9:48         ` kamezawa.hiroyu
  1 sibling, 2 replies; 14+ messages in thread
From: kamezawa.hiroyu @ 2008-06-01  0:35 UTC (permalink / raw)
  To: balbir; +Cc: kamezawa.hiroyu, linux-mm, LKML, xemul, menage, yamamoto, lizf

----- Original Message -----
>kamezawa.hiroyu@jp.fujitsu.com wrote:
>> ----- Original Message -----
>> 
>>>> One more problem is that it's hard to implement various kinds of hierarchy
>>>> policy. I believe there are hierarchy policies other than the one OpenVZ
>>>> wants to use. Kicking out functions to middleware AMAP is what I'm thinking
>>>> now.
>>> One way to manage hierarchies other than via limits is to use shares (please see
>>> the shares used by the cpu controller). Basically, what you've done with limits
>>> is done with shares.
>>>
>> Yes, I like _share_ rather than limits.
>> 
>>> If a parent has 100 shares, then it can decide how many to pass on to its children
>>> based on the shares of the child and your logic would work well. I propose
>>> assigning top level (high resolution) shares to the root of the cgroup and in a
>>> hierarchy passing them down to children and sharing it with them. Based on the
>>> shares, deduce the limit of each node in the hierarchy.
>>>
>>> What do you think?
>>>
>> As you wrote, middleware can do share-based control by using limits.
>> And it seems much easier to implement that in userland than in the kernel.
>
>The good thing about user space is that it moves unnecessary code outside the
>kernel, but the hard thing is standardization. If every middleware is going to
>implement what you say, imagine the code duplication, unless we standardize this
>into a library component.

That's not a problem. We're not developing a world-wide ecosystem.
It's good that there are several development groups. That's a way to evolve.
Something popular will become the de facto standard.
What we have to do is provide proper interfaces that allow a fair race.

>> 
>> Here is an example. (just an example...)
>> Please point out if I'm misunderstanding "share".
>> 
>> root_level/                   = limit 1G.
>>           /child_A = share=30
>>           /child_B = share=15
>>           /child_C = share=5
>> (and assume there is no process under root_level, to make the explanation easy.)
>> 
>> 0. At first, before starting to use memory, set all kernel_memory_limit.
>> root_level.limit = 1G
>>   child_A.limit=64M,usage=0
>>   child_B.limit=64M,usage=0
>>   child_C.limit=64M,usage=0
>>   free_resource=808M 
>> 
>
>This sounds incorrect, since the limits should be proportional to shares. If the
>maximum shares in the root were 100 (ideally we want higher resolution than that),
>then
>
>child_A.limit = .3 * 1G
>child_B.limit = .15 * 1G
>
>and so on
>
The above just shows the parameters given to the kernel.
From the user's view, the memory limitation is A:B:C=6:3:1 if memory is fully used.
(In the above case, usage=0.)

In general, a "share" takes effect only when the total usage reaches the limit.
(See how the cpu scheduler works.)
When the usage doesn't reach the limit, there is no limitation.

>
>> 1. Next, a process in child_C starts to run and uses 600M of memory.
>> root_level.limit = 1G
>>   child_A.limit=64M
>>   child_B.limit=64M
>>   child_C.limit=600M,usage=600M
>>   free_resource=272M
>> 
>
>How is that feasible? Its limit was 64M; how did it bump up to 600M? If you
>want something like that, child_C should have no limits.

The middleware just does this when child_C.failcnt goes up:
echo 64M > child_C/memory.limit_in_bytes
and it periodically checks A, B and C, and allows C to use what it wants because
A and B don't want memory.
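
A minimal sketch of that polling step follows, assuming the cgroup v1 memory
controller files memory.failcnt and memory.limit_in_bytes inside each group's
directory. The helper names, the free_bytes bookkeeping and the policy of
granting at most one 64M step per poll are assumptions for illustration, not
something the kernel or the patch provides.

```python
import os

STEP = 64 << 20  # 64M, the step size used in the example


def read_int(cgroup_dir, name):
    """Read one integer-valued control file from a cgroup directory."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())


def write_int(cgroup_dir, name, value):
    """Write an integer to a control file in a cgroup directory."""
    with open(os.path.join(cgroup_dir, name), "w") as f:
        f.write(str(value))


def grow_on_failcnt(cgroup_dir, last_failcnt, free_bytes, step=STEP):
    """Raise the limit by up to `step` if failcnt rose since the last poll.

    Returns (current_failcnt, bytes_granted). `free_bytes` is the middleware's
    own record of memory not yet handed to any group.
    """
    failcnt = read_int(cgroup_dir, "memory.failcnt")
    if failcnt <= last_failcnt or free_bytes <= 0:
        return failcnt, 0
    grant = min(step, free_bytes)
    limit = read_int(cgroup_dir, "memory.limit_in_bytes")
    write_int(cgroup_dir, "memory.limit_in_bytes", limit + grant)
    return failcnt, grant
```

A caller would invoke grow_on_failcnt periodically for each group, carry the
returned failcnt into the next poll, and subtract the granted bytes from its
free pool.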

>
>> 2. Now, a process in child_A starts to run and uses 800M of memory.
>>   child_A.limit=800M,usage=800M
>>   child_B.limit=64M,usage=0M
>>   child_C.limit=136M,usage=136M
>>   free_resource=0,A:C=6:1
>> 
>
>Not sure I understand this step
>
The middleware notices that usage in A is growing and moves resources to A.

echo (child_C's current limit - 64M) > child_C
echo (child_A's current limit + 64M) > child_A
Repeat the above step by step in a loop until A:C = 6:1.
(64M is just an example.)
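
The stepwise loop can be sketched as plain arithmetic on a table of limits.
This is an illustration under assumptions: the function name rebalance, the
64M floor for the donor, and the rule of stopping just before the target ratio
would be overshot are all choices made here, not part of the proposal. Units
are M.

```python
def rebalance(limits, donor, receiver, ratio, step=64, floor=64):
    """Move `step` at a time from donor to receiver.

    Stops when one more move would push receiver:donor past `ratio`
    or drop the donor below `floor`. Returns (new_limits, moves).
    """
    limits = dict(limits)  # do not modify the caller's table
    moves = 0
    while (limits[receiver] + step <= ratio * (limits[donor] - step)
           and limits[donor] - step >= floor):
        limits[donor] -= step
        limits[receiver] += step
        moves += 1
    return limits, moves


coarse, coarse_moves = rebalance({"A": 64, "C": 936}, "C", "A", 6, step=64)
fine, fine_moves = rebalance({"A": 64, "C": 936}, "C", "A", 6, step=8)
```

With 64M steps the loop stops at A=832M, C=168M (about 5:1); with 8M steps it
stops at A=856M, C=144M, much closer to 6:1. The step size therefore trades the
number of limit writes against how precisely the ratio is reached.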

>> 3. Finally, a process in child_B starts and tries to use 500M of memory.
>>   child_A.limit=600M,usage=600M
>>   child_B.limit=300M,usage=300M
>>   child_C.limit=100M,usage=100M
>>   free_resource=0, A:B:C=6:3:1
>> 
>
>Not sure I understand this step
>
echo (child_C's current limit - 64M) > child_C
echo (child_A's current limit - 64M) > child_A
echo (child_B's current limit + 64M) > child_B
Repeat the above step by step in a loop until A:B:C = 6:3:1.


>> 4. One more: a process in A exits.
>>   child_A.limit=64M, usage=0M
>>   child_B.limit=500M, usage=500M
>>   child_C.limit=436M, usage=436M
>>   free_resource=0, B:C=3:1 (but B just wants to use 500M)
>> 
>
>Not sure I understand this step
>
The middleware can notice that memory pressure from child_A is reduced.

echo (child_A's current limit - 64M) > child_A
echo (child_C's current limit + 64M) > child_C
echo (child_B's current limit + 64M) > child_B
Repeat the above step by step in a loop until B:C = 3:1, while avoiding
waste of resources.



>> This is only an example; the middleware can do more precise "limit"
>> control by checking statistics of the memory controller hierarchy, based on
>> its own policy.
>> 
>> What I'm thinking about now is what kind of statistics/notifiers/controls are
>> necessary to implement shares in middleware. How precisely/quickly the
>> middleware can work depends on the interfaces.
>> Maybe the middleware should know "how fast the application runs now" by
>> some kind of check or cooperative interface with the application.
>> But I'm not sure how the kernel can help with it.
>
>I am not sure if I understand your proposal at this point.
>

The most important point is that a cgroup's memory.limit_in_bytes
is _just_ a notification asking the kernel to limit the memory
usage of a process group temporarily. It changes often.
Based on what the user tells the middleware (share or limit),
the middleware sets limit_in_bytes to a suitable value
and changes it dynamically and periodically.

Thanks,
-Kame




* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
  2008-05-30 22:20   ` Balbir Singh
  2008-05-31  1:59   ` kamezawa.hiroyu
@ 2008-06-02  2:15   ` YAMAMOTO Takashi
  2008-06-02  9:52   ` kamezawa.hiroyu
  3 siblings, 0 replies; 14+ messages in thread
From: YAMAMOTO Takashi @ 2008-06-02  2:15 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, linux-kernel, balbir, xemul, menage, lizf

> @@ -135,13 +138,118 @@ ssize_t res_counter_write(struct res_cou
>  		if (*end != '\0')
>  			goto out_free;
>  	}
> -	spin_lock_irqsave(&counter->lock, flags);
> -	val = res_counter_member(counter, member);
> -	*val = tmp;
> -	spin_unlock_irqrestore(&counter->lock, flags);
> -	ret = nbytes;
> +	if (member != RES_LIMIT || !callback) {

is there any reason to check member != RES_LIMIT here,
rather than in callers?

> +/*
> + * Move resource to its parent.
> + *   child->limit -= val.
> + *   parent->usage -= val.
> + *   parent->limit -= val.

s/limit/for_children/

> + */
> +
> +int res_counter_repay_resource(struct res_counter *child,
> +				struct res_counter *parent,
> +				unsigned long long val,
> +				res_shrink_callback_t callback, int retry)

can you reduce gratuitous differences between
res_counter_borrow_resource and res_counter_repay_resource?
eg. 'success' vs 'done', how to decrement 'retry'.
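
As a reading aid for the borrow/repay symmetry requested here, the following
toy model mirrors the bookkeeping in the quoted comment and in the cover
letter's example. It is only a sketch: the class and function names are
invented, the for_children field is taken from the s/limit/for_children/
remark above, and the shrink callback and retry handling of the real functions
are left out (repay simply fails if the child still uses the memory).

```python
class Counter:
    """Toy stand-in for a res_counter: limit, usage, amount lent out."""

    def __init__(self, limit=0):
        self.limit = limit
        self.usage = 0
        self.for_children = 0


def borrow(child, parent, val):
    """Move `val` from parent to child; fails if the parent lacks room."""
    if parent.usage + val > parent.limit:
        return False
    parent.usage += val
    parent.for_children += val
    child.limit += val
    return True


def repay(child, parent, val):
    """Exact mirror of borrow(); fails if the child still needs `val`."""
    if val > child.limit or child.usage > child.limit - val:
        return False
    child.limit -= val
    parent.usage -= val
    parent.for_children -= val
    return True


# Cover-letter example, in M: A has 1G, B is raised to 300M, C to 500M,
# then B is reduced to 100M.
a, b, c = Counter(1024), Counter(), Counter()
borrow(b, a, 300)
borrow(c, a, 500)
repay(b, a, 200)
```

After these three calls the model holds A usage=600, B limit=100 and
C limit=500, the same state as step 5 of the cover letter's example.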

YAMAMOTO Takashi


* Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-06-01  0:35       ` kamezawa.hiroyu
@ 2008-06-02  6:16         ` Balbir Singh
  2008-06-02  9:48         ` kamezawa.hiroyu
  1 sibling, 0 replies; 14+ messages in thread
From: Balbir Singh @ 2008-06-02  6:16 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: linux-mm, LKML, xemul, menage, yamamoto, lizf

Hi, Kamezawa-san,

kamezawa.hiroyu@jp.fujitsu.com wrote:
> 
> That's not a problem. We're not developing a world-wide ecosystem.
> It's good that there are several development groups. That's a way to evolve.
> Something popular will become the de facto standard.
> What we have to do is provide proper interfaces that allow a fair race.
> 

I did not claim that we were developing an ecosystem either :)
My point is that we should not confuse *Linux* users. Let's do the common/useful
stuff in the kernel and make it easy for users to use the cgroup subsystem.

>>> Here is an example. (just an example...)
>>> Please point out if I'm misunderstanding "share".
>>>
>>> root_level/                   = limit 1G.
>>>           /child_A = share=30
>>>           /child_B = share=15
>>>           /child_C = share=5
>>> (and assume there is no process under root_level, to make the explanation easy.)
>>> 0. At first, before starting to use memory, set all kernel_memory_limit.
>>> root_level.limit = 1G
>>>   child_A.limit=64M,usage=0
>>>   child_B.limit=64M,usage=0
>>>   child_C.limit=64M,usage=0
>>>   free_resource=808M 
>>>
>> This sounds incorrect, since the limits should be proportional to shares. If the
>> maximum shares in the root were 100 (ideally we want higher resolution than that),
>> then
>>
>> child_A.limit = .3 * 1G
>> child_B.limit = .15 * 1G
>>
>> and so on
>>
> The above just shows the parameters given to the kernel.
> From the user's view, the memory limitation is A:B:C=6:3:1 if memory is fully used.
> (In the above case, usage=0.)
> 
> In general, a "share" takes effect only when the total usage reaches the limit.
> (See how the cpu scheduler works.)
> When the usage doesn't reach the limit, there is no limitation.
> 

If you are implying that shares imply a soft limit, I agree. But the only
parameter in the kernel right now is hard limits. We need to add soft limit support.

>>> 1. Next, a process in child_C starts to run and uses 600M of memory.
>>> root_level.limit = 1G
>>>   child_A.limit=64M
>>>   child_B.limit=64M
>>>   child_C.limit=600M,usage=600M
>>>   free_resource=272M
>>>
>> How is that feasible? Its limit was 64M; how did it bump up to 600M? If you
>> want something like that, child_C should have no limits.
> 
> The middleware just does this when child_C.failcnt goes up:
> echo 64M > child_C/memory.limit_in_bytes
> and it periodically checks A, B and C, and allows C to use what it wants because
> A and B don't want memory.
> 
>>> 2. Now, a process in child_A starts to run and uses 800M of memory.
>>>   child_A.limit=800M,usage=800M
>>>   child_B.limit=64M,usage=0M
>>>   child_C.limit=136M,usage=136M
>>>   free_resource=0,A:C=6:1
>>>
>> Not sure I understand this step
>>
> The middleware notices that usage in A is growing and moves resources to A.
> 
> echo (child_C's current limit - 64M) > child_C
> echo (child_A's current limit + 64M) > child_A
> Repeat the above step by step in a loop until A:C = 6:1.
> (64M is just an example.)
> 
>>> 3. Finally, a process in child_B starts and tries to use 500M of memory.
>>>   child_A.limit=600M,usage=600M
>>>   child_B.limit=300M,usage=300M
>>>   child_C.limit=100M,usage=100M
>>>   free_resource=0, A:B:C=6:3:1
>>>
>> Not sure I understand this step
>>
> echo (child_C's current limit - 64M) > child_C
> echo (child_A's current limit - 64M) > child_A
> echo (child_B's current limit + 64M) > child_B
> Repeat the above step by step in a loop until A:B:C = 6:3:1.
> 
> 
>>> 4. One more: a process in A exits.
>>>   child_A.limit=64M, usage=0M
>>>   child_B.limit=500M, usage=500M
>>>   child_C.limit=436M, usage=436M
>>>   free_resource=0, B:C=3:1 (but B just wants to use 500M)
>>>
>> Not sure I understand this step
>>
> The middleware can notice that memory pressure from child_A is reduced.
> 
> echo (child_A's current limit - 64M) > child_A
> echo (child_C's current limit + 64M) > child_C
> echo (child_B's current limit + 64M) > child_B
> Repeat the above step by step in a loop until B:C = 3:1, while avoiding
> waste of resources.
> 
> 
> 
>>> This is only an example; the middleware can do more precise "limit"
>>> control by checking statistics of the memory controller hierarchy, based on
>>> its own policy.
>>>
>>> What I'm thinking about now is what kind of statistics/notifiers/controls are
>>> necessary to implement shares in middleware. How precisely/quickly the
>>> middleware can work depends on the interfaces.
>>> Maybe the middleware should know "how fast the application runs now" by
>>> some kind of check or cooperative interface with the application.
>>> But I'm not sure how the kernel can help with it.
>> I am not sure if I understand your proposal at this point.
>>
> 
> The most important point is that a cgroup's memory.limit_in_bytes
> is _just_ a notification asking the kernel to limit the memory
> usage of a process group temporarily. It changes often.
> Based on what the user tells the middleware (share or limit),
> the middleware sets limit_in_bytes to a suitable value
> and changes it dynamically and periodically.
> 

Why don't we add soft limits, so that we don't have to go to the kernel and
change limits frequently? One missing piece in the memory controller is that we
don't shrink the memory controller when limits change or when tasks move. I
think soft limits are a better solution.
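
The thread does not define soft-limit semantics, so the following is only one
possible reading, sketched under assumptions: a group may exceed its soft limit
while memory is free, and under contention memory is reclaimed from groups in
proportion to how far they are over. The function name reclaim_targets and the
proportional policy are invented for illustration. Units are M.

```python
def reclaim_targets(usage, soft_limit, needed):
    """Decide how much to reclaim from each group under memory pressure.

    Only groups above their soft limit are touched, in proportion to
    their excess; no group is pushed below its soft limit.
    """
    excess = {g: max(usage[g] - soft_limit[g], 0) for g in usage}
    total_excess = sum(excess.values())
    if total_excess == 0:
        return {g: 0 for g in usage}
    needed = min(needed, total_excess)
    return {g: excess[g] * needed // total_excess for g in usage}


# State after step 4 of the example, with the 6:3:1 split as soft limits.
usage = {"A": 0, "B": 500, "C": 436}
soft = {"A": 600, "B": 300, "C": 100}
targets = reclaim_targets(usage, soft, 268)
```

In this reading, if A becomes active again and needs 268M, B gives back 100M
and C gives back 168M, without any limit being rewritten from userland.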

Thanks for patiently explaining all of this.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-06-01  0:35       ` kamezawa.hiroyu
  2008-06-02  6:16         ` Balbir Singh
@ 2008-06-02  9:48         ` kamezawa.hiroyu
  1 sibling, 0 replies; 14+ messages in thread
From: kamezawa.hiroyu @ 2008-06-02  9:48 UTC (permalink / raw)
  To: balbir; +Cc: kamezawa.hiroyu, linux-mm, LKML, xemul, menage, yamamoto, lizf

>Why don't we add soft limits, so that we don't have to go to the kernel and
>change limits frequently? One missing piece in the memory controller is that we
>don't shrink the memory controller when limits change or when tasks move. I
>think soft limits are a better solution.
>
My code adds shrinking_at_limit_change. I'm now trying to write
migrate_resources_at_task_move. (But it doesn't seem easy to implement in a
clean/fast way.)

I have no objection to soft limits if they're easy to implement. (As I wrote,
my explanation was just an example and we could add more knobs.)
_But_ I think that controlling multiple cgroups with regard to hierarchy
under some policy will never be simple. Adding some knobs to each cgroup
for soft limits would be simple if there were no hierarchy.

The memory controller's difference from the scheduler's hierarchy is that we have to
do multilevel page reclaim with feedback under some policy (and not only one policy).
Even without hierarchy, we _did_ make the kernel's LRU logic more complicated.
But we can get help from the middleware here, I think.

My goal is to never make cgroup slow or complicated. If it's slow,
I'd have to say "ok, please use VMware. It's simpler and fast enough for you."
"How fast it works compared to hardware virtualization" is the most
important thing for me. It should be much faster.

>Thanks for patiently explaining all of this.
>
Thanks. I'm sorry for my poor explanation skills.

Regards,
-Kame


* Re: Re: [RFC][PATCH 1/2] memcg: res_counter hierarchy
  2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
                     ` (2 preceding siblings ...)
  2008-06-02  2:15   ` YAMAMOTO Takashi
@ 2008-06-02  9:52   ` kamezawa.hiroyu
  3 siblings, 0 replies; 14+ messages in thread
From: kamezawa.hiroyu @ 2008-06-02  9:52 UTC (permalink / raw)
  To: YAMAMOTO Takashi
  Cc: kamezawa.hiroyu, linux-mm, linux-kernel, balbir, xemul, menage,
	lizf

----- Original Message -----

>> @@ -135,13 +138,118 @@ ssize_t res_counter_write(struct res_cou
>>  		if (*end != '\0')
>>  			goto out_free;
>>  	}
>> -	spin_lock_irqsave(&counter->lock, flags);
>> -	val = res_counter_member(counter, member);
>> -	*val = tmp;
>> -	spin_unlock_irqrestore(&counter->lock, flags);
>> -	ret = nbytes;
>> +	if (member != RES_LIMIT || !callback) {
>
>is there any reason to check member != RES_LIMIT here,
>rather than in callers?

Hmm...ok. This is messy. I'll rearrange this.


>
>> +/*
>> + * Move resource to its parent.
>> + *   child->limit -= val.
>> + *   parent->usage -= val.
>> + *   parent->limit -= val.
>
>s/limit/for_children/
>
>> + */
>> +
>> +int res_counter_repay_resource(struct res_counter *child,
>> +				struct res_counter *parent,
>> +				unsigned long long val,
>> +				res_shrink_callback_t callback, int retry)
>
>can you reduce gratuitous differences between
>res_counter_borrow_resource and res_counter_repay_resource?
>eg. 'success' vs 'done', how to decrement 'retry'.
>

Ah, sorry. I'll rewrite this.
I'll improve the quality of the next version.

Thanks,
-Kame




Thread overview: 14+ messages
2008-05-30  1:43 [RFC][PATCH 0/2] memcg: simple hierarchy (v2) KAMEZAWA Hiroyuki
2008-05-30  1:45 ` [RFC][PATCH 1/2] memcg: res_counter hierarchy KAMEZAWA Hiroyuki
2008-05-30 22:20   ` Balbir Singh
2008-05-31  1:59   ` kamezawa.hiroyu
2008-05-31 11:20     ` Balbir Singh
2008-05-31 14:47     ` kamezawa.hiroyu
2008-05-31 17:18       ` Balbir Singh
2008-06-01  0:35       ` kamezawa.hiroyu
2008-06-02  6:16         ` Balbir Singh
2008-06-02  9:48         ` kamezawa.hiroyu
2008-06-02  2:15   ` YAMAMOTO Takashi
2008-06-02  9:52   ` kamezawa.hiroyu
2008-05-30  1:46 ` [RFC][PATCH 2/2] memcg: memcg hierarchy KAMEZAWA Hiroyuki
2008-05-30  1:46 ` [RFC][PATCH 0/2] memcg: simple hierarchy (v2) Rik van Riel
