linux-mm.kvack.org archive mirror
* [RFC][PATCH 0/9] memcg soft limit v2 (new design)
@ 2009-04-03  8:08 KAMEZAWA Hiroyuki
  2009-04-03  8:09 ` [RFC][PATCH 1/9] " KAMEZAWA Hiroyuki
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:08 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: linux-kernel@vger.kernel.org, balbir@linux.vnet.ibm.com,
	kosaki.motohiro@jp.fujitsu.com

Hi,

Memory cgroup's soft limit is a feature that tells the global LRU
"please reclaim from this memcg under memory shortage".

This is v2. It fixes some troubles under hierarchy and adds soft limit
update hooks in proper places.

This patch set applies on top of
  mmotm-Mar23 + memcg-cleanup-cache_charge.patch
  + vmscan-fix-it-to-take-care-of-nodemask.patch

So, not for wide use ;)

This patch set avoids using the existing memcg reclaim routine and
just gives "hints" to the global LRU. It has been briefly tested and shows
good results for me. (But maybe not for you. If so, please blame me.)

Major characteristics:
 - A memcg is inserted into the softlimit-queue at charge() if its usage
   exceeds the soft limit.
 - The softlimit-queue is a priority queue; priority is determined by the
   amount by which usage exceeds the soft limit.
 - memcg's soft limit hooks are called by shrink_xxx_list() to provide hints.
 - Behavior is affected by vm.swappiness, and the LRU scan rate is determined
   by the global LRU's status.

In this v2:
 - problems under the use_hierarchy=1 case are fixed.
 - more hooks are added.
 - code is cleaned up.

Shows good results on my private test box under several workloads.

But in a special artificial case, when the victim memcg's active/inactive
ratio of ANON is very different from the global LRU's, the result is not
very good, i.e.
  under the victim memcg, ACTIVE_ANON=100%, INACTIVE=0% (memory accessed in a busy loop)
  under global, ACTIVE_ANON=10%, INACTIVE=90% (almost all processes are sleeping)
memory can be swapped out from the global LRU, not from the victim.
(If there are file caches in the victim, the file caches will be dropped.)

But in this case, even if we successfully swap out anon pages under the victim
memcg, they will come back to memory soon and can show heavy thrashing.

While using soft limits, I felt this is a useful feature :)
But I'll keep this as RFC for a while. I'll prepare Documentation by the next post.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [RFC][PATCH 1/9] memcg soft limit v2 (new design)
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
@ 2009-04-03  8:09 ` KAMEZAWA Hiroyuki
  2009-04-03  8:10 ` [RFC][PATCH 2/9] soft limit framework for memcg KAMEZAWA Hiroyuki
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

No changes from v1.
==
From: Balbir Singh <balbir@linux.vnet.ibm.com>
Changelog v1 -> v2
1. Add support for res_counter_check_soft_limit_locked. This is used
   by the hierarchy code.

Add an interface to allow get/set of soft limits. Soft limits for the memory
plus swap controller (memsw) are currently not supported. Resource counters
have been enhanced to support soft limits, and a new type, RES_SOFT_LIMIT,
has been added. Unlike hard limits, soft limits can be set directly and do
not need any reclaim or checks before being set to a new value.

Kamezawa-San raised a question as to whether the soft limit should belong
to res_counter. Since all resources understand the basic concepts of
hard and soft limits, it is justified to add soft limits here. Soft limits
are a generic resource usage feature; even file system quotas support
soft limits.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
Index: softlimit-test2/include/linux/res_counter.h
===================================================================
--- softlimit-test2.orig/include/linux/res_counter.h
+++ softlimit-test2/include/linux/res_counter.h
@@ -35,6 +35,10 @@ struct res_counter {
 	 */
 	unsigned long long limit;
 	/*
+	 * the limit that usage can exceed
+	 */
+	unsigned long long soft_limit;
+	/*
 	 * the number of unsuccessful attempts to consume the resource
 	 */
 	unsigned long long failcnt;
@@ -85,6 +89,7 @@ enum {
 	RES_MAX_USAGE,
 	RES_LIMIT,
 	RES_FAILCNT,
+	RES_SOFT_LIMIT,
 };
 
 /*
@@ -130,6 +135,36 @@ static inline bool res_counter_limit_che
 	return false;
 }
 
+static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
+{
+	if (cnt->usage < cnt->soft_limit)
+		return true;
+
+	return false;
+}
+
+/**
+ * Get the difference between the usage and the soft limit
+ * @cnt: The counter
+ *
+ * Returns 0 if usage is less than or equal to soft limit
+ * The difference between usage and soft limit, otherwise.
+ */
+static inline unsigned long long
+res_counter_soft_limit_excess(struct res_counter *cnt)
+{
+	unsigned long long excess;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	if (cnt->usage <= cnt->soft_limit)
+		excess = 0;
+	else
+		excess = cnt->usage - cnt->soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return excess;
+}
+
 /*
  * Helper function to detect if the cgroup is within it's limit or
  * not. It's currently called from cgroup_rss_prepare()
@@ -145,6 +180,17 @@ static inline bool res_counter_check_und
 	return ret;
 }
 
+static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = res_counter_soft_limit_check_locked(cnt);
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline void res_counter_reset_max(struct res_counter *cnt)
 {
 	unsigned long flags;
@@ -178,4 +224,16 @@ static inline int res_counter_set_limit(
 	return ret;
 }
 
+static inline int
+res_counter_set_soft_limit(struct res_counter *cnt,
+				unsigned long long soft_limit)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	cnt->soft_limit = soft_limit;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return 0;
+}
+
 #endif
Index: softlimit-test2/kernel/res_counter.c
===================================================================
--- softlimit-test2.orig/kernel/res_counter.c
+++ softlimit-test2/kernel/res_counter.c
@@ -19,6 +19,7 @@ void res_counter_init(struct res_counter
 {
 	spin_lock_init(&counter->lock);
 	counter->limit = (unsigned long long)LLONG_MAX;
+	counter->soft_limit = (unsigned long long)LLONG_MAX;
 	counter->parent = parent;
 }
 
@@ -101,6 +102,8 @@ res_counter_member(struct res_counter *c
 		return &counter->limit;
 	case RES_FAILCNT:
 		return &counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return &counter->soft_limit;
 	};
 
 	BUG();
Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -1988,6 +1988,20 @@ static int mem_cgroup_write(struct cgrou
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
+	case RES_SOFT_LIMIT:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		/*
+		 * For memsw, soft limits are hard to implement in terms
+		 * of semantics, for now, we support soft limits for
+		 * control without swap
+		 */
+		if (type == _MEM)
+			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else
+			ret = -EINVAL;
+		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
 		break;
@@ -2237,6 +2251,12 @@ static struct cftype mem_cgroup_files[] 
 		.read_u64 = mem_cgroup_read,
 	},
 	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.trigger = mem_cgroup_reset,


* [RFC][PATCH 2/9]  soft limit framework for memcg.
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
  2009-04-03  8:09 ` [RFC][PATCH 1/9] " KAMEZAWA Hiroyuki
@ 2009-04-03  8:10 ` KAMEZAWA Hiroyuki
  2009-04-03  8:12 ` [RFC][PATCH 3/9] soft limit update filter KAMEZAWA Hiroyuki
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Add minimal modifications for soft limits to res_counter_charge() and
memcontrol.c. Based on Balbir Singh <balbir@linux.vnet.ibm.com>'s work, but
most of the features are removed (dropped or moved to later patches).

This builds a frame for implementing the soft limit handler in memcg.
 - Checks soft limit status at every charge.
 - Adds mem_cgroup_soft_limit_check() as a function to detect whether we
   need a check now or not.
 - mem_cgroup_update_soft_limit() is a function that updates the internal
   status of memcg's soft limit controller.
 - As an experiment, this has no hooks in the uncharge path.

Changelog: v1 -> v2
 - removed "update" from mem_cgroup_free() (revisit in later patch.)

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/res_counter.h |    3 ++-
 kernel/res_counter.c        |   12 +++++++++++-
 mm/memcontrol.c             |   19 +++++++++++++++++--
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: softlimit-test2/include/linux/res_counter.h
===================================================================
--- softlimit-test2.orig/include/linux/res_counter.h
+++ softlimit-test2/include/linux/res_counter.h
@@ -112,7 +112,8 @@ void res_counter_init(struct res_counter
 int __must_check res_counter_charge_locked(struct res_counter *counter,
 		unsigned long val);
 int __must_check res_counter_charge(struct res_counter *counter,
-		unsigned long val, struct res_counter **limit_fail_at);
+		unsigned long val, struct res_counter **limit_fail_at,
+		bool *soft_limit_failure);
 
 /*
  * uncharge - tell that some portion of the resource is released
Index: softlimit-test2/kernel/res_counter.c
===================================================================
--- softlimit-test2.orig/kernel/res_counter.c
+++ softlimit-test2/kernel/res_counter.c
@@ -37,9 +37,11 @@ int res_counter_charge_locked(struct res
 }
 
 int res_counter_charge(struct res_counter *counter, unsigned long val,
-			struct res_counter **limit_fail_at)
+			struct res_counter **limit_fail_at,
+			bool *soft_limit_failure)
 {
 	int ret;
+	int soft_cnt = 0;
 	unsigned long flags;
 	struct res_counter *c, *u;
 
@@ -48,6 +50,8 @@ int res_counter_charge(struct res_counte
 	for (c = counter; c != NULL; c = c->parent) {
 		spin_lock(&c->lock);
 		ret = res_counter_charge_locked(c, val);
+		if (!res_counter_soft_limit_check_locked(c))
+			soft_cnt += 1;
 		spin_unlock(&c->lock);
 		if (ret < 0) {
 			*limit_fail_at = c;
@@ -55,6 +59,12 @@ int res_counter_charge(struct res_counte
 		}
 	}
 	ret = 0;
+	if (soft_limit_failure) {
+		if (!soft_cnt)
+			*soft_limit_failure = false;
+		else
+			*soft_limit_failure = true;
+	}
 	goto done;
 undo:
 	for (u = counter; u != c; u = u->parent) {
Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -897,6 +897,15 @@ static void record_last_oom(struct mem_c
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
+static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
+{
+	return false;
+}
+
+static void mem_cgroup_update_soft_limit(struct mem_cgroup *mem)
+{
+	return;
+}
 
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
@@ -909,6 +918,7 @@ static int __mem_cgroup_try_charge(struc
 	struct mem_cgroup *mem, *mem_over_limit;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct res_counter *fail_res;
+	bool soft_fail;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
@@ -938,12 +948,13 @@ static int __mem_cgroup_try_charge(struc
 		int ret;
 		bool noswap = false;
 
-		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+		ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
+						&soft_fail);
 		if (likely(!ret)) {
 			if (!do_swap_account)
 				break;
 			ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
-							&fail_res);
+							&fail_res, NULL);
 			if (likely(!ret))
 				break;
 			/* mem+swap counter fails */
@@ -985,6 +996,10 @@ static int __mem_cgroup_try_charge(struc
 			goto nomem;
 		}
 	}
+
+	if (soft_fail && mem_cgroup_soft_limit_check(mem))
+		mem_cgroup_update_soft_limit(mem);
+
 	return 0;
 nomem:
 	css_put(&mem->css);


* [RFC][PATCH 3/9] soft limit update filter
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
  2009-04-03  8:09 ` [RFC][PATCH 1/9] " KAMEZAWA Hiroyuki
  2009-04-03  8:10 ` [RFC][PATCH 2/9] soft limit framework for memcg KAMEZAWA Hiroyuki
@ 2009-04-03  8:12 ` KAMEZAWA Hiroyuki
  2009-04-06  9:43   ` Balbir Singh
  2009-04-03  8:12 ` [RFC][PATCH 4/9] soft limit queue and priority KAMEZAWA Hiroyuki
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

No changes from v1.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Checking/updating soft limit information at every charge is overkill, so
we need some filter.

This patch counts events in the memcg and, if events > threshold, tries to
update the memcg's soft limit status and resets the event counter to 0.

The event counter is maintained in the per-cpu statistics already in use, so
in theory there is no significant overhead (extra cache misses, etc.).

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.29-Mar23/mm/memcontrol.c
===================================================================
--- mmotm-2.6.29-Mar23.orig/mm/memcontrol.c
+++ mmotm-2.6.29-Mar23/mm/memcontrol.c
@@ -66,6 +66,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 
+	MEM_CGROUP_STAT_EVENTS,  /* sum of page-in/page-out for internal use */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
@@ -105,6 +106,22 @@ static s64 mem_cgroup_local_usage(struct
 	return ret;
 }
 
+/* For internal use of per-cpu event counting. */
+
+static inline void
+__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
+		enum mem_cgroup_stat_index idx)
+{
+	stat->count[idx] = 0;
+}
+
+static inline s64
+__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
+			    enum mem_cgroup_stat_index idx)
+{
+	return stat->count[idx];
+}
+
 /*
  * per-zone information in memory controller.
  */
@@ -235,6 +252,8 @@ static void mem_cgroup_charge_statistics
 	else
 		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
+
 	put_cpu();
 }
 
@@ -897,9 +916,26 @@ static void record_last_oom(struct mem_c
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
+#define SOFTLIMIT_EVENTS_THRESH (1024) /* 1024 times of page-in/out */
+/*
+ * Returns true if sum of page-in/page-out events since last check is
+ * over SOFTLIMIT_EVENTS_THRESH. (counter is per-cpu.)
+ */
 static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
 {
-	return false;
+	bool ret = false;
+	int cpu = get_cpu();
+	s64 val;
+	struct mem_cgroup_stat_cpu *cpustat;
+
+	cpustat = &mem->stat.cpustat[cpu];
+	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
+	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
+		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+		ret = true;
+	}
+	put_cpu();
+	return ret;
 }
 
 static void mem_cgroup_update_soft_limit(struct mem_cgroup *mem)


* [RFC][PATCH 4/9] soft limit queue and priority
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2009-04-03  8:12 ` [RFC][PATCH 3/9] soft limit update filter KAMEZAWA Hiroyuki
@ 2009-04-03  8:12 ` KAMEZAWA Hiroyuki
  2009-04-06 11:05   ` Balbir Singh
  2009-04-06 18:42   ` Balbir Singh
  2009-04-03  8:13 ` [RFC][PATCH 5/9] add more hooks and check in lazy manner KAMEZAWA Hiroyuki
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Softlimit queue for memcg.

Implements an array of queues to list memcgs; the array index is determined
by the amount by which memory usage exceeds the soft limit.

While Balbir's version uses an RB-tree and my old one used a per-zone queue
(with round-robin), this is a mixture of the two.
(I'd like to use rotation of the queue in later patches.)

Priority is determined as follows.
   Assume unit = total pages/1024. (the code uses a different value)
   if excess is...
      < unit,          priority = 0,
      < unit*2,        priority = 1,
      < unit*2^2,      priority = 2,
      ...
      < unit*2^9,      priority = 9,
      < unit*2^10,     priority = 10, (> 50% of total mem)

This patch includes only the queue management part, not the selection
logic from the queue. Some trick will be used for selecting victims at
soft limit reclaim in an efficient way.

And this equips 2 queues, for anon and file. Insertion/deletion of both lists
is done at once, but scanning will be independent. (These 2 queues are used later.)

The major difference from Balbir's version, other than the RB-tree, is behavior
under hierarchy. This one adds all children to the queue by checking hierarchical
priority. This helps the per-zone usage check in the victim-selection logic.

Changelog: v1 -> v2
 - fixed comments.
 - changed base size to an exponent.
 - some micro optimizations to reduce code size.
 - considering memory hotplug, it's bad manners to record a value calculated
   from totalram_pages at boot and use it later. Fixed it.
 - removed soft_limit_lock (spinlock)
 - added a soft_limit_update counter to avoid multiple updates at once.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 117 insertions(+), 1 deletion(-)

Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -192,7 +192,14 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
-
+	/*
+	 * For soft limit.
+	 */
+	int soft_limit_priority;
+	struct list_head soft_limit_list[2];
+#define SL_ANON (0)
+#define SL_FILE (1)
+	atomic_t soft_limit_update;
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -938,11 +945,115 @@ static bool mem_cgroup_soft_limit_check(
 	return ret;
 }
 
+/*
+ * Assume "base_amount", and excess = usage - soft limit.
+ *
+ * 0...... if excess < base_amount
+ * 1...... if excess < base_amount * 2
+ * 2...... if excess < base_amount * 2^2
+ * 3.......if excess < base_amount * 2^3
+ * ....
+ * 9.......if excess < base_amount * 2^9
+ * 10 .....if excess < base_amount * 2^10
+ *
+ * base_amount is determined from total pages in the system.
+ */
+
+#define SLQ_MAXPRIO (11)
+static struct {
+	spinlock_t lock;
+	struct list_head queue[SLQ_MAXPRIO][2]; /* 0:anon 1:file */
+} softlimitq;
+
+#define SLQ_PRIO_FACTOR (1024) /* 2^10 */
+
+static int __calc_soft_limit_prio(unsigned long excess)
+{
+	unsigned long factor = totalram_pages / SLQ_PRIO_FACTOR;
+
+	return fls(excess/factor);
+}
+
+static int mem_cgroup_soft_limit_prio(struct mem_cgroup *mem)
+{
+	unsigned long excess, max_excess = 0;
+	struct res_counter *c = &mem->res;
+
+	do {
+		excess = res_counter_soft_limit_excess(c) >> PAGE_SHIFT;
+		if (max_excess < excess)
+			max_excess = excess;
+		c = c->parent;
+	} while (c);
+
+	return __calc_soft_limit_prio(max_excess);
+}
+
+static void __mem_cgroup_requeue(struct mem_cgroup *mem, int prio)
+{
+	/* enqueue to softlimit queue */
+	int i;
+
+	spin_lock(&softlimitq.lock);
+	if (prio != mem->soft_limit_priority) {
+		mem->soft_limit_priority = prio;
+		for (i = 0; i < 2; i++) {
+			list_del_init(&mem->soft_limit_list[i]);
+			list_add_tail(&mem->soft_limit_list[i],
+				      &softlimitq.queue[prio][i]);
+		}
+	}
+	spin_unlock(&softlimitq.lock);
+}
+
+static void __mem_cgroup_dequeue(struct mem_cgroup *mem)
+{
+	int i;
+
+	spin_lock(&softlimitq.lock);
+	for (i = 0; i < 2; i++)
+		list_del_init(&mem->soft_limit_list[i]);
+	spin_unlock(&softlimitq.lock);
+}
+
+static int
+__mem_cgroup_update_soft_limit_cb(struct mem_cgroup *mem, void *data)
+{
+	int priority;
+	/* If someone updates, we don't need more */
+	priority = mem_cgroup_soft_limit_prio(mem);
+
+	if (priority != mem->soft_limit_priority)
+		__mem_cgroup_requeue(mem, priority);
+	return 0;
+}
+
 static void mem_cgroup_update_soft_limit(struct mem_cgroup *mem)
 {
+	int priority;
+
+	/* check status change */
+	priority = mem_cgroup_soft_limit_prio(mem);
+	if (priority != mem->soft_limit_priority &&
+	    atomic_inc_return(&mem->soft_limit_update) > 1) {
+		mem_cgroup_walk_tree(mem, NULL,
+				     __mem_cgroup_update_soft_limit_cb);
+		atomic_set(&mem->soft_limit_update, 0);
+	}
 	return;
 }
 
+static void softlimitq_init(void)
+{
+	int i;
+
+	spin_lock_init(&softlimitq.lock);
+	for (i = 0; i < SLQ_MAXPRIO; i++) {
+		INIT_LIST_HEAD(&softlimitq.queue[i][SL_ANON]);
+		INIT_LIST_HEAD(&softlimitq.queue[i][SL_FILE]);
+	}
+}
+
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
  * oom-killer can be invoked.
@@ -2512,6 +2623,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	if (cont->parent == NULL) {
 		enable_swap_cgroup();
 		parent = NULL;
+		softlimitq_init();
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
 		mem->use_hierarchy = parent->use_hierarchy;
@@ -2532,6 +2644,9 @@ mem_cgroup_create(struct cgroup_subsys *
 		res_counter_init(&mem->memsw, NULL);
 	}
 	mem->last_scanned_child = 0;
+	mem->soft_limit_priority = 0;
+	INIT_LIST_HEAD(&mem->soft_limit_list[SL_ANON]);
+	INIT_LIST_HEAD(&mem->soft_limit_list[SL_FILE]);
 	spin_lock_init(&mem->reclaim_param_lock);
 
 	if (parent)
@@ -2556,6 +2671,7 @@ static void mem_cgroup_destroy(struct cg
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	__mem_cgroup_dequeue(mem);
 	mem_cgroup_put(mem);
 }
 


* [RFC][PATCH 5/9] add more hooks and check in lazy manner
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2009-04-03  8:12 ` [RFC][PATCH 4/9] soft limit queue and priority KAMEZAWA Hiroyuki
@ 2009-04-03  8:13 ` KAMEZAWA Hiroyuki
  2009-04-03  8:14 ` [RFC][PATCH 6/9] active inactive ratio for private KAMEZAWA Hiroyuki
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Adds 2 more soft limit update hooks:
 - uncharge
 - write to the memory.soft_limit_in_bytes file.
And fixes issues under hierarchy. (This is the most complicated part...)

Because uncharge() can be called under a very busy spin lock, all checks
should be done lazily. We apply this lazy work to the charge() path as well
and make use of it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   66 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 55 insertions(+), 11 deletions(-)

Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -200,6 +200,8 @@ struct mem_cgroup {
 #define SL_ANON (0)
 #define SL_FILE (1)
 	atomic_t soft_limit_update;
+	struct work_struct soft_limit_work;
+
 	/*
 	 * statistics. This must be placed at the end of memcg.
 	 */
@@ -989,6 +991,23 @@ static int mem_cgroup_soft_limit_prio(st
 	return __calc_soft_limit_prio(max_excess);
 }
 
+static struct mem_cgroup *
+mem_cgroup_soft_limit_need_check(struct mem_cgroup *mem)
+{
+	struct res_counter *c = &mem->res;
+	unsigned long excess, prio;
+
+	do {
+		excess = res_counter_soft_limit_excess(c) >> PAGE_SHIFT;
+		prio = __calc_soft_limit_prio(excess);
+		mem = container_of(c, struct mem_cgroup, res);
+		if (mem->soft_limit_priority != prio)
+			return mem;
+		c = c->parent;
+	} while (c);
+	return NULL;
+}
+
 static void __mem_cgroup_requeue(struct mem_cgroup *mem, int prio)
 {
 	/* enqueue to softlimit queue */
@@ -1028,18 +1047,36 @@ __mem_cgroup_update_soft_limit_cb(struct
 	return 0;
 }
 
-static void mem_cgroup_update_soft_limit(struct mem_cgroup *mem)
+static void mem_cgroup_update_soft_limit_work(struct work_struct *work)
 {
-	int priority;
+	struct mem_cgroup *mem;
+
+	mem = container_of(work, struct mem_cgroup, soft_limit_work);
+
+	mem_cgroup_walk_tree(mem, NULL, __mem_cgroup_update_soft_limit_cb);
+	atomic_set(&mem->soft_limit_update, 0);
+	css_put(&mem->css);
+}
+
+static void mem_cgroup_update_soft_limit_lazy(struct mem_cgroup *mem)
+{
+	int ret, priority;
+	struct mem_cgroup * root;
+
+	/*
+	 * check status change under hierarchy.
+	 */
+	root = mem_cgroup_soft_limit_need_check(mem);
+	if (!root)
+		return;
+
+	if (atomic_inc_return(&root->soft_limit_update) > 1)
+		return;
+	css_get(&root->css);
+	ret = schedule_work(&root->soft_limit_work);
+	if (!ret)
+		css_put(&root->css);
 
-	/* check status change */
-	priority = mem_cgroup_soft_limit_prio(mem);
-	if (priority != mem->soft_limit_priority &&
-	    atomic_inc_return(&mem->soft_limit_update) > 1) {
-		mem_cgroup_walk_tree(mem, NULL,
-				     __mem_cgroup_update_soft_limit_cb);
-		atomic_set(&mem->soft_limit_update, 0);
-	}
 	return;
 }
 
@@ -1145,7 +1182,7 @@ static int __mem_cgroup_try_charge(struc
 	}
 
 	if (soft_fail && mem_cgroup_soft_limit_check(mem))
-		mem_cgroup_update_soft_limit(mem);
+		mem_cgroup_update_soft_limit_lazy(mem);
 
 	return 0;
 nomem:
@@ -1625,6 +1662,9 @@ __mem_cgroup_uncharge_common(struct page
 	mz = page_cgroup_zoneinfo(pc);
 	unlock_page_cgroup(pc);
 
+	if (mem->soft_limit_priority && mem_cgroup_soft_limit_check(mem))
+		mem_cgroup_update_soft_limit_lazy(mem);
+
 	/* at swapout, this memcg will be accessed to record to swap */
 	if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		css_put(&mem->css);
@@ -2163,6 +2203,9 @@ static int mem_cgroup_write(struct cgrou
 			ret = res_counter_set_soft_limit(&memcg->res, val);
 		else
 			ret = -EINVAL;
+		if (!ret)
+			mem_cgroup_update_soft_limit_lazy(memcg);
+
 		break;
 	default:
 		ret = -EINVAL; /* should be BUG() ? */
@@ -2648,6 +2691,7 @@ mem_cgroup_create(struct cgroup_subsys *
 	INIT_LIST_HEAD(&mem->soft_limit_list[SL_ANON]);
 	INIT_LIST_HEAD(&mem->soft_limit_list[SL_FILE]);
 	spin_lock_init(&mem->reclaim_param_lock);
+	INIT_WORK(&mem->soft_limit_work, mem_cgroup_update_soft_limit_work);
 
 	if (parent)
 		mem->swappiness = get_swappiness(parent);


* [RFC][PATCH 6/9] active inactive ratio for private
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2009-04-03  8:13 ` [RFC][PATCH 5/9] add more hooks and check in lazy manner KAMEZAWA Hiroyuki
@ 2009-04-03  8:14 ` KAMEZAWA Hiroyuki
  2009-04-03  8:15 ` [RFC][PATCH 7/9] vicitim selection logic KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The current memcg active/inactive ratio calculation ignores zones.
(It was designed for reducing memcg usage, not for recovering
 from memory shortage.)

But the soft limit should take care of zones, later.

Changelog v1->v2:
 - fixed buggy argument in mem_cgroup_inactive_anon_is_low

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    4 ++--
 mm/memcontrol.c            |   26 ++++++++++++++++++++------
 mm/vmscan.c                |    2 +-
 3 files changed, 23 insertions(+), 9 deletions(-)

Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -564,15 +564,28 @@ void mem_cgroup_record_reclaim_priority(
 	spin_unlock(&mem->reclaim_param_lock);
 }
 
-static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
+static int calc_inactive_ratio(struct mem_cgroup *memcg,
+			       unsigned long *present_pages,
+			       struct zone *z)
 {
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long gb;
 	unsigned long inactive_ratio;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_ANON);
-	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_ANON);
+	if (!z) {
+		inactive = mem_cgroup_get_local_zonestat(memcg,
+							 LRU_INACTIVE_ANON);
+		active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_ANON);
+	} else {
+		int nid = z->zone_pgdat->node_id;
+		int zid = zone_idx(z);
+		struct mem_cgroup_per_zone *mz;
+
+		mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+		inactive = MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
+		active = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON);
+	}
 
 	gb = (inactive + active) >> (30 - PAGE_SHIFT);
 	if (gb)
@@ -588,14 +601,14 @@ static int calc_inactive_ratio(struct me
 	return inactive_ratio;
 }
 
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *z)
 {
 	unsigned long active;
 	unsigned long inactive;
 	unsigned long present_pages[2];
 	unsigned long inactive_ratio;
 
-	inactive_ratio = calc_inactive_ratio(memcg, present_pages);
+	inactive_ratio = calc_inactive_ratio(memcg, present_pages, z);
 
 	inactive = present_pages[0];
 	active = present_pages[1];
@@ -2366,7 +2379,8 @@ static int mem_control_stat_show(struct 
 
 
 #ifdef CONFIG_DEBUG_VM
-	cb->fill(cb, "inactive_ratio", calc_inactive_ratio(mem_cont, NULL));
+	cb->fill(cb, "inactive_ratio",
+			calc_inactive_ratio(mem_cont, NULL, NULL));
 
 	{
 		int nid, zid;
Index: softlimit-test2/include/linux/memcontrol.h
===================================================================
--- softlimit-test2.orig/include/linux/memcontrol.h
+++ softlimit-test2/include/linux/memcontrol.h
@@ -93,7 +93,7 @@ extern void mem_cgroup_note_reclaim_prio
 							int priority);
 extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
 							int priority);
-int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
+int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *z);
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
@@ -234,7 +234,7 @@ static inline bool mem_cgroup_oom_called
 }
 
 static inline int
-mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
+mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg, struct zone *z)
 {
 	return 1;
 }
Index: softlimit-test2/mm/vmscan.c
===================================================================
--- softlimit-test2.orig/mm/vmscan.c
+++ softlimit-test2/mm/vmscan.c
@@ -1347,7 +1347,7 @@ static int inactive_anon_is_low(struct z
 	if (scanning_global_lru(sc))
 		low = inactive_anon_is_low_global(zone);
 	else
-		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup);
+		low = mem_cgroup_inactive_anon_is_low(sc->mem_cgroup, NULL);
 	return low;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 7/9] victim selection logic
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2009-04-03  8:14 ` [RFC][PATCH 6/9] active inactive ratio for private KAMEZAWA Hiroyuki
@ 2009-04-03  8:15 ` KAMEZAWA Hiroyuki
  2009-04-03  8:17 ` [RFC][PATCH 8/9] lru reordering KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Soft limit victim selection/cache logic.

This patch implements the victim selection logic and the caching method.

The victim memcg is selected as follows, assuming the zone under
shrinking is given. The selected memcg
  - has the highest priority (largest excess usage), and
  - has memory on that zone.

When a memcg is selected, it is rotated on its queue and cached per-cpu
with tickets.

The cache is refreshed when
  - the given tickets are exhausted,
  - a long time has passed since the last update, or
  - the cached memcg has no pages on the requested zone.

Even when no proper memcg is found by the victim selection logic,
some tickets are assigned to the NULL victim.

As with softlimitq, this cache keeps 2 entries, one for anon and one for file.

Change Log v1 -> v2:
 - clean up.
 - cpu hotplug support.
 - changed "bonus" calculation for the victim.
 - try to make the code slim.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  198 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 198 insertions(+)

Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -37,6 +37,8 @@
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
+#include <linux/cpu.h>
+
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1093,6 +1095,169 @@ static void mem_cgroup_update_soft_limit
 	return;
 }
 
+/* softlimit victim selection logic */
+
+/* Returns the amount of evictable memory in memcg */
+static unsigned long
+mem_cgroup_usage(struct mem_cgroup *mem, struct zone *zone, int file)
+{
+	struct mem_cgroup_per_zone *mz;
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	unsigned long usage = 0;
+	enum lru_list l = LRU_BASE;
+
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	if (file)
+		l += LRU_FILE;
+	usage = MEM_CGROUP_ZSTAT(mz, l) + MEM_CGROUP_ZSTAT(mz, l + LRU_ACTIVE);
+
+	return usage;
+}
+
+struct soft_limit_cache {
+	/* If ticket is 0, refresh and refill the cache.*/
+	int ticket[2];
+	/* next update time for ticket(jiffies)*/
+	unsigned long next_update;
+	/* victim memcg */
+	struct mem_cgroup *mem[2];
+};
+
+/*
+ * Typically, 32 pages are reclaimed per call. 4*32=128 pages as base ticket.
+ * 4 * prio scans are added as a bonus for high priority.
+ */
+#define SLCACHE_NULL_TICKET (4)
+#define SLCACHE_UPDATE_JIFFIES (HZ*5) /* 5 seconds is long enough. */
+DEFINE_PER_CPU(struct soft_limit_cache, soft_limit_cache);
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void forget_soft_limit_cache(long cpu)
+{
+	struct soft_limit_cache *slc;
+
+	slc = &per_cpu(soft_limit_cache, cpu);
+	slc->ticket[0] = 0;
+	slc->ticket[1] = 0;
+	slc->next_update = jiffies;
+	if (slc->mem[0])
+		mem_cgroup_put(slc->mem[0]);
+	if (slc->mem[1])
+		mem_cgroup_put(slc->mem[1]);
+	slc->mem[0] = NULL;
+	slc->mem[1] = NULL;
+}
+#endif
+
+
+/* This is called with preemption disabled. */
+static noinline void reload_softlimit_victim(struct soft_limit_cache *slc,
+				    struct zone *zone, int file)
+{
+	struct mem_cgroup *mem, *tmp;
+	struct list_head *queue, *cur;
+	int prio;
+	unsigned long usage = 0;
+
+	if (slc->mem[file]) {
+		mem_cgroup_put(slc->mem[file]);
+		slc->mem[file] = NULL;
+	}
+	slc->ticket[file] = SLCACHE_NULL_TICKET;
+	slc->next_update = jiffies + SLCACHE_UPDATE_JIFFIES;
+
+	/* quick check for the highest non-empty priority */
+	for (prio = SLQ_MAXPRIO - 1; prio > 0; prio--) {
+		if (!list_empty(&softlimitq.queue[prio][file]))
+			break;
+	}
+retry:
+	if (prio == 0)
+		return;
+
+	/* check queue in priority order */
+
+	queue = &softlimitq.queue[prio][file];
+
+	spin_lock(&softlimitq.lock);
+	mem = NULL;
+	/*
+	 * does same behavior as list_for_each_entry but
+	 * member for next entity depends on "file".
+	 */
+	list_for_each(cur, queue) {
+		if (!file)
+			tmp = container_of(cur, struct mem_cgroup,
+					   soft_limit_list[0]);
+		else
+			tmp = container_of(cur, struct mem_cgroup,
+					   soft_limit_list[1]);
+
+		usage = mem_cgroup_usage(tmp, zone, file);
+		if (usage) {
+			mem = tmp;
+			list_move_tail(&mem->soft_limit_list[file], queue);
+			break;
+		}
+	}
+	spin_unlock(&softlimitq.lock);
+
+	/* If not found, go to the next priority */
+	if (!mem) {
+		prio--;
+		goto retry;
+	}
+
+	if (!css_is_removed(&mem->css)) {
+		int bonus = 0;
+		unsigned long estimated_excess;
+		estimated_excess = totalram_pages/SLQ_PRIO_FACTOR;
+		estimated_excess <<= prio;
+		slc->mem[file] = mem;
+		/*
+		 * If not using hierarchy, this memcg itself consumes the
+		 * memory, so add an extra scan bonus to it. If it does use
+		 * hierarchy, this memcg itself may not be the bad one:
+		 * if its (anon or file) usage > 12% of the estimated
+		 * excess, add the extra bonus; if not, just a small scan.
+		 */
+		if (!mem->use_hierarchy || (usage > estimated_excess/8))
+			bonus = SLCACHE_NULL_TICKET * prio;
+		else
+			bonus = SLCACHE_NULL_TICKET; /* twice to NULL */
+		slc->ticket[file] += bonus;
+		mem_cgroup_get(mem);
+	}
+}
+
+static void slc_reset_cache_ticket(int file)
+{
+	struct soft_limit_cache *slc = &get_cpu_var(soft_limit_cache);
+
+	slc->ticket[file] = 0;
+	put_cpu_var(soft_limit_cache);
+}
+
+static struct mem_cgroup *get_soft_limit_victim(struct zone *zone, int file)
+{
+	struct mem_cgroup *ret;
+	struct soft_limit_cache *slc;
+
+	slc = &get_cpu_var(soft_limit_cache);
+	/*
+	 * If ticket is expired or long time since last ticket.
+	 * reload victim.
+	 */
+	if ((--slc->ticket[file] < 0) ||
+	    (time_after(jiffies, slc->next_update)))
+		reload_softlimit_victim(slc, zone, file);
+	ret = slc->mem[file];
+	put_cpu_var(soft_limit_cache);
+	return ret;
+}
+
+
 static void softlimitq_init(void)
 {
 	int i;
@@ -2780,3 +2945,36 @@ static int __init disable_swap_account(c
 }
 __setup("noswapaccount", disable_swap_account);
 #endif
+
+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * _NOW_, what we have to handle is just cpu removal.
+ */
+static int __cpuinit memcg_cpu_callback(struct notifier_block *nfb,
+					unsigned long action,
+					void *hcpu)
+{
+	long cpu = (long) hcpu;
+
+	switch (action) {
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		forget_soft_limit_cache(cpu);
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata soft_limit_notifier = {
+	&memcg_cpu_callback, NULL, 0
+};
+
+static int __cpuinit memcg_cpuhp_init(void)
+{
+	register_cpu_notifier(&soft_limit_notifier);
+	return 0;
+}
+__initcall(memcg_cpuhp_init);
+#endif


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 8/9] lru reordering
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (6 preceding siblings ...)
  2009-04-03  8:15 ` [RFC][PATCH 7/9] victim selection logic KAMEZAWA Hiroyuki
@ 2009-04-03  8:17 ` KAMEZAWA Hiroyuki
  2009-04-03  8:18 ` [RFC][PATCH 9/9] more event filter depend on priority KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch adds a function to change the LRU order of pages in global LRU
under control of memcg's victim of soft limit.

FILE and ANON victims are handled separately and LRU rotation is done
independently. (A memcg that contains only FILE cache or only ANON can
exist.)

This patch finds the specified number of pages on the memcg's LRU and
moves them to the top of the global LRU. They will be the first targets
of shrink_xxx_list.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   15 ++++++++++
 mm/memcontrol.c            |   67 +++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |   18 +++++++++++-
 3 files changed, 99 insertions(+), 1 deletion(-)

Index: softlimit-test2/include/linux/memcontrol.h
===================================================================
--- softlimit-test2.orig/include/linux/memcontrol.h
+++ softlimit-test2/include/linux/memcontrol.h
@@ -117,6 +117,9 @@ static inline bool mem_cgroup_disabled(v
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
 
+void mem_cgroup_soft_limit_reorder_lru(struct zone *zone,
+			       unsigned long nr_to_scan, enum lru_list l);
+int mem_cgroup_soft_limit_inactive_anon_is_low(struct zone *zone);
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct mem_cgroup;
 
@@ -264,6 +267,18 @@ mem_cgroup_print_oom_info(struct mem_cgr
 {
 }
 
+static inline void
+mem_cgroup_soft_limit_reorder_lru(struct zone *zone, unsigned long nr_to_scan,
+				  enum lru_list lru)
+{
+}
+
+static inline
+int mem_cgroup_soft_limit_inactive_anon_is_low(struct zone *zone)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -1257,6 +1257,73 @@ static struct mem_cgroup *get_soft_limit
 	return ret;
 }
 
+/*
+ * zone->lru and the memcg's lru are kept in sync under zone->lru_lock.
+ * This tries to rotate pages in the specified LRU.
+ */
+void mem_cgroup_soft_limit_reorder_lru(struct zone *zone,
+				      unsigned long nr_to_scan,
+				      enum lru_list l)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup_per_zone *mz;
+	int nid, zid, file;
+	unsigned long scan, flags;
+	struct list_head *src;
+	LIST_HEAD(found);
+	struct page_cgroup *pc;
+	struct page *page;
+
+	nid = zone->zone_pgdat->node_id;
+	zid = zone_idx(zone);
+
+	file = is_file_lru(l);
+
+	mem = get_soft_limit_victim(zone, file);
+	if (!mem)
+		return;
+	mz = mem_cgroup_zoneinfo(mem, nid, zid);
+	src = &mz->lists[l];
+	scan = 0;
+
+	/* Find at most nr_to_scan pages from local LRU */
+	spin_lock_irqsave(&zone->lru_lock, flags);
+	list_for_each_entry_reverse(pc, src, lru) {
+		if (scan >= nr_to_scan)
+			break;
+		/* We don't check Used bit */
+		page = pc->page;
+		/* Can happen ? */
+		if (unlikely(!PageLRU(page)))
+			continue;
+		/* This page is on (the same) LRU */
+		list_move(&page->lru, &found);
+		scan++;
+	}
+	/* vmscan searches pages from lru->prev. link this to lru->prev. */
+	list_splice_tail(&found, &zone->lru[l].list);
+	spin_unlock_irqrestore(&zone->lru_lock, flags);
+
+	/* When we cannot fill the request, check whether we should forget
+	   this cache or not */
+	if (scan < nr_to_scan &&
+	    !is_active_lru(l) &&
+	    mem_cgroup_usage(mem, zone, file) < SWAP_CLUSTER_MAX)
+		slc_reset_cache_ticket(file);
+}
+
+/* Returns 1 if a soft limit victim is cached and its inactive anon is low */
+int mem_cgroup_soft_limit_inactive_anon_is_low(struct zone *zone)
+{
+	struct soft_limit_cache *slc;
+	int ret = 0;
+
+	slc = &get_cpu_var(soft_limit_cache);
+	if (slc->mem[SL_ANON])
+		ret = mem_cgroup_inactive_anon_is_low(slc->mem[SL_ANON], zone);
+	put_cpu_var(soft_limit_cache);
+	return ret;
+}
 
 static void softlimitq_init(void)
 {
Index: softlimit-test2/mm/vmscan.c
===================================================================
--- softlimit-test2.orig/mm/vmscan.c
+++ softlimit-test2/mm/vmscan.c
@@ -1066,6 +1066,13 @@ static unsigned long shrink_inactive_lis
 	pagevec_init(&pvec, 1);
 
 	lru_add_drain();
+	if (scanning_global_lru(sc)) {
+		enum lru_list l = LRU_INACTIVE_ANON;
+		if (file)
+			l = LRU_INACTIVE_FILE;
+		mem_cgroup_soft_limit_reorder_lru(zone, max_scan, l);
+	}
+
 	spin_lock_irq(&zone->lru_lock);
 	do {
 		struct page *page;
@@ -1233,6 +1240,13 @@ static void shrink_active_list(unsigned 
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	lru_add_drain();
+	if (scanning_global_lru(sc)) {
+		enum lru_list l = LRU_ACTIVE_ANON;
+		if (file)
+			l = LRU_ACTIVE_FILE;
+		mem_cgroup_soft_limit_reorder_lru(zone, nr_pages, l);
+	}
+
 	spin_lock_irq(&zone->lru_lock);
 	pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
 					ISOLATE_ACTIVE, zone,
@@ -1328,7 +1342,9 @@ static int inactive_anon_is_low_global(s
 
 	if (inactive * zone->inactive_ratio < active)
 		return 1;
-
+	/* check soft limit victim's status */
+	if (mem_cgroup_soft_limit_inactive_anon_is_low(zone))
+		return 1;
 	return 0;
 }
 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH 9/9] more event filter depend on priority
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (7 preceding siblings ...)
  2009-04-03  8:17 ` [RFC][PATCH 8/9] lru reordering KAMEZAWA Hiroyuki
@ 2009-04-03  8:18 ` KAMEZAWA Hiroyuki
  2009-04-03  8:24 ` [RFC][PATCH ex/9] for debug KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	balbir@linux.vnet.ibm.com, kosaki.motohiro@jp.fujitsu.com

I'll revisit this one before v3...

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reduce the soft limit update rate depending on the memcg's priority (usage).

After this patch:
  if priority=0,1   -> check once per 1024 page-in/outs
  if priority=2,3   -> check once per 2048 page-in/outs
  ...
  if priority=10,11 -> check once per 32k page-in/outs

(Note: this check is done only when the usage exceeds the soft limit)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -940,7 +940,7 @@ static void record_last_oom(struct mem_c
 	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
 }
 
-#define SOFTLIMIT_EVENTS_THRESH (1024) /* 1024 times of page-in/out */
+#define SOFTLIMIT_EVENTS_THRESH (512) /* 512 page-in/out events */
 /*
  * Returns true if sum of page-in/page-out events since last check is
  * over SOFTLIMIT_EVENT_THRESH. (counter is per-cpu.)
@@ -950,11 +950,15 @@ static bool mem_cgroup_soft_limit_check(
 	bool ret = false;
 	int cpu = get_cpu();
 	s64 val;
+	int thresh;
 	struct mem_cgroup_stat_cpu *cpustat;
 
 	cpustat = &mem->stat.cpustat[cpu];
 	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
-	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
+	/* If usage is big, this check can be rough */
+	thresh = SOFTLIMIT_EVENTS_THRESH;
+	thresh <<= ((mem->soft_limit_priority >> 1) + 1);
+	if (unlikely(val > thresh)) {
 		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
 		ret = true;
 	}


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC][PATCH ex/9] for debug
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (8 preceding siblings ...)
  2009-04-03  8:18 ` [RFC][PATCH 9/9] more event filter depend on priority KAMEZAWA Hiroyuki
@ 2009-04-03  8:24 ` KAMEZAWA Hiroyuki
  2009-04-06  9:08 ` [RFC][PATCH 0/9] memcg soft limit v2 (new design) Balbir Singh
  2009-04-24 12:24 ` Balbir Singh
  11 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-03  8:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 396 bytes --]

This mail attaches a patch and a script for debugging.

soft_limit_show_prio.patch shows the priority in the memory.stat file.
I wonder whether I should add this to the patch series or not...

cgroup.rb and ctop.rb are my personal Ruby scripts, a utility for managing
cgroups. I sometimes use them. Place both files in the same directory and
run ctop.rb:

#ruby ctop.rb

The help screen will show what this is for.

Thanks,
-Kame

[-- Attachment #2: soft_limit_show_prio.patch --]
[-- Type: application/octet-stream, Size: 698 bytes --]

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Show internal control information of soft limit when DEBUG_VM is on.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    1 +
 1 file changed, 1 insertion(+)

Index: softlimit-test2/mm/memcontrol.c
===================================================================
--- softlimit-test2.orig/mm/memcontrol.c
+++ softlimit-test2/mm/memcontrol.c
@@ -2617,6 +2617,7 @@ static int mem_control_stat_show(struct 
 #ifdef CONFIG_DEBUG_VM
 	cb->fill(cb, "inactive_ratio",
 			calc_inactive_ratio(mem_cont, NULL, NULL));
+	cb->fill(cb, "soft_limit_prio", mem_cont->soft_limit_priority);
 
 	{
 		int nid, zid;

[-- Attachment #3: cgroup.rb --]
[-- Type: application/octet-stream, Size: 10798 bytes --]


require 'find'

$subsys_array = Array.new
$allsubsys = Hash.new
$allmounts = Hash.new

class Sub_system
  def initialize(name, mount, option)
    @name = name
    @mount= mount
    @hierarchy = Array.new
    if (option =~ /.*noprefix.*/) then
      @prefix =""
    else
      @prefix = name +"."
    end
   @option = option
   @writable_files = Array.new
  end

  def mount_point
    @mount
  end

  def type
    @name
  end

  def myfile(name, attr)
    name + "/" + @prefix + attr
  end

  def option
    @option
  end
  #
  # walk the directory tree and collect cgroup directories.
  #
  def reload
    @hierarchy.clear
    len = @mount.size
    Find.find(@mount) do |file|
      if File.directory?(file) then
	@hierarchy.push(file);
      end
    end
  end

  def each_cgroup(&block)
    @hierarchy.each(&block)
  end

  def ent(id)
    if (id < 0) then return nil
    end
    return @hierarchy.at(id)
  end

  def size
    @hierarchy.size
  end

  def stat (name)
    [["Not implemented", ""]]
  end

  def each_writable_files(name)
    @writable_files.each {|x| yield myfile(name,x)}
  end

  def tasks(name)
    list=Array.new
    begin
      File.open(name+"/tasks", "r") do |file|
        file.each_line do |x|
        x.chomp!
        list.push(x)
        end
      end
    rescue
      return nil
    end
    return list
  end
end

def read_oneline_file(filename)
  val=nil
  begin
    f = File.open(filename, "r")
    line = f.readline
    val = line.to_i
  rescue
    throw :readfailure,false
  ensure
    f.close if f != nil
  end
  return val
end

#
#for CPU subsystem
#
class Cpu_Subsys < Sub_system
  def initialize(mount, option)
    super("cpu", mount, option)
    @writable_files += ["shares"]
  end

  def read_share (name)
    ret = nil
    catch :readfailure do
      val = read_oneline_file(myfile(name,"shares"))
      return [val.to_s, val.to_s+" (100%)"] if (name == @mount)
      all=0

      dirname = File.dirname(name)
      Dir.foreach(dirname) do |x|
        next if ((x == ".") || (x == ".."))
        x = "#{dirname}/#{x}"
        next unless File.directory?(x)
        next unless File.exist?(myfile(x,"shares"))
        got = read_oneline_file(myfile(x,"shares"))
        all+=got
      end
      share=sprintf("%d (%.1f%%)", all, val*100.0/all)
      ret = [val.to_s, share]
    end
    return ret
  end

  def stat(name)
    level=0
    data = Array.new
    pos = @mount
    name_array = Array.new
    loop do
      name_array.push(name)
      break if name == @mount
      name = File.dirname(name)
    end
    name_array.reverse!
    name_array.each do |x|
      val = read_share(x)
      if val == nil then
        data = nil
        break
      end
      str = sprintf("%5s / %s", val[0], val[1])
      data.push([x, str])
    end
    return data if (data != nil && data.size > 0)
    return nil
  end
end

#
#for CPUacct subsystem
#
class Cpuacct_Subsys < Sub_system
  def initialize(mount, option)
    super("cpuacct", mount, option)
  end
  def stat(name)
    data = Array.new
    catch :read_failure do
      val = read_oneline_file(myfile(name, "usage"))
      data.push(["All", val.to_s])
      begin
        f = File.open(myfile(name,"usage_percpu"), "r")
        id=0
        line = f.readline
        while (line =~/\d+/) do
          line =$'
          data.push(["cpu"+id.to_s, $&])
          id += 1
        end 
      rescue
        data.clear
      ensure
        f.close if f != nil
      end
    return data if data.size > 0
    return nil
    end
  end
end

#
# For cpuset
#
class Cpuset_Subsys < Sub_system
  def initialize(mount, option)
    super("cpuset", mount, option)
    @elements =["cpu_exclusive","cpus","mems", "mem_exclusive","mem_hardwall",
                "memory_migrate", "memory_pressure", "memory_pressure_enabled",
                "memory_spread_page","memory_spread_slab",
                "sched_load_balance","sched_relax_domain_level"]
    @writable_files += @elements
  end
  def stat(name)
    data = Array.new
    for x in @elements
      begin
        filename = myfile(name, x)
        next unless (File.file?(filename))
        File.open(filename, "r") do | file |
          str = file.readline
          str.chomp!
          case x
          when "cpus", "mems"
            str = "empty" if (str == "")
          end
          data.push([x,str])
        end
      rescue
        #data = nil
        break
      end
    end
    return data
  end
end
#
#for Memory Subsys
#
def convert_bytes(bytes, precise)
  case
  when (precise == 0) && (bytes > 64 * 1024*1024*1024*1024)
    sprintf("Unlimited")
  when (precise == 0) && (bytes > 1024*1024*1024*1024)
    sprintf("%dT",bytes/1024/1024/1024/1024)
  when (precise == 0) && (bytes > 1024*1024*1024)
    sprintf("%dG", bytes/1024/1024/1024)
  when (bytes > 1024*1024)
    sprintf("%dM", bytes/1024/1024)
  when (bytes > 1024)
    sprintf("%dk", bytes/1024)
  else
    sprintf("%d", bytes)
  end
end

#
#for Memory Subsystem
#
class Memory_Subsys < Sub_system
  def initialize(mount, option)
    super("memory", mount, option)
    if (File.exist?("#{mount}/memory.memsw.usage_in_bytes")) then
      @memsw=true
    else
      @memsw=false
    end
    @writable_files += ["limit_in_bytes", "use_hierarchy","swappiness", "soft_limit_in_bytes"]
    if (@memsw) then
      @writable_files += ["memsw.limit_in_bytes"]
    end
  end
  #
  # Find the root directory of the hierarchy.
  #
  def find_hierarchy_root(name)
    cur=[name, File.dirname(name)]
    ret=@mount
    while (cur[0] != @mount)
      under = read_oneline_file("#{cur[1]}/memory.use_hierarchy")
      if (under == 0) then
        return cur[0]
      end
      cur[0] = cur[1]
      cur[1] = File.dirname(cur[1])
    end
    return ret
  end
  #
  # Generate an array for reporting status
  #
  def stat(name)
    data = Array.new

    success = catch(:readfailure) do
      under =read_oneline_file(myfile(name,"use_hierarchy"))
      if (under == 1) then
        str=find_hierarchy_root(name)
        if (str != name) then
          str="under #{str}"
          under=2
        else
          str="hierarchy ROOT"
        end
      else #Not under hierarchy
        str=""
      end
      ent = ["Memory Subsys", str]
      data.push(ent)
      
      # Limit and Usage
      x=Array.new
      x.push("Usage/Limit")

      bytes = read_oneline_file(myfile(name,"usage_in_bytes"))
      usage = convert_bytes(bytes, 1)

      if (@memsw) then
        bytes = read_oneline_file(myfile(name,"memsw.usage_in_bytes"))
        usage2 = convert_bytes(bytes, 1)
        usage = "#{usage} (#{usage2})"
      end
      bytes = read_oneline_file(myfile(name,"limit_in_bytes"))
      limit = convert_bytes(bytes, 0)
      usage = "#{usage} / #{limit}"
      if (@memsw) then
        bytes = read_oneline_file(myfile(name,"memsw.limit_in_bytes"))
        limit2 = convert_bytes(bytes, 0)
        usage = "#{usage} (#{limit2})"
      end
      x.push(usage)

      data.push(x)

      # MAX USAGE
      x = Array.new
      x.push("Max Usage")
      bytes = read_oneline_file(myfile(name, "max_usage_in_bytes"))
      usage = convert_bytes(bytes, 1)
      if (@memsw) then
        bytes = read_oneline_file(myfile(name,"memsw.max_usage_in_bytes"))
        usage2 = convert_bytes(bytes, 1)
        usage = "#{usage} (#{usage2})"
      end
      x.push(usage)
      data.push(x)

      # soft limit
      x = Array.new
      x.push("Soft limit")
      bytes = read_oneline_file(myfile(name, "soft_limit_in_bytes"))
      usage = convert_bytes(bytes, 1)
      x.push(usage)
      data.push(x)

      # failcnt
      x = Array.new
      x.push("Fail Count")
      cnt = read_oneline_file(myfile(name,"failcnt"))
      failcnt = cnt.to_s
      if (@memsw) then
        cnt = read_oneline_file(myfile(name, "memsw.failcnt"))
        failcnt="#{failcnt} (#{cnt.to_s})"
      end
      x.push(failcnt)
      data.push(x)

      begin
        f = File.open(myfile(name,"stat"), "r")
        for x in ["Cache","Rss","Pagein","Pageout",nil, nil, nil, nil, nil,
                  "HierarchyLimit","SubtreeCache","SubtreeRss", nil, nil,
		  nil, nil, nil, nil, nil, nil, "soft_limit_prio"]
          line =f.readline
          next if x == nil
          line =~ /^\S+\s+(.+)/
          val=$1
          case x
          when "Cache","Rss","SubtreeCache","SubtreeRss"
            bytes = convert_bytes(val.to_i, 1)
            data.push([x, bytes])
          when "Pagein","Pageout"
            data.push([x, val])
	  when "soft_limit_prio"
	    data.push([x, val])
          when "HierarchyLimit"
            memlimit = convert_bytes(val.to_i, 0)
            if (@memsw) then
              line =f.readline
              line =~ /^\S+\s+(.+)/
              memswlimit = convert_bytes($1.to_i, 0)
              memlimit += " (" + memswlimit + ")"
            end
            data.push([x, memlimit])
          end
        end
      ensure
        f.close if f != nil
      end
    true
    end
    return data if success==true
    return nil
  end
end

#
# Read /proc/mounts and parse each lines.
# When cgroup mount point is found, each subsystem's cgroups are added
# to subsystem's Hash.
#

def register_subsys(name, mount, option)
  if $allsubsys[name] == nil then
    subsys = nil
    case name
    when "cpu" then subsys = Cpu_Subsys.new(mount, option)
    when "cpuacct" then subsys = Cpuacct_Subsys.new(mount, option)
    when "memory" then  subsys = Memory_Subsys.new(mount, option)
    when "cpuset" then  subsys = Cpuset_Subsys.new(mount, option)
    end
    if subsys != nil then
      $subsys_array.push(name)
      $allsubsys[name] = subsys
    end
  end
end

#
# Read /proc/mounts and prepare subsys array
#
def parse_mount(line)
  parsed = line.split(/\s+/)
  if parsed[2] == "cgroup" then
    mount=parsed[1]
    opts=parsed[3].split(/\,/)
    opts.each do |name|
      case name
      when "rw" then next
      else
        register_subsys(name, mount, parsed[3])
        $allmounts[mount]=name
      end
    end
  end
end


def read_mount
  File.open("/proc/mounts", "r") do |file|
    file.each_line {|line| parse_mount(line) }
  end
  $subsys_array.sort!
end

#
# Read all /proc/mounts and scan directory under mount point.
#
def refresh_all
   $allmounts.clear
   $subsys_array.clear
   $allsubsys.clear
   read_mount
end

def check_and_refresh_mount_info
  
  mysubsys=Array.new
  File.open("/proc/mounts", "r") do |file|
    file.each_line do |line|
      parsed = line.split(/\s+/)
      if (parsed[2] == "cgroup")  then
        mysubsys.push(parsed[1])
      end
    end
  end

  if (mysubsys.size != $allmounts.size) then
    refresh_all
    return true 
  end
  
  mysubsys.each do |x|
    if ($allmounts[x] == nil) then
      refresh_all
      return true
    end
  end
  return false
end


[-- Attachment #4: ctop.rb --]
[-- Type: application/octet-stream, Size: 20106 bytes --]

#
# ctop.rb 
# written by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
# Copyright 2009 Fujitsu Limited
#
# Changelog:
#
# v003
#   - fixed bug in rmdir/mkdir
#   - changed command-mode interface
#   - added comments and made codes clean
# 
# v002 (2009/02/25)
#   - fixed leak of file descriptor
#   - mount/umount <-> reload data problem is fixed.
#   - "mount twice" problem is fixed.
#   - removed R key for reload all. it's now automatic
#   - handle "noprefix" mount option
#   - show mount option in help window
#   - add cpuset support
#   - add command-mode
#
# v001 (2009/02/04)
#   - first version released
#   - cpu, cpuacct, memory subsys is supported
#   known bugs -> noprefix, umount, mount twice
#
require 'cgroup.rb'
require 'curses'
require 'etc'
require 'timeout'
require 'singleton'

DIRWIN_LINES=7
DIRWIN_FIELDS= DIRWIN_LINES - 2
UPKEY=256
DOWNKEY=257
RIGHTKEY=258
LEFTKEY=259

#mode
SHOWMOUNT=0
SHOWTASKS=1
SHOWSUBSYS=2

#for 'ps'
PID=0
STATE=1
PPID=2
UID=3
COMMAND=4
PGID=5

#for process status filter
RUNNING=0

#
# Helper functions for curses
#

def hit_any_key(str, window)
  window.addstr(str) if str != nil
  window.addstr("\n[Hit Any Key]")
  window.getch
end

def window_printf(window, format, *arg)
  str = sprintf(format, *arg)
  window.addstr(str)
end
#
# Cursor holds the current status of a subsys's window.
#
class Cursor
  def initialize(name)
    @subsysname=name            # name of subsys
    @cursor=0                   # current directory position
    @mode=SHOWTASKS             # current mode (ps-mode/stat-mode)
    @info_startline=0           # used for scroll in infowin
    @info_endline=0             # used for scroll in infowin
    @show_only_running = 0      # a filter for ps-mode
    @user_name_filter=nil       # a filter for ps-mode
    @command_name_filter=nil    # a filter for ps-mode
  end

  def pos
    @cursor
  end

  def mode
    @mode
  end

  def change_mode # switch mode ps-mode <-> stat-mode
    case @mode
    when SHOWTASKS then @mode=SHOWSUBSYS
    when SHOWSUBSYS then @mode=SHOWTASKS
    end
  end
  #
  # Filter for PS-MODE
  #
  def process_status_filter(stat)
    return true if (@show_only_running == 0)
    return true if (stat =="R")
    return false
  end

  def user_name_filter(str)
    return true if (@user_name_filter == nil)
    return true if (@user_name_filter == str)
    return false
  end

  def command_name_filter(str)
    return true if (@command_name_filter == nil)
    return true if (str =~ /#{@command_name_filter}/)
    return false
  end

  def toggle_show_only_running
    if (@show_only_running == 0) then
      @show_only_running = 1 # show only running process in ps-mode
    else
      @show_only_running = 0 # show all processes
    end
  end

  def set_user_name_filter(str)
    str = nil if (str == "")
    @user_name_filter=str
  end

  def set_command_name_filter(str)
    str=nil if (str == "")
    @command_name_filter=str
  end

  #
  # Scroll management for infowin 
  #
  def info_startline
    @info_startline
  end

  def set_infoendline(num)
    @info_endline=num
  end

  def set_infoline(num)
    if ((num < 0) || (num >= @info_endline)) then
      @info_startline=0
    else
      @info_startline=num
    end
  end

  #
  # chdir() for subsys.
  #
  def move(direction)
    subsys =$allsubsys[@subsysname]
    if (subsys == nil) then return
    end
    if (direction == -1) then
      @cursor -= 1 if @cursor > 0
    elsif (direction == 1)
      @cursor += 1 if @cursor < subsys.size-1
    end
  end
end

#
# Current is a singleton that holds the current status of this program.
#
class Current
  include Singleton
  def initialize
    @index=-1    #current subsys index in $subsys_array[]
    @name=nil    #current name of subsys
    @cursor=nil  #reference to current Cursor 
    @subsys=nil  #reference to current Subsys 
    @subsys_cursor = Hash.new
  end

  def set(x)
    @index=x
    if (x == -1) then
      @index, @name, @cursor, @subsys = -1, "help", nil, nil
    else
      @name = $subsys_array[x]
      @subsys = $allsubsys[@name]
      if (@subsys_cursor[@name] == nil) then
        @subsys_cursor[@name] = Cursor.new(@name)
      end
      @cursor = @subsys_cursor[@name]
    end
  end

  #change subsys view
  def move (dir)
    case dir
    when "left"
      @index -= 1 if (@index > -1)
    when "right"
      @index += 1 if (@index < $subsys_array.size - 1)
    end
    set(@index)
  end

  #change directory view of current cursor
  def chdir(direction)
    if (@cursor != nil) then
      @cursor.move(direction)  
    end
  end
  #switch current mode of cursor
  def change_mode
    if (@cursor != nil) then
      @cursor.change_mode
    end
  end

  def name
    @name
  end

  def cursor
    @cursor
  end

  def subsys
    @subsys
  end
end

$cur = Current.instance

#
# Show directory window
#

def detect_dirlist_position(subsysname, subsys)
  pos = 0
  size=subsys.size
  cursor = $cur.cursor
  return [0, 0, 0] if cursor == nil

  pos = cursor.pos
  if ((size < 4) || (pos <= 2)) then 
      head=0
      tail=4
  elsif (pos < size - 2) then
      head=pos-1
      tail=pos+2
  else
      head = size - 4
      tail = size - 1
  end
  return [pos, head, tail]
end

def get_owner_name(name)
  begin
    stat = File.stat(name)
  rescue
    return ""
  end
  begin
    info = Etc::getpwuid(stat.uid)
    uname = info.name
  rescue
    $barwin.addstr($!)
    uname = stat.uid.to_s
  end

  begin
    info = Etc::getgrgid(stat.gid)
    gname = info.name
  rescue
    gname = stat.gid.to_s
  end
  sprintf("\t-\t(%s/%s)", uname, gname)
end

def draw_dirlist(dirwin, subsys)

  now, head, tail = detect_dirlist_position($cur.name, subsys)

  lines=1
  i=head
  while i <= tail
    name = subsys.ent(i)
    if (name == nil) then break
    end

    dirwin.setpos(lines, 3)
    dirwin.standout if (i == now)
    dirwin.addstr(name + get_owner_name(name))
    dirwin.standend if (i == now)
    lines+=1
    i += 1
  end
end

#
# Fill dirwin contents.
#
def draw_dirwin(dirwin)
  dirwin.clear
  dirwin.box(?|,?-,?*)
  dirwin.setpos(0, 1)

  #show all subsyss in head
  -1.upto($subsys_array.size - 1) do |x|
    dirwin.addstr("-")
    if (x == -1) then
      str="help"
    else
      str=sprintf("%s",$subsys_array[x])
    end
    break if (str == nil)

    dirwin.standout if (str == $cur.name)
    dirwin.addstr(str)
    dirwin.standend if (str == $cur.name)
  end

  #show current time
  dirwin.setpos(6,dirwin.maxx-32)
  dirwin.addstr("[#{Time.now.asctime}]")
  #
  # Show directory list
  #
  if $cur.subsys != nil then
    #Reload information 
    $cur.subsys.reload
    draw_dirlist(dirwin, $cur.subsys)
  end
end

#
#
# for infowin
#

#
# Contents of infowin are passed in via data[].
# This function shows the contents based on the current scroll information.
# To convert an array entry to a string, the given code block is called via yield.
#
def draw_infowin_limited(infowin, cursor, data)
  #
  # Generate Header
  #
  str = yield nil # write a header if necessary
  if (str != nil) then
    draw=1
    infowin.setpos(0,2)
    infowin.addstr(str)
  else
    draw=0
  end
  #
  # print the lines which fit in the window
  #
  startline = cursor.info_startline
  endline = cursor.info_startline + infowin.maxy-2
  startline.upto(endline) do |linenumber|
    x = data.at(linenumber)
    return if (x == nil) #no more data
    str = yield(x)
    infowin.setpos(draw, 2)
    infowin.addstr(str)
    draw = 1+infowin.cury
    break if (draw == infowin.maxy)
  end

  cursor.set_infoendline(data.size)  
end

#
#
# Show help and current mount information in help window
#
def show_mount_info(infowin)
  if ($allsubsys.empty?) then
    $barwin.addstr("cgroups are not mounted\n")
  end
  $allsubsys.each do |name, subsys|
    window_printf(infowin, "%12s\t%s\t#%s\n",
                       name, subsys.mount_point, subsys.option)
  end
  #$barwin.addstr("mounted subsystems")
  #
  # Help
  #
  infowin.addstr("Command\n")
  infowin.addstr("[LEFT, RIGHT]\t move subsystems\n")
  infowin.addstr("[UP, DOWN]\t move directory\n")
  infowin.addstr("[n, b]\t\t scroll information window\n")
  infowin.addstr("[s]\t\t switch shown information (ps-mode/stat-mode)\n")
  infowin.addstr("[r]\t\t set refresh rate\n")
  infowin.addstr("[c]\t\t Enter command-mode\n")

  infowin.addstr("ps mode option\n")
  infowin.addstr("[t]\t\t (ps-mode)toggle show only running process\n")
  infowin.addstr("[u]\t\t (ps-mode)set/unset user name filter\n")
  infowin.addstr("[f]\t\t (ps-mode)set/unset command name filter")

end

#
# Read the /proc/<pid>/status file, fill the data[] array and return it
#

def parse_pid_status(f, es)
  input = f.readline
  input =~ es
  return $1
end

#
# data[] =[PID, State, PPID, UID, COMMAND, PGID]
#
def parse_process(pid)
  #
  # Status
  #
  data = Array.new
  stat = nil

  stat = catch(:bad_task_status) do
    data[PID]=pid.to_i
    begin
      f = File.open("/proc/#{pid}/status", "r")

      #Name
      data[COMMAND] = parse_pid_status(f,/^Name:\s+(.+)/)
      unless (File.exist?("/proc/#{pid}/exe")) then
        data[COMMAND] = "[" + data[COMMAND] + "]"
      end
      #State
      data[STATE] = parse_pid_status(f, /^State:\s+([A-Z]).+/)
      # Tgid: is this the thread group leader?
      if (parse_pid_status(f, /^Tgid:\s+(.+)/) != pid) then
        throw :bad_task_status, false
      end
      #skip PID
      input = f.readline
      #PPID
      data[PPID]= parse_pid_status(f,/^PPid:\s+(.+)/)
      ppid=data[PPID]
      #TracerPID
      input = f.readline 
      #UID
      uid = parse_pid_status(f,/^Uid:\s+([0-9]+).+/)
      begin 
        info=Etc::getpwuid(uid.to_i)
        data[UID]=info.name
      rescue 
        data[UID]=uid
      end
    rescue
      throw :bad_task_status, false 
    ensure
      f.close unless f.nil?
    end
  end
  return data unless stat.nil?
  return nil
end

#
# PS-MODE
# Cat the "tasks" file and visit each /proc/<pid>/status file
# All information will be pushed into "ps" array
#
def show_tasks(subsys, cursor, infowin)
  # Get Name of Current Cgroup and read task file
  ps = Array.new  
  catch :quit do
    group = subsys.ent(cursor.pos)
    throw :quit,"nogroup" if group==nil
    tasks = subsys.tasks(group)
    throw :quit,"nogroup" if tasks==nil
     
    tasks.each do |x|
      data = parse_process(x)
      next if (data == nil)
      next unless (cursor.process_status_filter(data[STATE]))
      next unless (cursor.command_name_filter(data[COMMAND]))
      ps.push(data) if (cursor.user_name_filter(data[UID]))
    end
    #
    # Sort ps's result, "R" first.
    #
    ps.sort! do |x , y|
      if (x[STATE] == "R" && y[STATE] != "R") then
        -1
      elsif (x[STATE] != "R" && y[STATE] == "R") then
        1
      else
        0
      end
    end
  end
  
  return if (ps.size == 0)

  draw_infowin_limited(infowin, cursor, ps)do |x|
    if (x == nil) then
      sprintf("%6s %6s %8s %5s %16s", "PID","PPID","USER","STATE", "COMMAND")
    else
      sprintf("%6d %6d %8s %5s %16s",
              x[PID], x[PPID], x[UID], x[STATE], x[COMMAND])
    end
  end

  unless ($cur.cursor.process_status_filter("S")) then
    $barwin.addstr("[r]")
  end
  unless ($cur.cursor.user_name_filter("badnamemandab")) then
    $barwin.addstr("[u]")
  end
  unless ($cur.cursor.command_name_filter("badnamemandab")) then
    $barwin.addstr("[c]")
  end
end

def show_subsys_stat(subsys, cursor, infowin)
  group = subsys.ent(cursor.pos)
  return if group == nil
  data = subsys.stat(group)
  return if data == nil
  draw_infowin_limited(infowin, cursor, data) do |x|
    next if x == nil
    if (x[0].size > 24) then
      len = x[0].size - 24
      x[0].slice!(0..len)
    end
    sprintf("%24s\t%s", x[0], x[1])
  end
end


#
# [n],[b]  Move cursor's current position in infowin
#
def set_scroll(infowin, direction)
  cursor = $cur.cursor
  return if (cursor == nil)

  if (direction == 1) then 
    curline=cursor.info_startline
    cursor.set_infoline(curline+infowin.maxy)
  else
    curline=cursor.info_startline
    cursor.set_infoline(curline-infowin.maxy)
  end
end

#
# [t] Set/Unset Show-Running-Only filter
#
def toggle_running_filter
  if ($cur.cursor != nil) then
    $cur.cursor.toggle_show_only_running
  end
end


#
# Filters for ps-mode
#

#
# [u] Filter by UID
#
def user_name_filter(infowin)
  infowin.clear
  window_printf(infowin, "user name filter:")
  str=infowin.getstr
  cursor= $cur.cursor
  cursor.set_user_name_filter(str) if (cursor != nil)
end

#
# [f] Filter by name of command
#
def command_name_filter(infowin)
  infowin.clear
  window_printf(infowin, "command name filter:")
  str=infowin.getstr
  cursor =$cur.cursor
  cursor.set_command_name_filter(str) if (cursor != nil)
end

#
# [r] set refresh time
#
def set_refresh_time(time, infowin)
  infowin.clear
  window_printf(infowin, "set refresh time(now %ds)",time)
  str=infowin.getstr
  return time if (str.to_i == 0)
  return str.to_i
end

#
# [c] Below are subroutines for command-mode.
#

def smart_print(str, window)
  if (window.maxx - window.curx < str.size-2) then
    window.addstr("\n"+str)
  else
    window.addstr(str)
  end
end

def show_writable_files(subsys, cursor, infowin)
  group = subsys.ent(cursor.pos)
  return nil if group == nil
  ent=1
  data = Array.new
  subsys.each_writable_files(group) do |x|
    str = sprintf("%2d: %s ", ent, File.basename(x))
    ent=ent+1
    smart_print(str, infowin)
    data.push(x)
  end
  infowin.refresh
  return data
end

#
# Scan directory and change owner/group of all regular files
# and current directory.
#
def chown_all_files(uid, gid, group, infowin)
  # change owner/group of current dir
  begin
    File.chown(uid, gid, group)
  rescue
    hit_any_key("Error:"+$!, infowin)
    return
  end
  # change owner/group of regular files
  Dir.foreach(group) do |x|
    name = group+"/"+x
    next if File.directory?(name)
    begin
      File.chown(nil, gid, name)
    rescue
      hit_any_key("Error:"+$!, infowin)
      break
    end
  end
end

#
# Check whether "/" is included in the name at mkdir/rmdir
#
def check_mkrmdir_string(str, infowin)
  if (str =~ /\//) then
    infowin.addstr("don't include /\n")
    return false
  elsif (str == ".") then
    infowin.addstr("can't remove current\n")
    return false
  end
  return true
end

#
# Parse a string and return uid or gid as an integer
#
def parse_id(window, uid, str)
  if (str =~ /\D/) then
    begin
      if (uid == 1) then
        info = Etc::getpwnam(str)
        id = info.uid
      else
        info = Etc::getgrnam(str)
        id = info.gid
      end
    rescue
      hit_any_key("Error:"+$!, window)
      id=nil
    end
  else
    id = str.to_i
  end
  return id
end

#
#
# Command mode interface
#
#
def command_mode(infowin)
  return if ($cur.subsys == nil)
  infowin.clear
  $barwin.clear
  $barwin.addstr("[command-mode]")
  $barwin.refresh
  #
  # Subsys special files are selected by number
  #
  infowin.addstr("====subsys command====\n")
  data = show_writable_files($cur.subsys, $cur.cursor, infowin)
  if data==nil then
    infowin.addstr("no subsys command")
  end
  #
  # Cgroup generic ops are selected by letter
  #
  infowin.addstr("\n====cgroup command====\n")
  smart_print("[A] attach task(PID)", infowin)
  smart_print(" [M] mkdir", infowin)
  smart_print(" [R] rmdir",infowin)
  smart_print(" [O] chown(OWNER)", infowin)
  smart_print(" [G] chown(GID)", infowin)
  infowin.addstr("\n\nModify which ? [and Hit return]:")

  #line to show prompt
  endline = infowin.cury+1
  #wait for the numbers or AOGMR
  str=infowin.getstr
  #target directory is this.
  group = $cur.subsys.ent($cur.cursor.pos)

  case str.to_i # if str is not number, returns 0.
  # Subsystem commands
  when 1..99
    if (data != nil) then
      name = data.at(str.to_i - 1)
      #get input
      infowin.setpos(endline, 0)
      window_printf(infowin, "#echo to >%s:", File.basename(name))
      str = infowin.getstr
      #write
      begin
        f = File.open(name, "w") {|f| f.write(str) }
      rescue
        hit_any_key("Error:"+$!, infowin)
      end
    end
  # Cgroup commands (str.to_i returns 0)
  when 0
    case str
    when "a","A" #Attach
      window_printf(infowin, "Attach task to %s:", group)
      str = infowin.getstr
      begin 
        File.open(group + "/tasks", "w") {|f| f.write(str) }
      rescue
        hit_any_key("Error:"+$!, infowin)
      end

    when "o","O" #chown (OWNER)
      infowin.addstr("change owner id of all files to:")
      id = parse_id(infowin, 1, infowin.getstr)
      chown_all_files(id, -1, group, infowin) if id != nil

    when "g","G" #chown (GROUP)
      infowin.addstr("change group id of all files to:")
      id = parse_id(infowin, 0, infowin.getstr)
      chown_all_files(-1, id, group, infowin) if id != nil

    when "m","M" #mkdir
      infowin.addstr("mkdir -.enter name:")
      str = infowin.getstr
      if (check_mkrmdir_string(str, infowin)) then
        begin
          if (Dir.mkdir(group+"/"+str) != 0) then
            hit_any_key("Error:"+$!, infowin)
          end
        rescue
          hit_any_key("Error:"+$!, infowin)
        end
      else
        hit_any_key(nil, infowin)
      end

    when "r","R" #rmdir
      infowin.addstr("rmdir -.enter name:")
      str = infowin.getstr
      if (check_mkrmdir_string(str, infowin)) then
        begin
          if (Dir.rmdir(group+"/"+str) != 0) then
            hit_any_key("Error:"+$!, infowin)
          end
        rescue
          hit_any_key("Error:"+$!, infowin)
        end
      else
        hit_any_key(nil, infowin)
      end
    end
  end
  $barwin.clear
end

#
# Main draw routine
#
def draw_infowin(infowin)
  infowin.clear
  cursor = $cur.cursor
  if cursor == nil then
    mode = SHOWMOUNT
  else
    mode = cursor.mode
  end
  #
  # If no subsys is specified, just show mount information.
  #

  case mode
  when SHOWMOUNT
    show_mount_info(infowin)
  when SHOWTASKS
    $barwin.addstr("[ps-mode]")
    show_tasks($cur.subsys, cursor, infowin)
  when SHOWSUBSYS
    $barwin.addstr("[stat-mode]")
    show_subsys_stat($cur.subsys, cursor, infowin)
  end
end

#
# Main loop
#
#
# For stdscreen
#
# Check /proc/mounts and read all subsys.
#
refresh_all

#
# Main loop. create windows and wait for inputs
#
Curses::init_screen
begin
  $lines=Curses::lines
  $cols=Curses::cols
  off=0
  #
  # Create window
  #
  dirwin = Curses::stdscr.subwin(DIRWIN_LINES, $cols, off, 0)
  #for misc info
  off+=DIRWIN_LINES
  $barwin = Curses::stdscr.subwin(1, $cols, off, 0);
  $barwin.standout
  off+=1
  infowin = Curses::stdscr.subwin($lines-off, $cols, off, 0)
  mode=SHOWTASKS  
  quit=0
  refresh_time=15
  
  while quit == 0 
    #$barwin.clear

    #$barwin.addstr("Info:")
    draw_dirwin(dirwin)
    draw_infowin(infowin)
    dirwin.refresh
    infowin.refresh
    $barwin.refresh
    #
    # handle input.
    # 
    $barwin.clear
    Curses::setpos(0,0)
    ch=0
    Curses::noecho
    begin
      Timeout::timeout(refresh_time) do
        ch=Curses::getch
      end
    rescue Timeout::Error
      #$barwin.addstr("timeout")
    end
    Curses::echo
    #check escape sequence
    if ch == 27 then
      ch = Curses::getch
      if ch == 91 then
        ch = Curses::getch
        case ch
          when 65 then ch = UPKEY
          when 66 then ch = DOWNKEY
          when 67 then ch = RIGHTKEY
          when 68 then ch = LEFTKEY
        end
      end
    end
    #
    #
    #
    if (check_and_refresh_mount_info) then
      $cur.set(-1)
    end
   
    #$barwin.addstr(Time.now.asctime)
    case ch
      when ?q
        quit=1
        break
      when LEFTKEY then $cur.move("left")
      when RIGHTKEY then $cur.move("right")
      when UPKEY then $cur.chdir(-1)
      when DOWNKEY then $cur.chdir(1)
      when ?s then $cur.change_mode
      when ?n then set_scroll(infowin, 1)
      when ?b then set_scroll(infowin, -1)
      when ?t then toggle_running_filter
      when ?u then user_name_filter(infowin)
      when ?f then command_name_filter(infowin)
      when ?c then command_mode(infowin)
      when ?r then refresh_time=set_refresh_time(refresh_time, infowin)
    end
  end
ensure
  Curses::close_screen
end

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 0/9] memcg soft limit v2 (new design)
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (9 preceding siblings ...)
  2009-04-03  8:24 ` [RFC][PATCH ex/9] for debug KAMEZAWA Hiroyuki
@ 2009-04-06  9:08 ` Balbir Singh
  2009-04-07  0:16   ` KAMEZAWA Hiroyuki
  2009-04-24 12:24 ` Balbir Singh
  11 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-04-06  9:08 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:08:35]:

> Hi,
> 
> Memory cgroup's soft limit feature is a feature to tell global LRU 
> "please reclaim from this memcg at memory shortage".
> 
> This is v2. Fixed some troubles under hierarchy. and increase soft limit
> update hooks to proper places.
> 
> This patch is on to
>   mmotom-Mar23 + memcg-cleanup-cache_charge.patch
>   + vmscan-fix-it-to-take-care-of-nodemask.patch
> 
> So, not for wide use ;)
> 
> This patch tries to avoid to use existing memcg's reclaim routine and
> just tell "Hints" to global LRU. This patch is briefly tested and shows
> good result to me. (But may not to you. plz brame me.)
> 
> Major characteristic is.
>  - memcg will be inserted to softlimit-queue at charge() if usage excess
>    soft limit.
>  - softlimit-queue is a queue with priority. priority is detemined by size
>    of excessing usage.

This is critical and good that you have this now. In my patchset, it
helps me achieve a lot of the expected functionality.

>  - memcg's soft limit hooks is called by shrink_xxx_list() to show hints.

I am not too happy with moving pages in the global LRU based on soft
limits, per my comments earlier. My objection is not too strong,
since reclaiming from the memcg also exhibits functionally similar
behaviour.

>  - Behavior is affected by vm.swappiness and LRU scan rate is determined by
>    global LRU's status.
> 

I also have concerns about not sorting the list of memcgs. I need to
write some scalability tests and check.

> In this v2.
>  - problems under use_hierarchy=1 case are fixed.
>  - more hooks are added.
>  - codes are cleaned up.
> 
> Shows good results on my private box test under several work loads.
> 
> But in special artificial case, when victim memcg's Active/Inactive ratio of
> ANON is very different from global LRU, the result seems not very good.
> i.e.
>   under vicitm memcg, ACTIVE_ANON=100%, INACTIVE=0% (access memory in busy loop)
>   under global, ACTIVE_ANON=10%, INACTIVE=90% (almost all processes are sleeping.)
> memory can be swapped out from global LRU, not from vicitm.
> (If there are file cache in victims, file cacahes will be out.)
> 
> But, in this case, even if we successfully swap out anon pages under victime memcg,
> they will come back to memory soon and can show heavy slashing.

heavy slashing? Not sure I understand what you mean.

> 
> While using soft limit, I felt this is useful feature :)
> But keep this RFC for a while. I'll prepare Documentation until the next post.
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 3/9] soft limit update filter
  2009-04-03  8:12 ` [RFC][PATCH 3/9] soft limit update filter KAMEZAWA Hiroyuki
@ 2009-04-06  9:43   ` Balbir Singh
  2009-04-07  0:04     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-04-06  9:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:02]:

> No changes from v1.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Check/Update softlimit information at every charge is over-killing, so
> we need some filter.
> 
> This patch tries to count events in the memcg and if events > threshold
> tries to update memcg's soft limit status and reset event counter to 0.
> 
> Event counter is maintained by per-cpu which has been already used,
> Then, no siginificant overhead(extra cache-miss etc..) in theory.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> Index: mmotm-2.6.29-Mar23/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.29-Mar23.orig/mm/memcontrol.c
> +++ mmotm-2.6.29-Mar23/mm/memcontrol.c
> @@ -66,6 +66,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> 
> +	MEM_CGROUP_STAT_EVENTS,  /* sum of page-in/page-out for internal use */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
> 
> @@ -105,6 +106,22 @@ static s64 mem_cgroup_local_usage(struct
>  	return ret;
>  }
> 
> +/* For intenal use of per-cpu event counting. */
> +
> +static inline void
> +__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
> +		enum mem_cgroup_stat_index idx)
> +{
> +	stat->count[idx] = 0;
> +}

Why do we do this and why do we need a special event?

> +
> +static inline s64
> +__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
> +			    enum mem_cgroup_stat_index idx)
> +{
> +	return stat->count[idx];
> +}
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -235,6 +252,8 @@ static void mem_cgroup_charge_statistics
>  	else
>  		__mem_cgroup_stat_add_safe(cpustat,
>  				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
> +	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
> +
>  	put_cpu();
>  }
> 
> @@ -897,9 +916,26 @@ static void record_last_oom(struct mem_c
>  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
>  }
> 
> +#define SOFTLIMIT_EVENTS_THRESH (1024) /* 1024 times of page-in/out */
> +/*
> + * Returns true if sum of page-in/page-out events since last check is
> + * over SOFTLIMIT_EVENT_THRESH. (counter is per-cpu.)
> + */
>  static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
>  {
> -	return false;
> +	bool ret = false;
> +	int cpu = get_cpu();
> +	s64 val;
> +	struct mem_cgroup_stat_cpu *cpustat;
> +
> +	cpustat = &mem->stat.cpustat[cpu];
> +	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
> +	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
> +		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
> +		ret = true;
> +	}
> +	put_cpu();
> +	return ret;
>  }
>

It is good to have the caller and the function in the same patch;
otherwise, you'll notice unused-function warnings. I think this
function can be simplified further:

1. Let's get rid of MEM_CGROUP_STAT_EVENTS
2. Let's rewrite mem_cgroup_soft_limit_check as:

static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
{
     bool ret = false;
     int cpu = get_cpu();
     s64 pgin, pgout, val;
     struct mem_cgroup_stat_cpu *cpustat;

     cpustat = &mem->stat.cpustat[cpu];
     pgin = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_PGPGIN_COUNT);
     pgout = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
     val = pgin + pgout - mem->last_event_count;
     if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
             mem->last_event_count = pgin + pgout;
             ret = true;
     }
     put_cpu();
     return ret;
}

mem->last_event_count can either be atomic or protected using one of
the locks you intend to introduce. This will avoid the overhead of
incrementing the event counter at every charge_statistics() call.


 
>  static void mem_cgroup_update_soft_limit(struct mem_cgroup *mem)
> 
> 

-- 
	Balbir


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC][PATCH 4/9] soft limit queue and priority
  2009-04-03  8:12 ` [RFC][PATCH 4/9] soft limit queue and priority KAMEZAWA Hiroyuki
@ 2009-04-06 11:05   ` Balbir Singh
  2009-04-06 23:55     ` KAMEZAWA Hiroyuki
  2009-04-06 18:42   ` Balbir Singh
  1 sibling, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-04-06 11:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:48]:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Softlimitq. for memcg.
> 
> Implements an array of queue to list memcgs, array index is determined by
> the amount of memory usage excess the soft limit.
> 
> While Balbir's one uses RB-tree and my old one used a per-zone queue
> (with round-robin), this is one of mixture of them.
> (I'd like to use rotation of queue in later patches)
> 
> Priority is determined by following.
>    Assume unit = total pages/1024. (the code uses different value)
>    if excess is...
>       < unit,          priority = 0, 
>       < unit*2,        priority = 1,
>       < unit*2*2,      priority = 2,
>       ...
>       < unit*2^9,      priority = 9,
>       < unit*2^10,     priority = 10, (> 50% to total mem)
> 
> This patch just includes queue management part and not includes 
> selection logic from queue. Some trick will be used for selecting victims at
> soft limit in efficient way.
> 
> And this equips 2 queues, for anon and file. Inset/Delete of both list is
> done at once but scan will be independent. (These 2 queues are used later.)
> 
> Major difference from Balbir's one other than RB-tree is bahavior under
> hierarchy. This one adds all children to queue by checking hierarchical
> priority. This is for helping per-zone usage check on victim-selection logic.
> 
> Changelog: v1->v2
>  - fixed comments.
>  - change base size to exponent.
>  - some micro optimization to reduce code size.
>  - considering memory hotplug, it's not good to record a value calculated
>    from totalram_pages at boot and using it later is bad manner. Fixed it.
>  - removed soft_limit_lock (spinlock) 
>  - added soft_limit_update counter for avoiding mulptiple update at once.
>    
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 117 insertions(+), 1 deletion(-)
> 
> Index: softlimit-test2/mm/memcontrol.c
> ===================================================================
> --- softlimit-test2.orig/mm/memcontrol.c
> +++ softlimit-test2/mm/memcontrol.c
> @@ -192,7 +192,14 @@ struct mem_cgroup {
>  	atomic_t	refcnt;
> 
>  	unsigned int	swappiness;
> -
> +	/*
> +	 * For soft limit.
> +	 */
> +	int soft_limit_priority;
> +	struct list_head soft_limit_list[2];
> +#define SL_ANON (0)
> +#define SL_FILE (1)

Comments for the #define please.

> +	atomic_t soft_limit_update;
>  	/*
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
> @@ -938,11 +945,115 @@ static bool mem_cgroup_soft_limit_check(
>  	return ret;
>  }
> 
> +/*
> + * Assume "base_amount", and excess = usage - soft limit.
> + *
> + * 0...... if excess < base_amount
> + * 1...... if excess < base_amount * 2
> + * 2...... if excess < base_amount * 2^2
> + * 3.......if excess < base_amount * 2^3
> + * ....
> + * 9.......if excess < base_amount * 2^9
> + * 10 .....if excess < base_amount * 2^10
> + *
> + * base_amount is detemined from total pages in the system.
> + */
> +
> +#define SLQ_MAXPRIO (11)
> +static struct {
> +	spinlock_t lock;
> +	struct list_head queue[SLQ_MAXPRIO][2]; /* 0:anon 1:file */
> +} softlimitq;
> +
> +#define SLQ_PRIO_FACTOR (1024) /* 2^10 */
> +
> +static int __calc_soft_limit_prio(unsigned long excess)
> +{
> +	unsigned long factor = totalram_pages /SLQ_PRIO_FACTOR;

I would prefer to use global_lru_pages()

> +
> +	return fls(excess/factor);
> +}
> +
> +static int mem_cgroup_soft_limit_prio(struct mem_cgroup *mem)
> +{
> +	unsigned long excess, max_excess = 0;
> +	struct res_counter *c = &mem->res;
> +
> +	do {
> +		excess = res_counter_soft_limit_excess(c) >> PAGE_SHIFT;
> +		if (max_excess < excess)
> +			max_excess = excess;
                max_excess = max(max_excess, excess);
> +		c = c->parent;
> +	} while (c);
> +
> +	return __calc_soft_limit_prio(max_excess);
> +}
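For illustration, the bucketing above can be modeled in plain userspace C. This is only a sketch, not the kernel code: fls_sketch() stands in for the kernel's fls(), and the hierarchy walk is folded away.

```c
#include <assert.h>

#define SLQ_PRIO_FACTOR 1024	/* 2^10, as in the patch */

/* Minimal stand-in for the kernel's fls(): position of the most
 * significant set bit (1-based), or 0 when x is 0. */
static int fls_sketch(unsigned long x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* excess and total are in pages.  With unit = total / 1024 this yields
 * 0 for excess < unit, 1 for excess < 2*unit, ... 10 for excess < 2^10*unit
 * (i.e. priority 10 means the excess is more than half of total memory). */
static int calc_soft_limit_prio(unsigned long excess, unsigned long total)
{
	unsigned long unit = total / SLQ_PRIO_FACTOR;

	return fls_sketch(excess / unit);
}
```

Dividing by the unit and taking the highest set bit gives exactly the exponential buckets described in the comment block above.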
> +
> +static void __mem_cgroup_requeue(struct mem_cgroup *mem, int prio)
> +{
> +	/* enqueue to softlimit queue */
> +	int i;
> +
> +	spin_lock(&softlimitq.lock);
> +	if (prio != mem->soft_limit_priority) {
> +		mem->soft_limit_priority = prio;
> +		for (i = 0; i < 2; i++) {
> +			list_del_init(&mem->soft_limit_list[i]);
> +			list_add_tail(&mem->soft_limit_list[i],
> +				      &softlimitq.queue[prio][i]);
> +		}
> +	}
> +	spin_unlock(&softlimitq.lock);
> +}
> +
> +static void __mem_cgroup_dequeue(struct mem_cgroup *mem)
> +{
> +	int i;
> +
> +	spin_lock(&softlimitq.lock);
> +	for (i = 0; i < 2; i++)
> +		list_del_init(&mem->soft_limit_list[i]);
> +	spin_unlock(&softlimitq.lock);
> +}
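The requeue above is a no-op when the priority is unchanged; otherwise the memcg moves to the tail of the new bucket. A toy userspace model of that bookkeeping (illustrative only: list membership is reduced to per-bucket counters, and the memcg is assumed to already sit on its current bucket):

```c
#include <assert.h>

#define SLQ_MAXPRIO 11

static int bucket_len[SLQ_MAXPRIO];	/* stand-in for the per-priority lists */

struct toy_memcg {
	int prio;	/* current bucket, like mem->soft_limit_priority */
};

/* Mirror of __mem_cgroup_requeue(): do nothing when the priority is
 * unchanged, otherwise "list_del_init()" from the old bucket and
 * "list_add_tail()" onto the new one. */
static void requeue(struct toy_memcg *m, int new_prio)
{
	if (new_prio == m->prio)
		return;
	bucket_len[m->prio]--;
	bucket_len[new_prio]++;
	m->prio = new_prio;
}
```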
> +
> +static int
> +__mem_cgroup_update_soft_limit_cb(struct mem_cgroup *mem, void *data)
> +{
> +	int priority;
> +	/* If someone updates, we don't need more */
> +	priority = mem_cgroup_soft_limit_prio(mem);
> +
> +	if (priority != mem->soft_limit_priority)
> +		__mem_cgroup_requeue(mem, priority);
> +	return 0;
> +}
> +
>  static void mem_cgroup_update_soft_limit(struct mem_cgroup *mem)
>  {
> +	int priority;
> +
> +	/* check status change */
> +	priority = mem_cgroup_soft_limit_prio(mem);
> +	if (priority != mem->soft_limit_priority &&
> +	    atomic_inc_return(&mem->soft_limit_update) > 1) {
> +		mem_cgroup_walk_tree(mem, NULL,
> +				     __mem_cgroup_update_soft_limit_cb);
> +		atomic_set(&mem->soft_limit_update, 0);
> +	}
>  	return;
>  }
> 
> +static void softlimitq_init(void)
> +{
> +	int i;
> +
> +	spin_lock_init(&softlimitq.lock);
> +	for (i = 0; i < SLQ_MAXPRIO; i++) {
> +		INIT_LIST_HEAD(&softlimitq.queue[i][SL_ANON]);
> +		INIT_LIST_HEAD(&softlimitq.queue[i][SL_FILE]);
> +	}
> +}
> +
>  /*
>   * Unlike exported interface, "oom" parameter is added. if oom==true,
>   * oom-killer can be invoked.
> @@ -2512,6 +2623,7 @@ mem_cgroup_create(struct cgroup_subsys *
>  	if (cont->parent == NULL) {
>  		enable_swap_cgroup();
>  		parent = NULL;
> +		softlimitq_init();
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
>  		mem->use_hierarchy = parent->use_hierarchy;
> @@ -2532,6 +2644,9 @@ mem_cgroup_create(struct cgroup_subsys *
>  		res_counter_init(&mem->memsw, NULL);
>  	}
>  	mem->last_scanned_child = 0;
> +	mem->soft_limit_priority = 0;
> +	INIT_LIST_HEAD(&mem->soft_limit_list[SL_ANON]);
> +	INIT_LIST_HEAD(&mem->soft_limit_list[SL_FILE]);
>  	spin_lock_init(&mem->reclaim_param_lock);
> 
>  	if (parent)
> @@ -2556,6 +2671,7 @@ static void mem_cgroup_destroy(struct cg
>  {
>  	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> 
> +	__mem_cgroup_dequeue(mem);
>  	mem_cgroup_put(mem);
>  }
> 
> 
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: [RFC][PATCH 4/9] soft limit queue and priority
  2009-04-03  8:12 ` [RFC][PATCH 4/9] soft limit queue and priority KAMEZAWA Hiroyuki
  2009-04-06 11:05   ` Balbir Singh
@ 2009-04-06 18:42   ` Balbir Singh
  2009-04-06 23:54     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-04-06 18:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:48]:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Softlimitq. for memcg.
> 
> Implements an array of queues to list memcgs; the array index is determined
> by the amount by which memory usage exceeds the soft limit.
> 
> While Balbir's one uses an RB-tree and my old one used a per-zone queue
> (with round-robin), this one is a mixture of the two.
> (I'd like to use rotation of the queue in later patches.)
> 
> Priority is determined as follows:
>    Assume unit = total pages/1024. (the code uses different value)
>    if excess is...
>       < unit,          priority = 0, 
>       < unit*2,        priority = 1,
>       < unit*2*2,      priority = 2,
>       ...
>       < unit*2^9,      priority = 9,
>       < unit*2^10,     priority = 10, (> 50% to total mem)
> 
> This patch includes just the queue management part, not the selection logic
> from the queue. Some trick will be used for selecting victims at soft limit
> in an efficient way.
> 
> And this equips 2 queues, for anon and file. Insert/delete of both lists is
> done at once but scans will be independent. (These 2 queues are used later.)
> 
> The major difference from Balbir's one, other than the RB-tree, is behavior
> under hierarchy. This one adds all children to the queue by checking
> hierarchical priority. This helps the per-zone usage check in the
> victim-selection logic.
> 
> Changelog: v1->v2
>  - fixed comments.
>  - change base size to exponent.
>  - some micro optimization to reduce code size.
>  - considering memory hotplug, recording a value calculated from
>    totalram_pages at boot and using it later is bad manners. Fixed it.
>  - removed soft_limit_lock (spinlock) 
>  - added soft_limit_update counter for avoiding multiple updates at once.
>    
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 117 insertions(+), 1 deletion(-)
> 
> Index: softlimit-test2/mm/memcontrol.c
> ===================================================================
> --- softlimit-test2.orig/mm/memcontrol.c
> +++ softlimit-test2/mm/memcontrol.c
> @@ -192,7 +192,14 @@ struct mem_cgroup {
>  	atomic_t	refcnt;
> 
>  	unsigned int	swappiness;
> -
> +	/*
> +	 * For soft limit.
> +	 */
> +	int soft_limit_priority;
> +	struct list_head soft_limit_list[2];

Looking at the rest of the code in the patch, it is not apparent why we
need two list_heads / arrays of list_heads.

-- 
	Balbir



* Re: [RFC][PATCH 4/9] soft limit queue and priority
  2009-04-06 18:42   ` Balbir Singh
@ 2009-04-06 23:54     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-06 23:54 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

On Tue, 7 Apr 2009 00:12:21 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:48]:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Softlimitq. for memcg.
> > 
> > Implements an array of queues to list memcgs; the array index is determined
> > by the amount by which memory usage exceeds the soft limit.
> > 
> > While Balbir's one uses an RB-tree and my old one used a per-zone queue
> > (with round-robin), this one is a mixture of the two.
> > (I'd like to use rotation of the queue in later patches.)
> > 
> > Priority is determined as follows:
> >    Assume unit = total pages/1024. (the code uses different value)
> >    if excess is...
> >       < unit,          priority = 0, 
> >       < unit*2,        priority = 1,
> >       < unit*2*2,      priority = 2,
> >       ...
> >       < unit*2^9,      priority = 9,
> >       < unit*2^10,     priority = 10, (> 50% to total mem)
> > 
> > This patch includes just the queue management part, not the selection logic
> > from the queue. Some trick will be used for selecting victims at soft limit
> > in an efficient way.
> > 
> > And this equips 2 queues, for anon and file. Insert/delete of both lists is
> > done at once but scans will be independent. (These 2 queues are used later.)
> > 
> > The major difference from Balbir's one, other than the RB-tree, is behavior
> > under hierarchy. This one adds all children to the queue by checking
> > hierarchical priority. This helps the per-zone usage check in the
> > victim-selection logic.
> > 
> > Changelog: v1->v2
> >  - fixed comments.
> >  - change base size to exponent.
> >  - some micro optimization to reduce code size.
> >  - considering memory hotplug, recording a value calculated from
> >    totalram_pages at boot and using it later is bad manners. Fixed it.
> >  - removed soft_limit_lock (spinlock) 
> >  - added soft_limit_update counter for avoiding multiple updates at once.
> >    
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 117 insertions(+), 1 deletion(-)
> > 
> > Index: softlimit-test2/mm/memcontrol.c
> > ===================================================================
> > --- softlimit-test2.orig/mm/memcontrol.c
> > +++ softlimit-test2/mm/memcontrol.c
> > @@ -192,7 +192,14 @@ struct mem_cgroup {
> >  	atomic_t	refcnt;
> > 
> >  	unsigned int	swappiness;
> > -
> > +	/*
> > +	 * For soft limit.
> > +	 */
> > +	int soft_limit_priority;
> > +	struct list_head soft_limit_list[2];
> 
> Looking at the rest of the code in the patch, it is not apparent why we
> need two list_heads / arrays of list_heads.
> 

Considering LRU rotation: it is done separately for anon and file pages, per zone.

   ACTIVE -> INACTIVE -> out.

And there can be "file only" or "anon only" cgroups.

Then, we have 2 design choices:

  1. Use one list for selecting victims.
     If the target memory type (FILE/ANON) is empty, select another victim.
  2. Use two lists for selecting victims.
     FILE and ANON victim selection can be done independently of each other.

This series uses "2", because "1" can make the "ticket" parameter useless in
victim selection.

Sorry for short text.
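A minimal sketch of choice "2" in userspace C (not the actual patch code; it assumes victims with a larger excess, i.e. a higher priority, are tried first, and reduces each queue slot to a single memcg id for brevity):

```c
#include <assert.h>

#define SLQ_MAXPRIO 11
enum { SL_ANON = 0, SL_FILE = 1 };

/* Toy model: each [priority][type] slot holds at most one memcg id,
 * 0 meaning "empty". */
static int softlimitq[SLQ_MAXPRIO][2];

/* Victim selection per memory type, scanning from the largest excess
 * down.  The two types never look at each other's queue, so an
 * "anon only" cgroup cannot stall file victim selection and vice versa. */
static int select_victim(int type)
{
	int prio;

	for (prio = SLQ_MAXPRIO - 1; prio >= 0; prio--)
		if (softlimitq[prio][type])
			return softlimitq[prio][type];
	return 0;	/* nothing over its soft limit for this type */
}
```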

Thanks,
-Kame


* Re: [RFC][PATCH 4/9] soft limit queue and priority
  2009-04-06 11:05   ` Balbir Singh
@ 2009-04-06 23:55     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-06 23:55 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

On Mon, 6 Apr 2009 16:35:34 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:48]:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Softlimitq. for memcg.
> > 
> > Implements an array of queues to list memcgs; the array index is determined
> > by the amount by which memory usage exceeds the soft limit.
> > 
> > While Balbir's one uses an RB-tree and my old one used a per-zone queue
> > (with round-robin), this one is a mixture of the two.
> > (I'd like to use rotation of the queue in later patches.)
> > 
> > Priority is determined as follows:
> >    Assume unit = total pages/1024. (the code uses different value)
> >    if excess is...
> >       < unit,          priority = 0, 
> >       < unit*2,        priority = 1,
> >       < unit*2*2,      priority = 2,
> >       ...
> >       < unit*2^9,      priority = 9,
> >       < unit*2^10,     priority = 10, (> 50% to total mem)
> > 
> > This patch includes just the queue management part, not the selection logic
> > from the queue. Some trick will be used for selecting victims at soft limit
> > in an efficient way.
> > 
> > And this equips 2 queues, for anon and file. Insert/delete of both lists is
> > done at once but scans will be independent. (These 2 queues are used later.)
> > 
> > The major difference from Balbir's one, other than the RB-tree, is behavior
> > under hierarchy. This one adds all children to the queue by checking
> > hierarchical priority. This helps the per-zone usage check in the
> > victim-selection logic.
> > 
> > Changelog: v1->v2
> >  - fixed comments.
> >  - change base size to exponent.
> >  - some micro optimization to reduce code size.
> >  - considering memory hotplug, recording a value calculated from
> >    totalram_pages at boot and using it later is bad manners. Fixed it.
> >  - removed soft_limit_lock (spinlock) 
> >  - added soft_limit_update counter for avoiding multiple updates at once.
> >    
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 117 insertions(+), 1 deletion(-)
> > 
> > Index: softlimit-test2/mm/memcontrol.c
> > ===================================================================
> > --- softlimit-test2.orig/mm/memcontrol.c
> > +++ softlimit-test2/mm/memcontrol.c
> > @@ -192,7 +192,14 @@ struct mem_cgroup {
> >  	atomic_t	refcnt;
> > 
> >  	unsigned int	swappiness;
> > -
> > +	/*
> > +	 * For soft limit.
> > +	 */
> > +	int soft_limit_priority;
> > +	struct list_head soft_limit_list[2];
> > +#define SL_ANON (0)
> > +#define SL_FILE (1)
> 
> Comments for the #define please.
> 
Sure.

> > +	atomic_t soft_limit_update;
> >  	/*
> >  	 * statistics. This must be placed at the end of memcg.
> >  	 */
> > @@ -938,11 +945,115 @@ static bool mem_cgroup_soft_limit_check(
> >  	return ret;
> >  }
> > 
> > +/*
> > + * Assume "base_amount", and excess = usage - soft limit.
> > + *
> > + * 0...... if excess < base_amount
> > + * 1...... if excess < base_amount * 2
> > + * 2...... if excess < base_amount * 2^2
> > + * 3.......if excess < base_amount * 2^3
> > + * ....
> > + * 9.......if excess < base_amount * 2^9
> > + * 10 .....if excess < base_amount * 2^10
> > + *
> > + * base_amount is determined from total pages in the system.
> > + */
> > +
> > +#define SLQ_MAXPRIO (11)
> > +static struct {
> > +	spinlock_t lock;
> > +	struct list_head queue[SLQ_MAXPRIO][2]; /* 0:anon 1:file */
> > +} softlimitq;
> > +
> > +#define SLQ_PRIO_FACTOR (1024) /* 2^10 */
> > +
> > +static int __calc_soft_limit_prio(unsigned long excess)
> > +{
> > +	unsigned long factor = totalram_pages /SLQ_PRIO_FACTOR;
> 
> I would prefer to use global_lru_pages()
> 
Hmm, ok.

Thanks,
-Kame


* Re: [RFC][PATCH 3/9] soft limit update filter
  2009-04-06  9:43   ` Balbir Singh
@ 2009-04-07  0:04     ` KAMEZAWA Hiroyuki
  2009-04-07  2:26       ` Balbir Singh
  0 siblings, 1 reply; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-07  0:04 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

On Mon, 6 Apr 2009 15:13:51 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:02]:
> 
> > No changes from v1.
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Checking/updating soft limit information at every charge is overkill, so
> > we need some filter.
> > 
> > This patch counts events in the memcg and, if events > threshold, updates
> > the memcg's soft limit status and resets the event counter to 0.
> > 
> > The event counter is maintained in the per-cpu area that is already in
> > use, so there is no significant overhead (extra cache misses etc.) in theory.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> > Index: mmotm-2.6.29-Mar23/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.29-Mar23.orig/mm/memcontrol.c
> > +++ mmotm-2.6.29-Mar23/mm/memcontrol.c
> > @@ -66,6 +66,7 @@ enum mem_cgroup_stat_index {
> >  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> >  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> > 
> > +	MEM_CGROUP_STAT_EVENTS,  /* sum of page-in/page-out for internal use */
> >  	MEM_CGROUP_STAT_NSTATS,
> >  };
> > 
> > @@ -105,6 +106,22 @@ static s64 mem_cgroup_local_usage(struct
> >  	return ret;
> >  }
> > 
> > +/* For internal use of per-cpu event counting. */
> > +
> > +static inline void
> > +__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
> > +		enum mem_cgroup_stat_index idx)
> > +{
> > +	stat->count[idx] = 0;
> > +}
> 
> Why do we do this and why do we need a special event?
> 
2 points.

  1.  We do "reset" this counter.
  2.  We're counting page-in/page-out. I wonder whether I should count others...

> > +
> > +static inline s64
> > +__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
> > +			    enum mem_cgroup_stat_index idx)
> > +{
> > +	return stat->count[idx];
> > +}
> > +
> >  /*
> >   * per-zone information in memory controller.
> >   */
> > @@ -235,6 +252,8 @@ static void mem_cgroup_charge_statistics
> >  	else
> >  		__mem_cgroup_stat_add_safe(cpustat,
> >  				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
> > +	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
> > +
> >  	put_cpu();
> >  }
> > 
> > @@ -897,9 +916,26 @@ static void record_last_oom(struct mem_c
> >  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> >  }
> > 
> > +#define SOFTLIMIT_EVENTS_THRESH (1024) /* 1024 times of page-in/out */
> > +/*
> > + * Returns true if sum of page-in/page-out events since last check is
> > + * over SOFTLIMIT_EVENTS_THRESH. (the counter is per-cpu.)
> > + */
> >  static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
> >  {
> > -	return false;
> > +	bool ret = false;
> > +	int cpu = get_cpu();
> > +	s64 val;
> > +	struct mem_cgroup_stat_cpu *cpustat;
> > +
> > +	cpustat = &mem->stat.cpustat[cpu];
> > +	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
> > +	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
> > +		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
> > +		ret = true;
> > +	}
> > +	put_cpu();
> > +	return ret;
> >  }
> >
> 
> It is good to have the caller and the function in the same patch.
> Otherwise, you'll notice unused warnings. I think this function can be
> simplified further
> 
> 1. Let's get rid of MEM_CGROUP_STAT_EVENTS
> 2. Let's rewrite mem_cgroup_soft_limit_check as
> 
> static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
> {
>      bool ret = false;
>      int cpu = get_cpu();
>      s64 val, pgin, pgout;
>      struct mem_cgroup_stat_cpu *cpustat;
> 
>      cpustat = &mem->stat.cpustat[cpu];
>      pgin = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_PGPGIN_COUNT);
>      pgout = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
>      val = pgin + pgout - mem->last_event_count;
>      if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
>              mem->last_event_count = pgin + pgout;
>              ret = true;
>      }
>      put_cpu();
>      return ret;
> }
> 
> mem->last_event_count can either be atomic or protected using one of
> the locks you intend to introduce. This will avoid the overhead of
> incrementing the event counter at every charge_statistics().
> 
Incrementing always hits the cache (the per-cpu stat is already hot), so the
overhead is small.

Hmm, by making mem->last_event_count per-cpu, we can do the above, and there is
probably no difference from the current code. But since you don't seem to like
the extra counting, it's ok to change the shape.
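For illustration, the threshold filter under discussion can be modeled in userspace C like this (a sketch only: the per-cpu aspect is dropped and the counter is a plain field, and the names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

#define SOFTLIMIT_EVENTS_THRESH 1024	/* as in the patch */

struct toy_stat {
	long events;	/* bumped on every page-in/page-out */
};

/* Returns true roughly once per SOFTLIMIT_EVENTS_THRESH events, so the
 * expensive soft-limit update runs on only a small fraction of charges. */
static bool soft_limit_check(struct toy_stat *st)
{
	if (++st->events > SOFTLIMIT_EVENTS_THRESH) {
		st->events = 0;
		return true;
	}
	return false;
}
```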


Thanks,
-Kame


* Re: [RFC][PATCH 0/9] memcg soft limit v2 (new design)
  2009-04-06  9:08 ` [RFC][PATCH 0/9] memcg soft limit v2 (new design) Balbir Singh
@ 2009-04-07  0:16   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-07  0:16 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

On Mon, 6 Apr 2009 14:38:00 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:08:35]:
> 
> > Hi,
> > 
> > Memory cgroup's soft limit feature is a feature to tell global LRU 
> > "please reclaim from this memcg at memory shortage".
> > 
> > This is v2. Fixed some troubles under hierarchy. and increase soft limit
> > update hooks to proper places.
> > 
> > This patch is on to
> >   mmotom-Mar23 + memcg-cleanup-cache_charge.patch
> >   + vmscan-fix-it-to-take-care-of-nodemask.patch
> > 
> > So, not for wide use ;)
> > 
> > This patch tries to avoid to use existing memcg's reclaim routine and
> > just tell "Hints" to global LRU. This patch is briefly tested and shows
> > good result to me. (But may not to you. plz brame me.)
> > 
> > Major characteristic is.
> >  - memcg will be inserted to softlimit-queue at charge() if usage excess
> >    soft limit.
> >  - softlimit-queue is a queue with priority. priority is detemined by size
> >    of excessing usage.
> 
> It is critical and good that you have this now. In my patchset, it
> helps me achieve a lot of the expected functionality.
> 
> >  - memcg's soft limit hooks is called by shrink_xxx_list() to show hints.
> 
> Based on my comments earlier, I am not too happy with moving pages in the
> global LRU based on soft limits. My objection is not too strong, since
> reclaiming from the memcg also exhibits functionally similar behaviour.
Yes, there is not much difference from memcg's reclaim routine, other than
that this is called under scanning_global_lru() == true.

> 
> >  - Behavior is affected by vm.swappiness and LRU scan rate is determined by
> >    global LRU's status.
> > 
> 
> I also have concerns about not sorting the list of memcg's. I need to
> write some scalability tests and check.

Ah yes, I admit scalability is my concern, too. 

About sorting: this priority list uses the exponent of the excess as its
parameter. Then,
  when the excess is small, priority control is done under close observation;
  when the excess is big, priority control is done under rough observation.

Now I'm wondering how big ->ticket can get.


> 
> > In this v2.
> >  - problems under use_hierarchy=1 case are fixed.
> >  - more hooks are added.
> >  - codes are cleaned up.
> > 
> > Shows good results on my private box test under several work loads.
> > 
> > But in special artificial case, when victim memcg's Active/Inactive ratio of
> > ANON is very different from global LRU, the result seems not very good.
> > i.e.
> >   under victim memcg, ACTIVE_ANON=100%, INACTIVE=0% (access memory in busy loop)
> >   under global, ACTIVE_ANON=10%, INACTIVE=90% (almost all processes are sleeping.)
> > memory can be swapped out from global LRU, not from victim.
> > (If there are file caches in victims, file caches will be out.)
> > 
> > But, in this case, even if we successfully swap out anon pages under victim memcg,
> > they will come back to memory soon and can show heavy slashing.
> 
> heavy slashing? Not sure I understand what you mean.
> 
Heavy swap-in <-> swap-out (i.e. thrashing), so user applications can't make progress.

Thanks,
-Kame


* Re: [RFC][PATCH 3/9] soft limit update filter
  2009-04-07  0:04     ` KAMEZAWA Hiroyuki
@ 2009-04-07  2:26       ` Balbir Singh
  0 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2009-04-07  2:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-07 09:04:38]:

> On Mon, 6 Apr 2009 15:13:51 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:12:02]:
> > 
> > > No changes from v1.
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > Checking/updating soft limit information at every charge is overkill, so
> > > we need some filter.
> > > 
> > > This patch counts events in the memcg and, if events > threshold, updates
> > > the memcg's soft limit status and resets the event counter to 0.
> > > 
> > > The event counter is maintained in the per-cpu area that is already in
> > > use, so there is no significant overhead (extra cache misses etc.) in theory.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > > Index: mmotm-2.6.29-Mar23/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-2.6.29-Mar23.orig/mm/memcontrol.c
> > > +++ mmotm-2.6.29-Mar23/mm/memcontrol.c
> > > @@ -66,6 +66,7 @@ enum mem_cgroup_stat_index {
> > >  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> > >  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> > > 
> > > +	MEM_CGROUP_STAT_EVENTS,  /* sum of page-in/page-out for internal use */
> > >  	MEM_CGROUP_STAT_NSTATS,
> > >  };
> > > 
> > > @@ -105,6 +106,22 @@ static s64 mem_cgroup_local_usage(struct
> > >  	return ret;
> > >  }
> > > 
> > > +/* For internal use of per-cpu event counting. */
> > > +
> > > +static inline void
> > > +__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
> > > +		enum mem_cgroup_stat_index idx)
> > > +{
> > > +	stat->count[idx] = 0;
> > > +}
> > 
> > Why do we do this and why do we need a special event?
> > 
> 2 points.
> 
>   1.  We do "reset" this counter.
>   2.  We're counting page-in/page-out. I wonder whether I should count others...
> 
> > > +
> > > +static inline s64
> > > +__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
> > > +			    enum mem_cgroup_stat_index idx)
> > > +{
> > > +	return stat->count[idx];
> > > +}
> > > +
> > >  /*
> > >   * per-zone information in memory controller.
> > >   */
> > > @@ -235,6 +252,8 @@ static void mem_cgroup_charge_statistics
> > >  	else
> > >  		__mem_cgroup_stat_add_safe(cpustat,
> > >  				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
> > > +	__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
> > > +
> > >  	put_cpu();
> > >  }
> > > 
> > > @@ -897,9 +916,26 @@ static void record_last_oom(struct mem_c
> > >  	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
> > >  }
> > > 
> > > +#define SOFTLIMIT_EVENTS_THRESH (1024) /* 1024 times of page-in/out */
> > > +/*
> > > + * Returns true if sum of page-in/page-out events since last check is
> > > + * over SOFTLIMIT_EVENTS_THRESH. (the counter is per-cpu.)
> > > + */
> > >  static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
> > >  {
> > > -	return false;
> > > +	bool ret = false;
> > > +	int cpu = get_cpu();
> > > +	s64 val;
> > > +	struct mem_cgroup_stat_cpu *cpustat;
> > > +
> > > +	cpustat = &mem->stat.cpustat[cpu];
> > > +	val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
> > > +	if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
> > > +		__mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
> > > +		ret = true;
> > > +	}
> > > +	put_cpu();
> > > +	return ret;
> > >  }
> > >
> > 
> > It is good to have the caller and the function in the same patch.
> > Otherwise, you'll notice unused warnings. I think this function can be
> > simplified further
> > 
> > 1. Let's get rid of MEM_CGROUP_STAT_EVENTS
> > 2. Let's rewrite mem_cgroup_soft_limit_check as
> > 
> > static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
> > {
> >      bool ret = false;
> >      int cpu = get_cpu();
> >      s64 val, pgin, pgout;
> >      struct mem_cgroup_stat_cpu *cpustat;
> > 
> >      cpustat = &mem->stat.cpustat[cpu];
> >      pgin = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_PGPGIN_COUNT);
> >      pgout = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
> >      val = pgin + pgout - mem->last_event_count;
> >      if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
> >              mem->last_event_count = pgin + pgout;
> >              ret = true;
> >      }
> >      put_cpu();
> >      return ret;
> > }
> > 
> > mem->last_event_count can either be atomic or protected using one of
> > the locks you intend to introduce. This will avoid the overhead of
> > incrementing the event counter at every charge_statistics().
> > 
> Incrementing always hits the cache (the per-cpu stat is already hot), so the
> overhead is small.
> 
> Hmm, by making mem->last_event_count per-cpu, we can do the above, and there
> is probably no difference from the current code. But since you don't seem to
> like the extra counting, it's ok to change the shape.
>

I was wondering why we were adding another EVENT counter when we can sum up
pgpgin and pgpgout, but we already have the infrastructure to make EVENTS
per-cpu, so let's stick with it for now.

-- 
	Balbir



* Re: [RFC][PATCH 0/9] memcg soft limit v2 (new design)
  2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
                   ` (10 preceding siblings ...)
  2009-04-06  9:08 ` [RFC][PATCH 0/9] memcg soft limit v2 (new design) Balbir Singh
@ 2009-04-24 12:24 ` Balbir Singh
  2009-04-24 15:19   ` KAMEZAWA Hiroyuki
  11 siblings, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-04-24 12:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kosaki.motohiro@jp.fujitsu.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03 17:08:35]:

> Hi,
> 
> Memory cgroup's soft limit feature is a feature to tell global LRU 
> "please reclaim from this memcg at memory shortage".
> 
> This is v2. Fixed some troubles under hierarchy. and increase soft limit
> update hooks to proper places.
> 
> This patch is on to
>   mmotom-Mar23 + memcg-cleanup-cache_charge.patch
>   + vmscan-fix-it-to-take-care-of-nodemask.patch
> 
> So, not for wide use ;)
> 
> This patch tries to avoid to use existing memcg's reclaim routine and
> just tell "Hints" to global LRU. This patch is briefly tested and shows
> good result to me. (But may not to you. plz brame me.)
> 
> Major characteristic is.
>  - memcg will be inserted to softlimit-queue at charge() if usage excess
>    soft limit.
>  - softlimit-queue is a queue with priority. priority is detemined by size
>    of excessing usage.
>  - memcg's soft limit hooks is called by shrink_xxx_list() to show hints.
>  - Behavior is affected by vm.swappiness and LRU scan rate is determined by
>    global LRU's status.
> 
> In this v2.
>  - problems under use_hierarchy=1 case are fixed.
>  - more hooks are added.
>  - codes are cleaned up.
>

The results seem good so far with some basic tests I've been doing.
I'll come back with more feedback; I would like to see this feature in
-mm soon.

-- 
	Balbir



* Re: [RFC][PATCH 0/9] memcg soft limit v2 (new design)
  2009-04-24 12:24 ` Balbir Singh
@ 2009-04-24 15:19   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 22+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-24 15:19 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kosaki.motohiro@jp.fujitsu.com

Balbir Singh wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-03
> 17:08:35]:
>> In this v2:
>>  - problems under the use_hierarchy=1 case are fixed.
>>  - more hooks are added.
>>  - the code is cleaned up.
>>
>
> The results seem good so far with some basic tests I've been doing.
> I'll come back with more feedback, I would like to see this feature in
> -mm soon.
>
Thank you. I'll update this. But right now I have a bugfix patch for
stale swap caches (in cooperation with Nishimura); then I'll go ahead one by one.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-04-24 15:19 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-03  8:08 [RFC][PATCH 0/9] memcg soft limit v2 (new design) KAMEZAWA Hiroyuki
2009-04-03  8:09 ` [RFC][PATCH 1/9] " KAMEZAWA Hiroyuki
2009-04-03  8:10 ` [RFC][PATCH 2/9] soft limit framework for memcg KAMEZAWA Hiroyuki
2009-04-03  8:12 ` [RFC][PATCH 3/9] soft limit update filter KAMEZAWA Hiroyuki
2009-04-06  9:43   ` Balbir Singh
2009-04-07  0:04     ` KAMEZAWA Hiroyuki
2009-04-07  2:26       ` Balbir Singh
2009-04-03  8:12 ` [RFC][PATCH 4/9] soft limit queue and priority KAMEZAWA Hiroyuki
2009-04-06 11:05   ` Balbir Singh
2009-04-06 23:55     ` KAMEZAWA Hiroyuki
2009-04-06 18:42   ` Balbir Singh
2009-04-06 23:54     ` KAMEZAWA Hiroyuki
2009-04-03  8:13 ` [RFC][PATCH 5/9] add more hooks and check in lazy manner KAMEZAWA Hiroyuki
2009-04-03  8:14 ` [RFC][PATCH 6/9] active inactive ratio for private KAMEZAWA Hiroyuki
2009-04-03  8:15 ` [RFC][PATCH 7/9] vicitim selection logic KAMEZAWA Hiroyuki
2009-04-03  8:17 ` [RFC][PATCH 8/9] lru reordering KAMEZAWA Hiroyuki
2009-04-03  8:18 ` [RFC][PATCH 9/9] more event filter depend on priority KAMEZAWA Hiroyuki
2009-04-03  8:24 ` [RFC][PATCH ex/9] for debug KAMEZAWA Hiroyuki
2009-04-06  9:08 ` [RFC][PATCH 0/9] memcg soft limit v2 (new design) Balbir Singh
2009-04-07  0:16   ` KAMEZAWA Hiroyuki
2009-04-24 12:24 ` Balbir Singh
2009-04-24 15:19   ` KAMEZAWA Hiroyuki
