* [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3
@ 2009-09-09 8:39 KAMEZAWA Hiroyuki
2009-09-09 8:41 ` [RFC][PATCH 1/4][mmotm] memcg: soft limit clean up KAMEZAWA Hiroyuki
` (5 more replies)
0 siblings, 6 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-09 8:39 UTC (permalink / raw)
To: linux-mm@kvack.org; +Cc: balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp
This patch series reduces memcg's lock contention on res_counter, v3.
(I'm sending it today just to report the current status in my stack.)
It has been reported that memcg's res_counter can cause heavy false sharing /
lock contention and that scalability is not good. This series relaxes that.
No terrible bugs have been found; I'll maintain/update this until the end of
the next merge window. Tests on big-SMP boxes and new good ideas are welcome.
This series is on top of mmotm + Nishimura's fix + Hugh's get_user_pages()
patch, but it can be applied directly against mmotm, I think.
numbers:
I used an 8-CPU x86-64 box and ran 'make -j 12' on a kernel tree.
Before each make, I ran 'make clean' and dropped caches.
[Before patch(mmotm)] 3 runs
real 3m1.127s
user 4m42.143s
sys 6m22.588s
real 3m0.942s
user 4m42.377s
sys 6m24.463s
real 2m53.982s
user 4m42.635s
sys 6m23.124s
[After patch] 3 runs.
real 2m53.052s
user 4m48.095s
sys 5m43.042s
real 2m54.367s
user 4m43.738s
sys 5m40.626s
real 2m55.108s
user 4m43.367s
sys 5m40.265s
You can see that 'sys' time is reduced.
Thanks,
-Kame
* [RFC][PATCH 1/4][mmotm] memcg: soft limit clean up
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
@ 2009-09-09 8:41 ` KAMEZAWA Hiroyuki
[not found] ` <661de9470909090410t160454a2k658c980b92d11612@mail.gmail.com>
2009-09-09 8:41 ` [RFC][PATCH 2/4][mmotm] clean up charge path of softlimit KAMEZAWA Hiroyuki
` (4 subsequent siblings)
5 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-09 8:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
This patch cleans up/fixes memcg's uncharge soft limit path.
This is also a preparation for batched uncharge.
Problems:
Currently, res_counter_charge()/uncharge() handle softlimit information at
charge/uncharge, and the softlimit check is done when the event counter per
memcg goes over its threshold. But the event counter per memcg is updated
only when the memcg is over its soft limit, and ancestors are handled in the
charge path but not in the uncharge path.
For batched charge/uncharge, the event counter check should be stricter.
Prolems:
1. memcg's event counter is incremented only when the softlimit is hit.
That's bad; it makes the event counter hard to reuse for other purposes.
2. At uncharge, only the lowest-level res_counter is handled. This is a bug:
because the ancestors' event counters are not incremented, the children
have to take care of them.
3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
Operations under res_counter->lock should be small; having no "if"
statement there is better.
Fixes:
* Removed soft_limit_xx poitner and checsk from charge and uncharge.
The check-only-when-necessary scheme works well enough without them.
* Make memcg's event counter checked at every charge/uncharge.
(the per-cpu area will be accessed soon anyway)
* All ancestors are checked at the soft-limit check. This is necessary
because an ancestor's event counter may otherwise never be modified; they
should be checked at the same time.
Todo:
We may need to modify EVENT_COUNTER_THRESH for a parent with many children.
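As a rough illustration of the check-only-when-necessary idea (this is a
simplified sketch with placeholder names, not the mmotm code; the real
implementation keeps the event counter in per-cpu statistics rather than in
a single atomic_t):

#include <asm/atomic.h>

/* placeholder threshold; the real name/value may differ */
#define EVENT_COUNTER_THRESH	1000

/*
 * Sketch: every charge/uncharge bumps the per-memcg event counter; only
 * when the counter crosses the threshold do we pay for the soft-limit
 * RB-tree update (mem_cgroup_update_tree()), which walks all ancestors.
 */
static bool soft_limit_check_sketch(atomic_t *events)
{
	if (atomic_inc_return(events) <= EVENT_COUNTER_THRESH)
		return false;
	atomic_set(events, 0);
	return true;
}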
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/res_counter.h | 6 --
kernel/res_counter.c | 18 ------
mm/memcontrol.c | 115 +++++++++++++++++++-------------------------
3 files changed, 55 insertions(+), 84 deletions(-)
Index: mmotm-2.6.31-Sep3/kernel/res_counter.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/kernel/res_counter.c
+++ mmotm-2.6.31-Sep3/kernel/res_counter.c
@@ -37,27 +37,17 @@ int res_counter_charge_locked(struct res
}
int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at,
- struct res_counter **soft_limit_fail_at)
+ struct res_counter **limit_fail_at)
{
int ret;
unsigned long flags;
struct res_counter *c, *u;
*limit_fail_at = NULL;
- if (soft_limit_fail_at)
- *soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
- /*
- * With soft limits, we return the highest ancestor
- * that exceeds its soft limit
- */
- if (soft_limit_fail_at &&
- !res_counter_soft_limit_check_locked(c))
- *soft_limit_fail_at = c;
spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
@@ -85,8 +75,7 @@ void res_counter_uncharge_locked(struct
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
- bool *was_soft_limit_excess)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
unsigned long flags;
struct res_counter *c;
@@ -94,9 +83,6 @@ void res_counter_uncharge(struct res_cou
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
- if (was_soft_limit_excess)
- *was_soft_limit_excess =
- !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
Index: mmotm-2.6.31-Sep3/include/linux/res_counter.h
===================================================================
--- mmotm-2.6.31-Sep3.orig/include/linux/res_counter.h
+++ mmotm-2.6.31-Sep3/include/linux/res_counter.h
@@ -114,8 +114,7 @@ void res_counter_init(struct res_counter
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at,
- struct res_counter **soft_limit_at);
+ unsigned long val, struct res_counter **limit_fail_at);
/*
* uncharge - tell that some portion of the resource is released
@@ -128,8 +127,7 @@ int __must_check res_counter_charge(stru
*/
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
- bool *was_soft_limit_excess);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val);
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
Index: mmotm-2.6.31-Sep3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep3/mm/memcontrol.c
@@ -353,16 +353,6 @@ __mem_cgroup_remove_exceeded(struct mem_
}
static void
-mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_insert_exceeded(mem, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-static void
mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
struct mem_cgroup_tree_per_zone *mctz)
@@ -392,34 +382,40 @@ static bool mem_cgroup_soft_limit_check(
static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
{
- unsigned long long prev_usage_in_excess, new_usage_in_excess;
- bool updated_tree = false;
+ unsigned long long new_usage_in_excess;
struct mem_cgroup_per_zone *mz;
struct mem_cgroup_tree_per_zone *mctz;
-
- mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+ int nid = page_to_nid(page);
+ int zid = page_zonenum(page);
mctz = soft_limit_tree_from_page(page);
/*
- * We do updates in lazy mode, mem's are removed
- * lazily from the per-zone, per-node rb tree
+ * Necessary to update all ancestors when hierarchy is used.
+ * because their event counter is not touched.
*/
- prev_usage_in_excess = mz->usage_in_excess;
-
- new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
- if (prev_usage_in_excess) {
- mem_cgroup_remove_exceeded(mem, mz, mctz);
- updated_tree = true;
- }
- if (!new_usage_in_excess)
- goto done;
- mem_cgroup_insert_exceeded(mem, mz, mctz);
-
-done:
- if (updated_tree) {
- spin_lock(&mctz->lock);
- mz->usage_in_excess = new_usage_in_excess;
- spin_unlock(&mctz->lock);
+ for (; mem; mem = parent_mem_cgroup(mem)) {
+ mz = mem_cgroup_zoneinfo(mem, nid, zid);
+ new_usage_in_excess =
+ res_counter_soft_limit_excess(&mem->res);
+ /*
+ * We have to update the tree if mz is on RB-tree or
+ * mem is over its softlimit.
+ */
+ if (new_usage_in_excess || mz->on_tree) {
+ spin_lock(&mctz->lock);
+ /* if on-tree, remove it */
+ if (mz->on_tree)
+ __mem_cgroup_remove_exceeded(mem, mz, mctz);
+ /*
+ * if over soft limit, insert again. mz->usage_in_excess
+ * will be updated properly.
+ */
+ if (new_usage_in_excess)
+ __mem_cgroup_insert_exceeded(mem, mz, mctz);
+ else
+ mz->usage_in_excess = 0;
+ spin_unlock(&mctz->lock);
+ }
}
}
@@ -1270,9 +1266,9 @@ static int __mem_cgroup_try_charge(struc
gfp_t gfp_mask, struct mem_cgroup **memcg,
bool oom, struct page *page)
{
- struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
+ struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct res_counter *fail_res, *soft_fail_res = NULL;
+ struct res_counter *fail_res;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1304,17 +1300,16 @@ static int __mem_cgroup_try_charge(struc
if (mem_cgroup_is_root(mem))
goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
- &soft_fail_res);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res, NULL);
+ &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1353,16 +1348,11 @@ static int __mem_cgroup_try_charge(struc
}
}
/*
- * Insert just the ancestor, we should trickle down to the correct
- * cgroup for reclaim, since the other nodes will be below their
- * soft limit
- */
- if (soft_fail_res) {
- mem_over_soft_limit =
- mem_cgroup_from_res_counter(soft_fail_res, res);
- if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
- mem_cgroup_update_tree(mem_over_soft_limit, page);
- }
+ * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
+ * if they exceeds softlimit.
+ */
+ if (mem_cgroup_soft_limit_check(mem))
+ mem_cgroup_update_tree(mem, page);
done:
return 0;
nomem:
@@ -1437,10 +1427,9 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE,
- NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
css_put(&mem->css);
return;
@@ -1519,7 +1508,7 @@ static int mem_cgroup_move_account(struc
goto out;
if (!mem_cgroup_is_root(from))
- res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
page = pc->page;
@@ -1539,7 +1528,7 @@ static int mem_cgroup_move_account(struc
}
if (do_swap_account && !mem_cgroup_is_root(from))
- res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
css_put(&from->css);
css_get(&to->css);
@@ -1610,9 +1599,9 @@ uncharge:
css_put(&parent->css);
/* uncharge if move fails */
if (!mem_cgroup_is_root(parent)) {
- res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
}
return ret;
}
@@ -1803,8 +1792,7 @@ __mem_cgroup_commit_charge_swapin(struct
* calling css_tryget
*/
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
- NULL);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
@@ -1831,9 +1819,9 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
css_put(&mem->css);
}
@@ -1848,7 +1836,6 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
- bool soft_limit_excess = false;
if (mem_cgroup_disabled())
return NULL;
@@ -1888,10 +1875,10 @@ __mem_cgroup_uncharge_common(struct page
}
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account &&
(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
@@ -1908,7 +1895,7 @@ __mem_cgroup_uncharge_common(struct page
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
- if (soft_limit_excess && mem_cgroup_soft_limit_check(mem))
+ if (mem_cgroup_soft_limit_check(mem))
mem_cgroup_update_tree(mem, page);
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
@@ -1986,7 +1973,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
* This memcg can be obsolete one. We avoid calling css_tryget
*/
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
* [RFC][PATCH 2/4][mmotm] clean up charge path of softlimit
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
2009-09-09 8:41 ` [RFC][PATCH 1/4][mmotm] memcg: soft limit clean up KAMEZAWA Hiroyuki
@ 2009-09-09 8:41 ` KAMEZAWA Hiroyuki
2009-09-09 8:44 ` [RFC][PATCH 3/4][mmotm] memcg: batched uncharge KAMEZAWA Hiroyuki
` (3 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-09 8:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
In the charge/uncharge/reclaim paths, usage_in_excess is calculated repeatedly,
and each calculation takes res_counter's spinlock.
This patch removes unnecessary calls to res_counter_soft_limit_excess().
Changelog:
- fixed description.
- fixed unsigned long to be unsigned long long (Thanks, Nishimura)
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
Index: mmotm-2.6.31-Sep3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep3/mm/memcontrol.c
@@ -313,7 +313,8 @@ soft_limit_tree_from_page(struct page *p
static void
__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
+ struct mem_cgroup_tree_per_zone *mctz,
+ unsigned long long new_usage_in_excess)
{
struct rb_node **p = &mctz->rb_root.rb_node;
struct rb_node *parent = NULL;
@@ -322,7 +323,9 @@ __mem_cgroup_insert_exceeded(struct mem_
if (mz->on_tree)
return;
- mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ mz->usage_in_excess = new_usage_in_excess;
+ if (!mz->usage_in_excess)
+ return;
while (*p) {
parent = *p;
mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
@@ -382,7 +385,7 @@ static bool mem_cgroup_soft_limit_check(
static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
{
- unsigned long long new_usage_in_excess;
+ unsigned long long excess;
struct mem_cgroup_per_zone *mz;
struct mem_cgroup_tree_per_zone *mctz;
int nid = page_to_nid(page);
@@ -395,25 +398,21 @@ static void mem_cgroup_update_tree(struc
*/
for (; mem; mem = parent_mem_cgroup(mem)) {
mz = mem_cgroup_zoneinfo(mem, nid, zid);
- new_usage_in_excess =
- res_counter_soft_limit_excess(&mem->res);
+ excess = res_counter_soft_limit_excess(&mem->res);
/*
* We have to update the tree if mz is on RB-tree or
* mem is over its softlimit.
*/
- if (new_usage_in_excess || mz->on_tree) {
+ if (excess || mz->on_tree) {
spin_lock(&mctz->lock);
/* if on-tree, remove it */
if (mz->on_tree)
__mem_cgroup_remove_exceeded(mem, mz, mctz);
/*
- * if over soft limit, insert again. mz->usage_in_excess
- * will be updated properly.
+ * Insert again. mz->usage_in_excess will be updated.
+ * If excess is 0, no tree ops.
*/
- if (new_usage_in_excess)
- __mem_cgroup_insert_exceeded(mem, mz, mctz);
- else
- mz->usage_in_excess = 0;
+ __mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
spin_unlock(&mctz->lock);
}
}
@@ -2220,6 +2219,7 @@ unsigned long mem_cgroup_soft_limit_recl
unsigned long reclaimed;
int loop = 0;
struct mem_cgroup_tree_per_zone *mctz;
+ unsigned long long excess;
if (order > 0)
return 0;
@@ -2269,9 +2269,8 @@ unsigned long mem_cgroup_soft_limit_recl
break;
} while (1);
}
- mz->usage_in_excess =
- res_counter_soft_limit_excess(&mz->mem->res);
__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+ excess = res_counter_soft_limit_excess(&mz->mem->res);
/*
* One school of thought says that we should not add
* back the node to the tree if reclaim returns 0.
@@ -2280,8 +2279,8 @@ unsigned long mem_cgroup_soft_limit_recl
* memory to reclaim from. Consider this as a longer
* term TODO.
*/
- if (mz->usage_in_excess)
- __mem_cgroup_insert_exceeded(mz->mem, mz, mctz);
+ /* If excess == 0, no tree ops */
+ __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
spin_unlock(&mctz->lock);
css_put(&mz->mem->css);
loop++;
* [RFC][PATCH 3/4][mmotm] memcg: batched uncharge
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
2009-09-09 8:41 ` [RFC][PATCH 1/4][mmotm] memcg: soft limit clean up KAMEZAWA Hiroyuki
2009-09-09 8:41 ` [RFC][PATCH 2/4][mmotm] clean up charge path of softlimit KAMEZAWA Hiroyuki
@ 2009-09-09 8:44 ` KAMEZAWA Hiroyuki
2009-09-09 8:45 ` [RFC][PATCH 4/4][mmotm] memcg: coalescing charge KAMEZAWA Hiroyuki
` (2 subsequent siblings)
5 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-09 8:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
In a massively parallel environment, res_counter can be a performance
bottleneck. This patch is an attempt to reduce that lock contention.
One strong technique for reducing lock contention is to reduce the number of
calls by batching several calls into one.
Considering charge/uncharge characteristics,
- charge is done one by one via demand paging.
- uncharge is done
- in chunks at munmap, truncate, exit, execve...
- one by one via vmscan/paging.
It seems we have a chance to batch uncharges.
This patch is the base patch for batched uncharge. To avoid scattering
memcg's structures, it adds memcg batch-uncharge information to the task.
Please see the start/end usage in the next patch.
The degree of coalescing depends on the callers:
- at invalidate/truncate... pagevec size
- at unmap... ZAP_BLOCK_SIZE
(memory itself is freed at this granularity; see the usage sketch below.)
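The usage sketch mentioned above: callers just bracket a run of per-page
uncharges with start/end. The freeing function below is made up for
illustration; only the mem_cgroup_uncharge_* calls are from this patch, and
the real call sites are the mm/truncate.c and mm/memory.c hunks below.

/* illustration only: a hypothetical caller of the new batch API */
static void free_page_batch(struct page **pages, int nr)
{
	int i;

	mem_cgroup_uncharge_start();	/* do_batch++; on 0->1, reset the
					 * per-task batch info */
	for (i = 0; i < nr; i++)
		/* charges accumulate in current->memcg_batch instead of
		 * hitting res_counter for every page */
		mem_cgroup_uncharge_page(pages[i]);
	mem_cgroup_uncharge_end();	/* do_batch--; at 0, flush the
					 * accumulated pages/memsw to
					 * res_counter in one call */
}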
Changelog: v3
- fixed the invalidate_inode_pages2() path. For this, do_batch is
an 'int', not a 'bool'.
Changelog: v2
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 13 ++++++
include/linux/sched.h | 7 +++
mm/memcontrol.c | 89 +++++++++++++++++++++++++++++++++++++++++----
mm/memory.c | 2 +
mm/truncate.c | 6 +++
5 files changed, 111 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep3/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep3.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep3/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharge for reducing memcg' overhead*/
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
{
}
+static inline void mem_cgroup_uncharge_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_end(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Sep3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep3/mm/memcontrol.c
@@ -1825,6 +1825,49 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}
+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw = true;
+ /* If swapout, usage of swap doesn't decrease */
+ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ uncharge_memsw = false;
+ /*
+ * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+ * In those cases, all pages freed continuously can be expected to be in
+ * the same cgroup and we have a chance to coalesce uncharges.
+ * And, we do uncharge one by one if this task is killed by OOM.
+ */
+ if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+ goto direct_uncharge;
+
+ batch = &current->memcg_batch;
+ /*
+ * In usual, we do css_get() when we remember memcg pointer.
+ * But in this case, we keep res->usage until end of a series of
+ * uncharges. Then, it's ok to ignore memcg's refcnt.
+ */
+ if (!batch->memcg)
+ batch->memcg = mem;
+ /*
+ * In the typical case, batch->memcg == mem. This means we can
+ * merge a series of uncharges into one uncharge of res_counter.
+ * If not, we uncharge res_counter one by one.
+ */
+ if (batch->memcg != mem)
+ goto direct_uncharge;
+ /* remember freed charge and uncharge it later */
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ return;
+direct_uncharge:
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ return;
+}
/*
* uncharge if !page_mapped(page)
@@ -1873,12 +1916,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1924,6 +1963,46 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+/*
+ * batch_start/batch_end is called in unmap_page_range/invalidate/truncate.
+ * In those cases, pages are freed continuously and we can expect the pages
+ * to be in the same memcg. All these callers themselves limit the number of
+ * pages freed at once, so uncharge_start/end() is called properly.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+ if (!current->memcg_batch.do_batch) {
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+ }
+ current->memcg_batch.do_batch++;
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch--;
+ if (current->memcg_batch.do_batch) /* Nested ? */
+ return;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ /* This "mem" is valid because we hide charges behind us. */
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* Not necessary. but forget this pointer */
+ current->memcg_batch.memcg = NULL;
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep3/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep3.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep3/include/linux/sched.h
@@ -1548,6 +1548,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ int do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep3/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/memory.c
+++ mmotm-2.6.31-Sep3/mm/memory.c
@@ -922,6 +922,7 @@ static unsigned long unmap_page_range(st
details = NULL;
BUG_ON(addr >= end);
+ mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -934,6 +935,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_end();
return addr;
}
Index: mmotm-2.6.31-Sep3/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep3/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
* [RFC][PATCH 4/4][mmotm] memcg: coalescing charge
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2009-09-09 8:44 ` [RFC][PATCH 3/4][mmotm] memcg: batched uncharge KAMEZAWA Hiroyuki
@ 2009-09-09 8:45 ` KAMEZAWA Hiroyuki
2009-09-12 4:58 ` Daisuke Nishimura
2009-09-09 20:30 ` [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 Balbir Singh
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
5 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-09 8:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This is a patch for coalescing access to res_counter at charge time by using
a per-cpu cache. At charge, memcg charges 64 pages at once and remembers the
surplus in the per-cpu cache. Because it is a cache, it is drained/flushed
when necessary.
This version uses a public per-cpu area.
Two benefits of using the public per-cpu area:
1. The total stocked charge in the system is bounded by the number of CPUs,
not by the number of memcgs. This gives better synchronization.
2. The drain code for flush/cpu-hotplug is very easy (and quick).
The most important point of this patch is that we never touch res_counter in
the fast path. res_counter is a system-wide shared counter which is modified
very frequently; we should avoid touching it as much as we can to prevent
false sharing.
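In outline, the charge path after this patch looks like the sketch below
(simplified from the __mem_cgroup_try_charge() hunk that follows; the memsw
counter, reclaim and OOM handling are omitted).

/* simplified outline only; see the real hunk below for the full logic */
static int charge_one_page_outline(struct mem_cgroup *mem)
{
	struct res_counter *fail_res;
	int csize = CHARGE_SIZE;	/* 64 * PAGE_SIZE */

	/* fast path: take one page from this CPU's stock, no res_counter */
	if (consume_stock(mem))
		return 0;

	/* slow path: charge 64 pages at once against res_counter */
	if (res_counter_charge(&mem->res, csize, &fail_res))
		return -ENOMEM;	/* the real code retries with PAGE_SIZE
				 * and goes into reclaim instead */

	/* keep the page we need, stock the other 63 pages for this CPU */
	refill_stock(mem, csize - PAGE_SIZE);
	return 0;
}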
Changelog:
- added asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 117 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep3/mm/memcontrol.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include "internal.h"
#include <asm/uaccess.h>
@@ -275,6 +276,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void drain_all_stock_async(void);
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1136,6 +1138,8 @@ static int mem_cgroup_hierarchical_recla
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
loop++;
+ if (loop >= 1)
+ drain_all_stock_async();
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1257,6 +1261,102 @@ done:
unlock_page_cgroup(pc);
}
+#define CHARGE_SIZE (64 * PAGE_SIZE)
+struct memcg_stock_pcp {
+ struct mem_cgroup *cached;
+ int charge;
+ struct work_struct work;
+};
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_MUTEX(memcg_drain_mutex);
+
+static bool consume_stock(struct mem_cgroup *mem)
+{
+ struct memcg_stock_pcp *stock;
+ bool ret = true;
+
+ stock = &get_cpu_var(memcg_stock);
+ if (mem == stock->cached && stock->charge)
+ stock->charge -= PAGE_SIZE;
+ else
+ ret = false;
+ put_cpu_var(memcg_stock);
+ return ret;
+}
+
+static void drain_stock(struct memcg_stock_pcp *stock)
+{
+ struct mem_cgroup *old = stock->cached;
+
+ if (stock->charge) {
+ res_counter_uncharge(&old->res, stock->charge);
+ if (do_swap_account)
+ res_counter_uncharge(&old->memsw, stock->charge);
+ }
+ stock->cached = NULL;
+ stock->charge = 0;
+}
+
+static void drain_local_stock(struct work_struct *dummy)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+ drain_stock(stock);
+ put_cpu_var(memcg_stock);
+}
+
+static void refill_stock(struct mem_cgroup *mem, int val)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+
+ if (stock->cached != mem) {
+ drain_stock(stock);
+ stock->cached = mem;
+ }
+ stock->charge += val;
+ put_cpu_var(memcg_stock);
+}
+
+static void drain_all_stock_async(void)
+{
+ int cpu;
+ /* Contention means someone tries to flush. */
+ if (!mutex_trylock(&memcg_drain_mutex))
+ return;
+ for_each_online_cpu(cpu) {
+ struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+ if (work_pending(&stock->work))
+ continue;
+ INIT_WORK(&stock->work, drain_local_stock);
+ schedule_work_on(cpu, &stock->work);
+ }
+ mutex_unlock(&memcg_drain_mutex);
+ /* We don't wait for flush_work */
+}
+
+static void drain_all_stock_sync(void)
+{
+ /* called when force_empty is called */
+ mutex_lock(&memcg_drain_mutex);
+ schedule_on_each_cpu(drain_local_stock);
+ mutex_unlock(&memcg_drain_mutex);
+}
+
+static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
+ unsigned long action,
+ void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpu = (unsigned long)hcpu;
+ struct memcg_stock_pcp *stock;
+
+ if (action != CPU_DEAD)
+ return NOTIFY_OK;
+ stock = &per_cpu(memcg_stock, cpu);
+ drain_stock(stock);
+#endif
+ return NOTIFY_OK;
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -1268,6 +1368,7 @@ static int __mem_cgroup_try_charge(struc
struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct res_counter *fail_res;
+ int csize = CHARGE_SIZE;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1292,23 +1393,25 @@ static int __mem_cgroup_try_charge(struc
return 0;
VM_BUG_ON(css_is_removed(&mem->css));
+ if (mem_cgroup_is_root(mem))
+ goto done;
while (1) {
int ret = 0;
unsigned long flags = 0;
- if (mem_cgroup_is_root(mem))
- goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ if (consume_stock(mem))
+ goto charged;
+
+ ret = res_counter_charge(&mem->res, csize, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
- ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ ret = res_counter_charge(&mem->memsw, csize, &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, csize);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1320,6 +1423,9 @@ static int __mem_cgroup_try_charge(struc
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
+ /* we don't make stocks if failed */
+ csize = PAGE_SIZE;
+
ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
gfp_mask, flags);
if (ret)
@@ -1346,6 +1452,9 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
+ if (csize > PAGE_SIZE)
+ refill_stock(mem, csize - PAGE_SIZE);
+charged:
/*
* Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
* if they exceeds softlimit.
@@ -2460,6 +2569,7 @@ move_account:
goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
+ drain_all_stock_sync();
ret = 0;
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -3178,6 +3288,7 @@ mem_cgroup_create(struct cgroup_subsys *
root_mem_cgroup = mem;
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
+ hotcpu_notifier(memcg_stock_cpu_callback, 0);
} else {
parent = mem_cgroup_from_cont(cont->parent);
* Re: [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2009-09-09 8:45 ` [RFC][PATCH 4/4][mmotm] memcg: coalescing charge KAMEZAWA Hiroyuki
@ 2009-09-09 20:30 ` Balbir Singh
2009-09-10 0:20 ` KAMEZAWA Hiroyuki
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
5 siblings, 1 reply; 24+ messages in thread
From: Balbir Singh @ 2009-09-09 20:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-09 17:39:03]:
> This patch series is for reducing memcg's lock contention on res_counter, v3.
> (sending today just for reporting current status in my stack.)
>
> It's reported that memcg's res_counter can cause heavy false sharing / lock
> conention and scalability is not good. This is for relaxing that.
> No terrible bugs are found, I'll maintain/update this until the end of next
> merge window. Tests on big-smp and new-good-idea are welcome.
>
> This patch is on to mmotm+Nishimura's fix + Hugh's get_user_pages() patch.
> But can be applied directly against mmotm, I think.
>
> numbers:
>
> I used 8cpu x86-64 box and run make -j 12 kernel.
> Before make, make clean and drop_caches.
>
Kamezawa-San
I was able to test on a 24-way machine using my parallel page fault test
program, and here is what I see:
Performance counter stats for '/home/balbir/parallel_pagefault' (3 runs):
7191673.834385 task-clock-msecs # 23.953 CPUs ( +- 0.001% )
427765 context-switches # 0.000 M/sec ( +- 0.106% )
234 CPU-migrations # 0.000 M/sec ( +- 20.851% )
87975343 page-faults # 0.012 M/sec ( +- 0.347% )
5962193345280 cycles # 829.041 M/sec ( +- 0.012% )
1009132401195 instructions # 0.169 IPC ( +- 0.059% )
10068652670 cache-references # 1.400 M/sec ( +- 2.581% )
2053688394 cache-misses # 0.286 M/sec ( +- 0.481% )
300.238748326 seconds time elapsed ( +- 0.001% )
Without the patch I saw
Performance counter stats for '/home/balbir/parallel_pagefault' (3 runs):
7198364.596593 task-clock-msecs # 23.959 CPUs ( +- 0.004% )
425104 context-switches # 0.000 M/sec ( +- 0.244% )
157 CPU-migrations # 0.000 M/sec ( +- 13.291% )
28964117 page-faults # 0.004 M/sec ( +- 0.106% )
5786854402292 cycles # 803.912 M/sec ( +- 0.013% )
835828892399 instructions # 0.144 IPC ( +- 0.073% )
6240606753 cache-references # 0.867 M/sec ( +- 1.058% )
2068445332 cache-misses # 0.287 M/sec ( +- 1.844% )
300.443366784 seconds time elapsed ( +- 0.005% )
This does look like a very good improvement.
--
Balbir
* Re: [RFC][PATCH 1/4][mmotm] memcg: soft limit clean up
[not found] ` <661de9470909090410t160454a2k658c980b92d11612@mail.gmail.com>
@ 2009-09-10 0:10 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-10 0:10 UTC (permalink / raw)
To: Balbir Singh; +Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp
On Wed, 9 Sep 2009 16:40:03 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> On Wed, Sep 9, 2009 at 2:11 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > This patch clean up/fixes for memcg's uncharge soft limit path.
> > This is also a preparation for batched uncharge.
> >
> > Problems:
> > Now, res_counter_charge()/uncharge() handles softlimit information at
> > charge/uncharge and softlimit-check is done when event counter per memcg
> > goes over limit. But event counter per memcg is updated only when
> > when memcg is over soft limit. But ancerstors are handled in charge path
> > but not in uncharge path.
> > For batched charge/uncharge, event counter check should be more strict.
> >
> > Prolems:
> >
>
> typo, should be Problems
>
yes..
>
> > 1. memcg's event counter incremented only when softlimit hits. That's bad.
> > It makes event counter hard to be reused for other purpose.
> >
>
> I don't understand the context, are these existing problems?
>
"event" counter is useful for other purposes as
- a memory usage threshold notifier or something fancy.
So I think charge/uncharge events should be counted at every charge/uncharge.
Right now, a charge/uncharge event is counted only when soft_fail_res != NULL.
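(Purely as an illustration of that reuse, nothing like this exists in the
series; every identifier below except res_counter_read_u64()/RES_USAGE is
hypothetical.)

/* hypothetical sketch: reuse the "check once per N charge/uncharge events"
 * pattern for a usage-threshold notification instead of the soft-limit
 * tree update */
static void check_usage_threshold_sketch(struct mem_cgroup *mem, u64 thresh)
{
	/* hypothetical helper: returns true once per N events, like
	 * mem_cgroup_soft_limit_check() does for the soft limit */
	if (!memcg_event_ratelimit(mem))
		return;

	if (res_counter_read_u64(&mem->res, RES_USAGE) > thresh)
		notify_usage_threshold(mem);	/* hypothetical notifier */
}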
> 2. At uncharge, only the lowest level rescounter is handled. This is bug.
> > Because ancesotor's event counter is not incremented, children should
> > take care of them.
> > 3. res_counter_uncharge()'s 3rd argument is NULL in most case.
> > ops under res_counter->lock should be small. No "if" sentense is
> > better.
> >
> > Fixes:
> > * Removed soft_limit_xx poitner and checsk from charge and uncharge.
> >
>
> typo should be soft_limit_xxx_pointer and check
>
Will fix.
Thank you for the review. I'll brush it up.
Thanks,
-Kame
>
> > Do-check-only-when-necessary scheme works enough well without them.
> >
> > * make event-counter of memcg checked at every charge/uncharge.
> > (per-cpu area will be accessed soon anyway)
> >
> > * All ancestors are checked at soft-limit-check. This is necessary because
> > ancesotor's event counter may never be modified. Then, they should be
> > checked at the same time.
> >
> > Todo;
> > We may need to modify EVENT_COUNTER_THRESH of parent with many children.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > ---
> > include/linux/res_counter.h | 6 --
> > kernel/res_counter.c | 18 ------
> > mm/memcontrol.c | 115
> > +++++++++++++++++++-------------------------
> > 3 files changed, 55 insertions(+), 84 deletions(-)
> >
> > Index: mmotm-2.6.31-Sep3/kernel/res_counter.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep3.orig/kernel/res_counter.c
> > +++ mmotm-2.6.31-Sep3/kernel/res_counter.c
> > @@ -37,27 +37,17 @@ int res_counter_charge_locked(struct res
> > }
> >
> > int res_counter_charge(struct res_counter *counter, unsigned long val,
> > - struct res_counter **limit_fail_at,
> > - struct res_counter **soft_limit_fail_at)
> > + struct res_counter **limit_fail_at)
> > {
> > int ret;
> > unsigned long flags;
> > struct res_counter *c, *u;
> >
> > *limit_fail_at = NULL;
> > - if (soft_limit_fail_at)
> > - *soft_limit_fail_at = NULL;
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > ret = res_counter_charge_locked(c, val);
> > - /*
> > - * With soft limits, we return the highest ancestor
> > - * that exceeds its soft limit
> > - */
> > - if (soft_limit_fail_at &&
> > - !res_counter_soft_limit_check_locked(c))
> > - *soft_limit_fail_at = c;
> > spin_unlock(&c->lock);
> > if (ret < 0) {
> > *limit_fail_at = c;
> > @@ -85,8 +75,7 @@ void res_counter_uncharge_locked(struct
> > counter->usage -= val;
> > }
> >
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > - bool *was_soft_limit_excess)
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val)
> > {
> > unsigned long flags;
> > struct res_counter *c;
> > @@ -94,9 +83,6 @@ void res_counter_uncharge(struct res_cou
> > local_irq_save(flags);
> > for (c = counter; c != NULL; c = c->parent) {
> > spin_lock(&c->lock);
> > - if (was_soft_limit_excess)
> > - *was_soft_limit_excess =
> > - !res_counter_soft_limit_check_locked(c);
> > res_counter_uncharge_locked(c, val);
> > spin_unlock(&c->lock);
> > }
> > Index: mmotm-2.6.31-Sep3/include/linux/res_counter.h
> > ===================================================================
> > --- mmotm-2.6.31-Sep3.orig/include/linux/res_counter.h
> > +++ mmotm-2.6.31-Sep3/include/linux/res_counter.h
> > @@ -114,8 +114,7 @@ void res_counter_init(struct res_counter
> > int __must_check res_counter_charge_locked(struct res_counter *counter,
> > unsigned long val);
> > int __must_check res_counter_charge(struct res_counter *counter,
> > - unsigned long val, struct res_counter **limit_fail_at,
> > - struct res_counter **soft_limit_at);
> > + unsigned long val, struct res_counter **limit_fail_at);
> >
> > /*
> > * uncharge - tell that some portion of the resource is released
> > @@ -128,8 +127,7 @@ int __must_check res_counter_charge(stru
> > */
> >
> > void res_counter_uncharge_locked(struct res_counter *counter, unsigned
> > long val);
> > -void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> > - bool *was_soft_limit_excess);
> > +void res_counter_uncharge(struct res_counter *counter, unsigned long val);
> >
> > static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> > {
> > Index: mmotm-2.6.31-Sep3/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.31-Sep3.orig/mm/memcontrol.c
> > +++ mmotm-2.6.31-Sep3/mm/memcontrol.c
> > @@ -353,16 +353,6 @@ __mem_cgroup_remove_exceeded(struct mem_
> > }
> >
> > static void
> > -mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
> > - struct mem_cgroup_per_zone *mz,
> > - struct mem_cgroup_tree_per_zone *mctz)
> > -{
> > - spin_lock(&mctz->lock);
> > - __mem_cgroup_insert_exceeded(mem, mz, mctz);
> > - spin_unlock(&mctz->lock);
> > -}
> > -
> > -static void
> > mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
> > struct mem_cgroup_per_zone *mz,
> > struct mem_cgroup_tree_per_zone *mctz)
> > @@ -392,34 +382,40 @@ static bool mem_cgroup_soft_limit_check(
> >
> > static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page
> > *page)
> > {
> > - unsigned long long prev_usage_in_excess, new_usage_in_excess;
> > - bool updated_tree = false;
> > + unsigned long long new_usage_in_excess;
> > struct mem_cgroup_per_zone *mz;
> > struct mem_cgroup_tree_per_zone *mctz;
> > -
> > - mz = mem_cgroup_zoneinfo(mem, page_to_nid(page),
> > page_zonenum(page));
> > + int nid = page_to_nid(page);
> > + int zid = page_zonenum(page);
> > mctz = soft_limit_tree_from_page(page);
> >
> > /*
> > - * We do updates in lazy mode, mem's are removed
> > - * lazily from the per-zone, per-node rb tree
> > + * Necessary to update all ancestors when hierarchy is used.
> > + * because their event counter is not touched.
> > */
> > - prev_usage_in_excess = mz->usage_in_excess;
> > -
> > - new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
> > - if (prev_usage_in_excess) {
> > - mem_cgroup_remove_exceeded(mem, mz, mctz);
> > - updated_tree = true;
> > - }
> > - if (!new_usage_in_excess)
> > - goto done;
> > - mem_cgroup_insert_exceeded(mem, mz, mctz);
> > -
> > -done:
> > - if (updated_tree) {
> > - spin_lock(&mctz->lock);
> > - mz->usage_in_excess = new_usage_in_excess;
> > - spin_unlock(&mctz->lock);
> > + for (; mem; mem = parent_mem_cgroup(mem)) {
> > + mz = mem_cgroup_zoneinfo(mem, nid, zid);
> > + new_usage_in_excess =
> > + res_counter_soft_limit_excess(&mem->res);
> > + /*
> > + * We have to update the tree if mz is on RB-tree or
> > + * mem is over its softlimit.
> > + */
> > + if (new_usage_in_excess || mz->on_tree) {
> > + spin_lock(&mctz->lock);
> > + /* if on-tree, remove it */
> > + if (mz->on_tree)
> > + __mem_cgroup_remove_exceeded(mem, mz,
> > mctz);
> > + /*
> > + * if over soft limit, insert again.
> > mz->usage_in_excess
> > + * will be updated properly.
> > + */
> > + if (new_usage_in_excess)
> > + __mem_cgroup_insert_exceeded(mem, mz,
> > mctz);
> > + else
> > + mz->usage_in_excess = 0;
> > + spin_unlock(&mctz->lock);
> > + }
> > }
> > }
> >
> > @@ -1270,9 +1266,9 @@ static int __mem_cgroup_try_charge(struc
> > gfp_t gfp_mask, struct mem_cgroup **memcg,
> > bool oom, struct page *page)
> > {
> > - struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
> > + struct mem_cgroup *mem, *mem_over_limit;
> > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > - struct res_counter *fail_res, *soft_fail_res = NULL;
> > + struct res_counter *fail_res;
> >
> > if (unlikely(test_thread_flag(TIF_MEMDIE))) {
> > /* Don't account this! */
> > @@ -1304,17 +1300,16 @@ static int __mem_cgroup_try_charge(struc
> >
> > if (mem_cgroup_is_root(mem))
> > goto done;
> > - ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
> > - &soft_fail_res);
> > + ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
> > if (likely(!ret)) {
> > if (!do_swap_account)
> > break;
> > ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
> > - &fail_res, NULL);
> > + &fail_res);
> > if (likely(!ret))
> > break;
> > /* mem+swap counter fails */
> > - res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> > mem_over_limit =
> > mem_cgroup_from_res_counter(fail_res,
> >
> > memsw);
> > @@ -1353,16 +1348,11 @@ static int __mem_cgroup_try_charge(struc
> > }
> > }
> > /*
> > - * Insert just the ancestor, we should trickle down to the correct
> > - * cgroup for reclaim, since the other nodes will be below their
> > - * soft limit
> > - */
> > - if (soft_fail_res) {
> > - mem_over_soft_limit =
> > - mem_cgroup_from_res_counter(soft_fail_res, res);
> > - if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
> > - mem_cgroup_update_tree(mem_over_soft_limit, page);
> > - }
> > + * Insert ancestor (and ancestor's ancestors), to softlimit
> > RB-tree.
> > + * if they exceeds softlimit.
> > + */
> > + if (mem_cgroup_soft_limit_check(mem))
> > + mem_cgroup_update_tree(mem, page);
> > done:
> > return 0;
> > nomem:
> > @@ -1437,10 +1427,9 @@ static void __mem_cgroup_commit_charge(s
> > if (unlikely(PageCgroupUsed(pc))) {
> > unlock_page_cgroup(pc);
> > if (!mem_cgroup_is_root(mem)) {
> > - res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > if (do_swap_account)
> > - res_counter_uncharge(&mem->memsw,
> > PAGE_SIZE,
> > - NULL);
> > + res_counter_uncharge(&mem->memsw,
> > PAGE_SIZE);
> > }
> > css_put(&mem->css);
> > return;
> > @@ -1519,7 +1508,7 @@ static int mem_cgroup_move_account(struc
> > goto out;
> >
> > if (!mem_cgroup_is_root(from))
> > - res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&from->res, PAGE_SIZE);
> > mem_cgroup_charge_statistics(from, pc, false);
> >
> > page = pc->page;
> > @@ -1539,7 +1528,7 @@ static int mem_cgroup_move_account(struc
> > }
> >
> > if (do_swap_account && !mem_cgroup_is_root(from))
> > - res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&from->memsw, PAGE_SIZE);
> > css_put(&from->css);
> >
> > css_get(&to->css);
> > @@ -1610,9 +1599,9 @@ uncharge:
> > css_put(&parent->css);
> > /* uncharge if move fails */
> > if (!mem_cgroup_is_root(parent)) {
> > - res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&parent->res, PAGE_SIZE);
> > if (do_swap_account)
> > - res_counter_uncharge(&parent->memsw, PAGE_SIZE,
> > NULL);
> > + res_counter_uncharge(&parent->memsw, PAGE_SIZE);
> > }
> > return ret;
> > }
> > @@ -1803,8 +1792,7 @@ __mem_cgroup_commit_charge_swapin(struct
> > * calling css_tryget
> > */
> > if (!mem_cgroup_is_root(memcg))
> > - res_counter_uncharge(&memcg->memsw,
> > PAGE_SIZE,
> > - NULL);
> > + res_counter_uncharge(&memcg->memsw,
> > PAGE_SIZE);
> > mem_cgroup_swap_statistics(memcg, false);
> > mem_cgroup_put(memcg);
> > }
> > @@ -1831,9 +1819,9 @@ void mem_cgroup_cancel_charge_swapin(str
> > if (!mem)
> > return;
> > if (!mem_cgroup_is_root(mem)) {
> > - res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > if (do_swap_account)
> > - res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > }
> > css_put(&mem->css);
> > }
> > @@ -1848,7 +1836,6 @@ __mem_cgroup_uncharge_common(struct page
> > struct page_cgroup *pc;
> > struct mem_cgroup *mem = NULL;
> > struct mem_cgroup_per_zone *mz;
> > - bool soft_limit_excess = false;
> >
> > if (mem_cgroup_disabled())
> > return NULL;
> > @@ -1888,10 +1875,10 @@ __mem_cgroup_uncharge_common(struct page
> > }
> >
> > if (!mem_cgroup_is_root(mem)) {
> > - res_counter_uncharge(&mem->res, PAGE_SIZE,
> > &soft_limit_excess);
> > + res_counter_uncharge(&mem->res, PAGE_SIZE);
> > if (do_swap_account &&
> > (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
> > - res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
> > + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > }
> > if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > mem_cgroup_swap_statistics(mem, true);
> > @@ -1908,7 +1895,7 @@ __mem_cgroup_uncharge_common(struct page
> > mz = page_cgroup_zoneinfo(pc);
> > unlock_page_cgroup(pc);
> >
> > - if (soft_limit_excess && mem_cgroup_soft_limit_check(mem))
> > + if (mem_cgroup_soft_limit_check(mem))
> > mem_cgroup_update_tree(mem, page);
> > /* at swapout, this memcg will be accessed to record to swap */
> > if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
> > @@ -1986,7 +1973,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
> > * This memcg can be obsolete one. We avoid calling
> > css_tryget
> > */
> > if (!mem_cgroup_is_root(memcg))
> > - res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
> > NULL);
> > + res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
> > mem_cgroup_swap_statistics(memcg, false);
> > mem_cgroup_put(memcg);
> > }
> >
> >
>
* Re: [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3
2009-09-09 20:30 ` [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 Balbir Singh
@ 2009-09-10 0:20 ` KAMEZAWA Hiroyuki
2009-09-10 5:18 ` Balbir Singh
0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-10 0:20 UTC (permalink / raw)
To: balbir; +Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp
On Thu, 10 Sep 2009 02:00:42 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-09 17:39:03]:
>
> > This patch series is for reducing memcg's lock contention on res_counter, v3.
> > (sending today just for reporting current status in my stack.)
> >
> > It's reported that memcg's res_counter can cause heavy false sharing / lock
> > conention and scalability is not good. This is for relaxing that.
> > No terrible bugs are found, I'll maintain/update this until the end of next
> > merge window. Tests on big-smp and new-good-idea are welcome.
> >
> > This patch is on to mmotm+Nishimura's fix + Hugh's get_user_pages() patch.
> > But can be applied directly against mmotm, I think.
> >
> > numbers:
> >
> > I used 8cpu x86-64 box and run make -j 12 kernel.
> > Before make, make clean and drop_caches.
> >
>
> Kamezawa-San
>
> I was able to test on a 24 way using my parallel page fault test
> program and here is what I see
>
thank you.
> Performance counter stats for '/home/balbir/parallel_pagefault' (3
> runs):
>
> 7191673.834385 task-clock-msecs # 23.953 CPUs ( +- 0.001% )
> 427765 context-switches # 0.000 M/sec ( +- 0.106% )
> 234 CPU-migrations # 0.000 M/sec ( +- 20.851% )
> 87975343 page-faults # 0.012 M/sec ( +- 0.347% )
> 5962193345280 cycles # 829.041 M/sec ( +- 0.012% )
> 1009132401195 instructions # 0.169 IPC ( +- 0.059% )
> 10068652670 cache-references # 1.400 M/sec ( +- 2.581% )
> 2053688394 cache-misses # 0.286 M/sec ( +- 0.481% )
>
> 300.238748326 seconds time elapsed ( +- 0.001% )
>
> Without the patch I saw
>
> Performance counter stats for '/home/balbir/parallel_pagefault' (3
> runs):
>
> 7198364.596593 task-clock-msecs # 23.959 CPUs ( +- 0.004% )
> 425104 context-switches # 0.000 M/sec ( +- 0.244% )
> 157 CPU-migrations # 0.000 M/sec ( +- 13.291% )
> 28964117 page-faults # 0.004 M/sec ( +- 0.106% )
> 5786854402292 cycles # 803.912 M/sec ( +- 0.013% )
> 835828892399 instructions # 0.144 IPC ( +- 0.073% )
> 6240606753 cache-references # 0.867 M/sec ( +- 1.058% )
> 2068445332 cache-misses # 0.287 M/sec ( +- 1.844% )
>
> 300.443366784 seconds time elapsed ( +- 0.005% )
>
>
> This does look like a very good improvement.
>
Seems good.
BTW, why is the number of page faults after the patch 3 times bigger than
before the patch? Does the difference in the number of instructions account
for it?
Thanks,
-Kame
* Re: [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3
2009-09-10 0:20 ` KAMEZAWA Hiroyuki
@ 2009-09-10 5:18 ` Balbir Singh
0 siblings, 0 replies; 24+ messages in thread
From: Balbir Singh @ 2009-09-10 5:18 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-10 09:20:17]:
> On Thu, 10 Sep 2009 02:00:42 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-09 17:39:03]:
> >
> > > This patch series is for reducing memcg's lock contention on res_counter, v3.
> > > (sending today just for reporting current status in my stack.)
> > >
> > > It's reported that memcg's res_counter can cause heavy false sharing / lock
> > > conention and scalability is not good. This is for relaxing that.
> > > No terrible bugs are found, I'll maintain/update this until the end of next
> > > merge window. Tests on big-smp and new-good-idea are welcome.
> > >
> > > This patch is on to mmotm+Nishimura's fix + Hugh's get_user_pages() patch.
> > > But can be applied directly against mmotm, I think.
> > >
> > > numbers:
> > >
> > > I used 8cpu x86-64 box and run make -j 12 kernel.
> > > Before make, make clean and drop_caches.
> > >
> >
> > Kamezawa-San
> >
> > I was able to test on a 24 way using my parallel page fault test
> > program and here is what I see
> >
> thank you.
>
> > Performance counter stats for '/home/balbir/parallel_pagefault' (3
> > runs):
> >
> > 7191673.834385 task-clock-msecs # 23.953 CPUs ( +- 0.001% )
> > 427765 context-switches # 0.000 M/sec ( +- 0.106% )
> > 234 CPU-migrations # 0.000 M/sec ( +- 20.851% )
> > 87975343 page-faults # 0.012 M/sec ( +- 0.347% )
> > 5962193345280 cycles # 829.041 M/sec ( +- 0.012% )
> > 1009132401195 instructions # 0.169 IPC ( +- 0.059% )
> > 10068652670 cache-references # 1.400 M/sec ( +- 2.581% )
> > 2053688394 cache-misses # 0.286 M/sec ( +- 0.481% )
> >
> > 300.238748326 seconds time elapsed ( +- 0.001% )
> >
> > Without the patch I saw
> >
> > Performance counter stats for '/home/balbir/parallel_pagefault' (3
> > runs):
> >
> > 7198364.596593 task-clock-msecs # 23.959 CPUs ( +- 0.004% )
> > 425104 context-switches # 0.000 M/sec ( +- 0.244% )
> > 157 CPU-migrations # 0.000 M/sec ( +- 13.291% )
> > 28964117 page-faults # 0.004 M/sec ( +- 0.106% )
> > 5786854402292 cycles # 803.912 M/sec ( +- 0.013% )
> > 835828892399 instructions # 0.144 IPC ( +- 0.073% )
> > 6240606753 cache-references # 0.867 M/sec ( +- 1.058% )
> > 2068445332 cache-misses # 0.287 M/sec ( +- 1.844% )
> >
> > 300.443366784 seconds time elapsed ( +- 0.005% )
> >
> >
> > This does look like a very good improvement.
> >
> Seems good.
> BTW, why is the number of page-faults after the patch 3 times bigger than
> the one before the patch? Does the difference in the number of instructions account for it?
>
This is a page fault program whose sole goal is to keep page faulting
in parallel. In the case before the patch, we find that the resource counter
gets in the way; in the case after the patch, the resource counters allow more
page faults in the same amount of time.
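For anyone who wants to generate a similar load, a minimal sketch of such a
benchmark could look like the following. This is only an illustration, not the
actual parallel_pagefault program referenced above; the thread count, mapping
size and iteration count are made-up values.

/*
 * Illustrative parallel page-fault load generator (sketch only).
 * Each worker repeatedly maps anonymous memory, touches one byte per
 * page (each touch is a minor fault that charges the memcg), then
 * unmaps it again (uncharging everything).
 *
 * Build: gcc -O2 -pthread -o pf_sketch pf_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NR_THREADS	24		/* e.g. one per CPU on a 24-way box */
#define MAP_BYTES	(64UL << 20)	/* 64MB per iteration */
#define ITERATIONS	100

static void *worker(void *arg)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned long off;
	int i;

	for (i = 0; i < ITERATIONS; i++) {
		char *buf = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return NULL;
		}
		for (off = 0; off < MAP_BYTES; off += page)
			buf[off] = 1;	/* one minor fault per page */
		munmap(buf, MAP_BYTES);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Run inside a memory cgroup, this kind of workload is where the res_counter
spinlock shows up with and without the series.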
--
Balbir
* Re: [RFC][PATCH 4/4][mmotm] memcg: coalescing charge
2009-09-09 8:45 ` [RFC][PATCH 4/4][mmotm] memcg: coalescing charge KAMEZAWA Hiroyuki
@ 2009-09-12 4:58 ` Daisuke Nishimura
2009-09-15 0:09 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 24+ messages in thread
From: Daisuke Nishimura @ 2009-09-12 4:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp, d-nishimura
> @@ -1320,6 +1423,9 @@ static int __mem_cgroup_try_charge(struc
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> + /* we don't make stocks if failed */
> + csize = PAGE_SIZE;
> +
> ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> gfp_mask, flags);
> if (ret)
This might be a nitpick, but isn't it better to move the csize modification
before the __GFP_WAIT check?
It might look like:
	/* we don't make stocks if failed */
	if (csize > PAGE_SIZE) {
		csize = PAGE_SIZE;
		continue;
	}

	if (!(gfp_mask & __GFP_WAIT))
		goto nomem;
Thanks,
Daisuke Nishimura.
* Re: [RFC][PATCH 4/4][mmotm] memcg: coalescing charge
2009-09-12 4:58 ` Daisuke Nishimura
@ 2009-09-15 0:09 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-15 0:09 UTC (permalink / raw)
To: nishimura
Cc: Daisuke Nishimura, linux-mm@kvack.org, balbir@linux.vnet.ibm.com
On Sat, 12 Sep 2009 13:58:25 +0900
Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> wrote:
> > @@ -1320,6 +1423,9 @@ static int __mem_cgroup_try_charge(struc
> > if (!(gfp_mask & __GFP_WAIT))
> > goto nomem;
> >
> > + /* we don't make stocks if failed */
> > + csize = PAGE_SIZE;
> > +
> > ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > gfp_mask, flags);
> > if (ret)
> It might be a nitpick though, isn't it better to move csize modification
> before checking __GFP_WAIT ?
> It might look like:
>
> 	/* we don't make stocks if failed */
> 	if (csize > PAGE_SIZE) {
> 		csize = PAGE_SIZE;
> 		continue;
> 	}
>
> 	if (!(gfp_mask & __GFP_WAIT))
> 		goto nomem;
>
Hmm, ok. Thank you.
Because we're in the merge window, I won't push this series, but I will post
a new version just to show the updates.
Thanks,
-Kame
> Thanks,
> Daisuke Nishimura.
>
* [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18)
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2009-09-09 20:30 ` [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 Balbir Singh
@ 2009-09-18 8:47 ` KAMEZAWA Hiroyuki
2009-09-18 8:50 ` [RFC][PATCH 1/11] memcg: clean up softlimit uncharge KAMEZAWA Hiroyuki
` (11 more replies)
5 siblings, 12 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Posting just to dump my stack; please take a look if you have time.
(will repost, this set is not meant for merging)
Because my office is closed until next Thursday, my RTT will be long for a while.
The patches are mainly in 3 parts:
- soft-limit modification (1,2)
- coalescing charges (3,4)
- cleanups (5-11)
These days, I feel I have to make memcontrol.c cleaner.
Some comments are old and the placement of functions is random.
The patches are still messy, but please look at the applied result if you are interested.
1. fix up softlimit's uncharge path
2. fix up softlimit's charge path
3. coalescing uncharge path
4. coalescing charge path
5. memcg_charge_cancel ....from Nishimura's set. this is very nice.
6. clean up percpu statistics of memcg.
7. clean up mem_cgroup_from_xxxx functions.
8. adds commentary and remove unused macros.
9. clean up for mem_cgroup's per-zone stat
10. adds commentary for soft-limit and moves functions for per-cpu
11. misc. commentary and function replacement...not sorted out well.
Patches 6-11 may sound like bad news for Nishimura-san, but I guess
you won't have any heavy hunks to deal with...
Thanks,
-Kame
* [RFC][PATCH 1/11] memcg: clean up softlimit uncharge
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
@ 2009-09-18 8:50 ` KAMEZAWA Hiroyuki
2009-09-18 8:52 ` [RFC][PATCH 2/11]memcg: reduce res_counter_soft_limit_excess KAMEZAWA Hiroyuki
` (10 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
No changes from previous one.
==
This patch clean up/fixes for memcg's uncharge soft limit path.
This is also a preparation for batched uncharge.
Problems:
Now, res_counter_charge()/uncharge() handles softlimit information at
charge/uncharge and softlimit-check is done when event counter per memcg
goes over limit. But event counter per memcg is updated only when
when memcg is over soft limit. But ancerstors are handled in charge path
but not in uncharge path.
For batched charge/uncharge, event counter check should be more strict.
Prolems:
1. memcg's event counter incremented only when softlimit hits. That's bad.
It makes event counter hard to be reused for other purpose.
2. At uncharge, only the lowest level rescounter is handled. This is bug.
Because ancesotor's event counter is not incremented, children should
take care of them.
3. res_counter_uncharge()'s 3rd argument is NULL in most case.
ops under res_counter->lock should be small. No "if" sentense is better.
Fixes:
* Removed soft_limit_xx poitner and checsk from charge and uncharge.
Do-check-only-when-necessary scheme works enough well without them.
* make event-counter of memcg checked at every charge/uncharge.
(per-cpu area will be accessed soon anyway)
* All ancestors are checked at soft-limit-check. This is necessary because
ancesotor's event counter may never be modified. Then, they should be
checked at the same time.
Todo;
We may need to modify EVENT_COUNTER_THRESH of parent with many children.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/res_counter.h | 6 --
kernel/res_counter.c | 18 ------
mm/memcontrol.c | 115 +++++++++++++++++++-------------------------
3 files changed, 55 insertions(+), 84 deletions(-)
Index: mmotm-2.6.31-Sep3/kernel/res_counter.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/kernel/res_counter.c
+++ mmotm-2.6.31-Sep3/kernel/res_counter.c
@@ -37,27 +37,17 @@ int res_counter_charge_locked(struct res
}
int res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at,
- struct res_counter **soft_limit_fail_at)
+ struct res_counter **limit_fail_at)
{
int ret;
unsigned long flags;
struct res_counter *c, *u;
*limit_fail_at = NULL;
- if (soft_limit_fail_at)
- *soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
- /*
- * With soft limits, we return the highest ancestor
- * that exceeds its soft limit
- */
- if (soft_limit_fail_at &&
- !res_counter_soft_limit_check_locked(c))
- *soft_limit_fail_at = c;
spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
@@ -85,8 +75,7 @@ void res_counter_uncharge_locked(struct
counter->usage -= val;
}
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
- bool *was_soft_limit_excess)
+void res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
unsigned long flags;
struct res_counter *c;
@@ -94,9 +83,6 @@ void res_counter_uncharge(struct res_cou
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
- if (was_soft_limit_excess)
- *was_soft_limit_excess =
- !res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
spin_unlock(&c->lock);
}
Index: mmotm-2.6.31-Sep3/include/linux/res_counter.h
===================================================================
--- mmotm-2.6.31-Sep3.orig/include/linux/res_counter.h
+++ mmotm-2.6.31-Sep3/include/linux/res_counter.h
@@ -114,8 +114,7 @@ void res_counter_init(struct res_counter
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val, struct res_counter **limit_fail_at,
- struct res_counter **soft_limit_at);
+ unsigned long val, struct res_counter **limit_fail_at);
/*
* uncharge - tell that some portion of the resource is released
@@ -128,8 +127,7 @@ int __must_check res_counter_charge(stru
*/
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
-void res_counter_uncharge(struct res_counter *counter, unsigned long val,
- bool *was_soft_limit_excess);
+void res_counter_uncharge(struct res_counter *counter, unsigned long val);
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
Index: mmotm-2.6.31-Sep3/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep3.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep3/mm/memcontrol.c
@@ -353,16 +353,6 @@ __mem_cgroup_remove_exceeded(struct mem_
}
static void
-mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_insert_exceeded(mem, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-static void
mem_cgroup_remove_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
struct mem_cgroup_tree_per_zone *mctz)
@@ -392,34 +382,40 @@ static bool mem_cgroup_soft_limit_check(
static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
{
- unsigned long long prev_usage_in_excess, new_usage_in_excess;
- bool updated_tree = false;
+ unsigned long long new_usage_in_excess;
struct mem_cgroup_per_zone *mz;
struct mem_cgroup_tree_per_zone *mctz;
-
- mz = mem_cgroup_zoneinfo(mem, page_to_nid(page), page_zonenum(page));
+ int nid = page_to_nid(page);
+ int zid = page_zonenum(page);
mctz = soft_limit_tree_from_page(page);
/*
- * We do updates in lazy mode, mem's are removed
- * lazily from the per-zone, per-node rb tree
+ * Necessary to update all ancestors when hierarchy is used.
+ * because their event counter is not touched.
*/
- prev_usage_in_excess = mz->usage_in_excess;
-
- new_usage_in_excess = res_counter_soft_limit_excess(&mem->res);
- if (prev_usage_in_excess) {
- mem_cgroup_remove_exceeded(mem, mz, mctz);
- updated_tree = true;
- }
- if (!new_usage_in_excess)
- goto done;
- mem_cgroup_insert_exceeded(mem, mz, mctz);
-
-done:
- if (updated_tree) {
- spin_lock(&mctz->lock);
- mz->usage_in_excess = new_usage_in_excess;
- spin_unlock(&mctz->lock);
+ for (; mem; mem = parent_mem_cgroup(mem)) {
+ mz = mem_cgroup_zoneinfo(mem, nid, zid);
+ new_usage_in_excess =
+ res_counter_soft_limit_excess(&mem->res);
+ /*
+ * We have to update the tree if mz is on RB-tree or
+ * mem is over its softlimit.
+ */
+ if (new_usage_in_excess || mz->on_tree) {
+ spin_lock(&mctz->lock);
+ /* if on-tree, remove it */
+ if (mz->on_tree)
+ __mem_cgroup_remove_exceeded(mem, mz, mctz);
+ /*
+ * if over soft limit, insert again. mz->usage_in_excess
+ * will be updated properly.
+ */
+ if (new_usage_in_excess)
+ __mem_cgroup_insert_exceeded(mem, mz, mctz);
+ else
+ mz->usage_in_excess = 0;
+ spin_unlock(&mctz->lock);
+ }
}
}
@@ -1270,9 +1266,9 @@ static int __mem_cgroup_try_charge(struc
gfp_t gfp_mask, struct mem_cgroup **memcg,
bool oom, struct page *page)
{
- struct mem_cgroup *mem, *mem_over_limit, *mem_over_soft_limit;
+ struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
- struct res_counter *fail_res, *soft_fail_res = NULL;
+ struct res_counter *fail_res;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1304,17 +1300,16 @@ static int __mem_cgroup_try_charge(struc
if (mem_cgroup_is_root(mem))
goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res,
- &soft_fail_res);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res, NULL);
+ &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1353,16 +1348,11 @@ static int __mem_cgroup_try_charge(struc
}
}
/*
- * Insert just the ancestor, we should trickle down to the correct
- * cgroup for reclaim, since the other nodes will be below their
- * soft limit
- */
- if (soft_fail_res) {
- mem_over_soft_limit =
- mem_cgroup_from_res_counter(soft_fail_res, res);
- if (mem_cgroup_soft_limit_check(mem_over_soft_limit))
- mem_cgroup_update_tree(mem_over_soft_limit, page);
- }
+ * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
+ * if they exceeds softlimit.
+ */
+ if (mem_cgroup_soft_limit_check(mem))
+ mem_cgroup_update_tree(mem, page);
done:
return 0;
nomem:
@@ -1437,10 +1427,9 @@ static void __mem_cgroup_commit_charge(s
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE,
- NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
css_put(&mem->css);
return;
@@ -1519,7 +1508,7 @@ static int mem_cgroup_move_account(struc
goto out;
if (!mem_cgroup_is_root(from))
- res_counter_uncharge(&from->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&from->res, PAGE_SIZE);
mem_cgroup_charge_statistics(from, pc, false);
page = pc->page;
@@ -1539,7 +1528,7 @@ static int mem_cgroup_move_account(struc
}
if (do_swap_account && !mem_cgroup_is_root(from))
- res_counter_uncharge(&from->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&from->memsw, PAGE_SIZE);
css_put(&from->css);
css_get(&to->css);
@@ -1610,9 +1599,9 @@ uncharge:
css_put(&parent->css);
/* uncharge if move fails */
if (!mem_cgroup_is_root(parent)) {
- res_counter_uncharge(&parent->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&parent->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&parent->memsw, PAGE_SIZE);
}
return ret;
}
@@ -1803,8 +1792,7 @@ __mem_cgroup_commit_charge_swapin(struct
* calling css_tryget
*/
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE,
- NULL);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
@@ -1831,9 +1819,9 @@ void mem_cgroup_cancel_charge_swapin(str
if (!mem)
return;
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
css_put(&mem->css);
}
@@ -1848,7 +1836,6 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
struct mem_cgroup_per_zone *mz;
- bool soft_limit_excess = false;
if (mem_cgroup_disabled())
return NULL;
@@ -1888,10 +1875,10 @@ __mem_cgroup_uncharge_common(struct page
}
if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE, &soft_limit_excess);
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
if (do_swap_account &&
(ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
}
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
@@ -1908,7 +1895,7 @@ __mem_cgroup_uncharge_common(struct page
mz = page_cgroup_zoneinfo(pc);
unlock_page_cgroup(pc);
- if (soft_limit_excess && mem_cgroup_soft_limit_check(mem))
+ if (mem_cgroup_soft_limit_check(mem))
mem_cgroup_update_tree(mem, page);
/* at swapout, this memcg will be accessed to record to swap */
if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
@@ -1986,7 +1973,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
* This memcg can be obsolete one. We avoid calling css_tryget
*/
if (!mem_cgroup_is_root(memcg))
- res_counter_uncharge(&memcg->memsw, PAGE_SIZE, NULL);
+ res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
mem_cgroup_swap_statistics(memcg, false);
mem_cgroup_put(memcg);
}
* [RFC][PATCH 2/11]memcg: reduce res_counter_soft_limit_excess
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
2009-09-18 8:50 ` [RFC][PATCH 1/11] memcg: clean up softlimit uncharge KAMEZAWA Hiroyuki
@ 2009-09-18 8:52 ` KAMEZAWA Hiroyuki
2009-09-18 8:53 ` [RFC][PATCH 3/11] memcg: coalescing uncharge KAMEZAWA Hiroyuki
` (9 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
In the charge/uncharge/reclaim paths, usage_in_excess is calculated repeatedly, and
each calculation takes res_counter's spin_lock.
This patch removes unnecessary calls to res_counter_soft_limit_excess().
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -313,7 +313,8 @@ soft_limit_tree_from_page(struct page *p
static void
__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
+ struct mem_cgroup_tree_per_zone *mctz,
+ unsigned long long new_usage_in_excess)
{
struct rb_node **p = &mctz->rb_root.rb_node;
struct rb_node *parent = NULL;
@@ -322,7 +323,9 @@ __mem_cgroup_insert_exceeded(struct mem_
if (mz->on_tree)
return;
- mz->usage_in_excess = res_counter_soft_limit_excess(&mem->res);
+ mz->usage_in_excess = new_usage_in_excess;
+ if (!mz->usage_in_excess)
+ return;
while (*p) {
parent = *p;
mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
@@ -382,7 +385,7 @@ static bool mem_cgroup_soft_limit_check(
static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
{
- unsigned long long new_usage_in_excess;
+ unsigned long long excess;
struct mem_cgroup_per_zone *mz;
struct mem_cgroup_tree_per_zone *mctz;
int nid = page_to_nid(page);
@@ -395,25 +398,21 @@ static void mem_cgroup_update_tree(struc
*/
for (; mem; mem = parent_mem_cgroup(mem)) {
mz = mem_cgroup_zoneinfo(mem, nid, zid);
- new_usage_in_excess =
- res_counter_soft_limit_excess(&mem->res);
+ excess = res_counter_soft_limit_excess(&mem->res);
/*
* We have to update the tree if mz is on RB-tree or
* mem is over its softlimit.
*/
- if (new_usage_in_excess || mz->on_tree) {
+ if (excess || mz->on_tree) {
spin_lock(&mctz->lock);
/* if on-tree, remove it */
if (mz->on_tree)
__mem_cgroup_remove_exceeded(mem, mz, mctz);
/*
- * if over soft limit, insert again. mz->usage_in_excess
- * will be updated properly.
+ * Insert again. mz->usage_in_excess will be updated.
+ * If excess is 0, no tree ops.
*/
- if (new_usage_in_excess)
- __mem_cgroup_insert_exceeded(mem, mz, mctz);
- else
- mz->usage_in_excess = 0;
+ __mem_cgroup_insert_exceeded(mem, mz, mctz, excess);
spin_unlock(&mctz->lock);
}
}
@@ -2220,6 +2219,7 @@ unsigned long mem_cgroup_soft_limit_recl
unsigned long reclaimed;
int loop = 0;
struct mem_cgroup_tree_per_zone *mctz;
+ unsigned long long excess;
if (order > 0)
return 0;
@@ -2271,9 +2271,8 @@ unsigned long mem_cgroup_soft_limit_recl
break;
} while (1);
}
- mz->usage_in_excess =
- res_counter_soft_limit_excess(&mz->mem->res);
__mem_cgroup_remove_exceeded(mz->mem, mz, mctz);
+ excess = res_counter_soft_limit_excess(&mz->mem->res);
/*
* One school of thought says that we should not add
* back the node to the tree if reclaim returns 0.
@@ -2282,8 +2281,8 @@ unsigned long mem_cgroup_soft_limit_recl
* memory to reclaim from. Consider this as a longer
* term TODO.
*/
- if (mz->usage_in_excess)
- __mem_cgroup_insert_exceeded(mz->mem, mz, mctz);
+ /* If excess == 0, no tree ops */
+ __mem_cgroup_insert_exceeded(mz->mem, mz, mctz, excess);
spin_unlock(&mctz->lock);
css_put(&mz->mem->css);
loop++;
* [RFC][PATCH 3/11] memcg: coalescing uncharge
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
2009-09-18 8:50 ` [RFC][PATCH 1/11] memcg: clean up softlimit uncharge KAMEZAWA Hiroyuki
2009-09-18 8:52 ` [RFC][PATCH 2/11]memcg: reduce res_counter_soft_limit_excess KAMEZAWA Hiroyuki
@ 2009-09-18 8:53 ` KAMEZAWA Hiroyuki
2009-09-18 8:54 ` [RFC][PATCH 4/11] memcg: coalescing charge KAMEZAWA Hiroyuki
` (8 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:53 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
No changes from previous one.
==
In a massively parallel environment, res_counter can be a performance bottleneck.
This patch is a trial for reducing lock contention.
One strong technique to reduce lock contention is to reduce the number of calls by
batching several calls into one.
Considering charge/uncharge characteristics:
- charge is done one by one via demand paging.
- uncharge is done
  - in chunks at munmap, truncate, exit, execve...
  - one by one via vmscan/paging.
It seems we have a chance to do batched uncharge.
This patch is a base patch for batched uncharge. To avoid
scattering memcg's structures, this patch adds memcg batch-uncharge
information to the task. Please see the start/end usage at the call sites in the diff below.
The degree of coalescing depends on the caller:
- at invalidate/truncate... pagevec size
- at unmap... ZAP_BLOCK_SIZE
(memory itself is freed at this granularity.)
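To make the intended call pattern concrete, a caller that frees a run of pages
brackets its loop roughly as below. This is an illustrative fragment of the
pattern only, not a complete compilable unit; the real call sites added here
are unmap_page_range() and the truncate/invalidate loops in the diff.

	mem_cgroup_uncharge_start();	/* current->memcg_batch.do_batch++ */
	for (i = 0; i < pagevec_count(&pvec); i++) {
		struct page *page = pvec.pages[i];
		/*
		 * Each uncharge now only accumulates PAGE_SIZE in
		 * current->memcg_batch instead of hitting res_counter,
		 * as long as the pages belong to the same memcg.
		 */
		mem_cgroup_uncharge_cache_page(page);
	}
	mem_cgroup_uncharge_end();	/* a single res_counter_uncharge() of the total */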
Changelog(now):
- no changes from previous version.
Changelog(old):
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 13 ++++++
include/linux/sched.h | 7 +++
mm/memcontrol.c | 91 ++++++++++++++++++++++++++++++++++++++++++---
mm/memory.c | 2
mm/truncate.c | 6 ++
5 files changed, 113 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep17/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.31-Sep17.orig/include/linux/memcontrol.h
+++ mmotm-2.6.31-Sep17/include/linux/memcontrol.h
@@ -54,6 +54,11 @@ extern void mem_cgroup_rotate_lru_list(s
extern void mem_cgroup_del_lru(struct page *page);
extern void mem_cgroup_move_lists(struct page *page,
enum lru_list from, enum lru_list to);
+
+/* For coalescing uncharge for reducing memcg' overhead*/
+extern void mem_cgroup_uncharge_start(void);
+extern void mem_cgroup_uncharge_end(void);
+
extern void mem_cgroup_uncharge_page(struct page *page);
extern void mem_cgroup_uncharge_cache_page(struct page *page);
extern int mem_cgroup_shmem_charge_fallback(struct page *page,
@@ -151,6 +156,14 @@ static inline void mem_cgroup_cancel_cha
{
}
+static inline void mem_cgroup_uncharge_start(void)
+{
+}
+
+static inline void mem_cgroup_uncharge_end(void)
+{
+}
+
static inline void mem_cgroup_uncharge_page(struct page *page)
{
}
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -1825,6 +1825,49 @@ void mem_cgroup_cancel_charge_swapin(str
css_put(&mem->css);
}
+static void
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+{
+ struct memcg_batch_info *batch = NULL;
+ bool uncharge_memsw = true;
+ /* If swapout, usage of swap doesn't decrease */
+ if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+ uncharge_memsw = false;
+ /*
+ * do_batch > 0 when unmapping pages or inode invalidate/truncate.
+ * In those cases, all pages freed continously can be expected to be in
+ * the same cgroup and we have chance to coalesce uncharges.
+ * And, we do uncharge one by one if this is killed by OOM.
+ */
+ if (!current->memcg_batch.do_batch || test_thread_flag(TIF_MEMDIE))
+ goto direct_uncharge;
+
+ batch = ¤t->memcg_batch;
+ /*
+ * In usual, we do css_get() when we remember memcg pointer.
+ * But in this case, we keep res->usage until end of a series of
+ * uncharges. Then, it's ok to ignore memcg's refcnt.
+ */
+ if (!batch->memcg)
+ batch->memcg = mem;
+ /*
+ * In typical case, batch->memcg == mem. This means we can
+ * merge a series of uncharges to an uncharge of res_counter.
+ * If not, we uncharge res_counter ony by one.
+ */
+ if (batch->memcg != mem)
+ goto direct_uncharge;
+ /* remember freed charge and uncharge it later */
+ batch->pages += PAGE_SIZE;
+ if (uncharge_memsw)
+ batch->memsw += PAGE_SIZE;
+ return;
+direct_uncharge:
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (uncharge_memsw)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ return;
+}
/*
* uncharge if !page_mapped(page)
@@ -1873,12 +1916,8 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account &&
- (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT))
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
+ if (!mem_cgroup_is_root(mem))
+ __do_uncharge(mem, ctype);
if (ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
mem_cgroup_swap_statistics(mem, true);
mem_cgroup_charge_statistics(mem, pc, false);
@@ -1924,6 +1963,46 @@ void mem_cgroup_uncharge_cache_page(stru
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
+/*
+ * batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
+ * In that cases, pages are freed continuously and we can expect pages
+ * are in the same memcg. All these calls itself limits the number of
+ * pages freed at once, then uncharge_start/end() is called properly.
+ */
+
+void mem_cgroup_uncharge_start(void)
+{
+ if (!current->memcg_batch.do_batch) {
+ current->memcg_batch.memcg = NULL;
+ current->memcg_batch.pages = 0;
+ current->memcg_batch.memsw = 0;
+ }
+ current->memcg_batch.do_batch++;
+}
+
+void mem_cgroup_uncharge_end(void)
+{
+ struct mem_cgroup *mem;
+
+ if (!current->memcg_batch.do_batch)
+ return;
+
+ current->memcg_batch.do_batch--;
+ if (current->memcg_batch.do_batch) /* Nested ? */
+ return;
+
+ mem = current->memcg_batch.memcg;
+ if (!mem)
+ return;
+ /* This "mem" is valid bacause we hide charges behind us. */
+ if (current->memcg_batch.pages)
+ res_counter_uncharge(&mem->res, current->memcg_batch.pages);
+ if (current->memcg_batch.memsw)
+ res_counter_uncharge(&mem->memsw, current->memcg_batch.memsw);
+ /* Not necessary. but forget this pointer */
+ current->memcg_batch.memcg = NULL;
+}
+
#ifdef CONFIG_SWAP
/*
* called after __delete_from_swap_cache() and drop "page" account.
Index: mmotm-2.6.31-Sep17/include/linux/sched.h
===================================================================
--- mmotm-2.6.31-Sep17.orig/include/linux/sched.h
+++ mmotm-2.6.31-Sep17/include/linux/sched.h
@@ -1534,6 +1534,13 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
unsigned long stack_start;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR /* memcg uses this to do batch job */
+ struct memcg_batch_info {
+ int do_batch;
+ struct mem_cgroup *memcg;
+ long pages, memsw;
+ } memcg_batch;
+#endif
};
/* Future-safe accessor for struct task_struct's cpus_allowed. */
Index: mmotm-2.6.31-Sep17/mm/memory.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memory.c
+++ mmotm-2.6.31-Sep17/mm/memory.c
@@ -939,6 +939,7 @@ static unsigned long unmap_page_range(st
details = NULL;
BUG_ON(addr >= end);
+ mem_cgroup_uncharge_start();
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -951,6 +952,7 @@ static unsigned long unmap_page_range(st
zap_work, details);
} while (pgd++, addr = next, (addr != end && *zap_work > 0));
tlb_end_vma(tlb, vma);
+ mem_cgroup_uncharge_end();
return addr;
}
Index: mmotm-2.6.31-Sep17/mm/truncate.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/truncate.c
+++ mmotm-2.6.31-Sep17/mm/truncate.c
@@ -272,6 +272,7 @@ void truncate_inode_pages_range(struct a
pagevec_release(&pvec);
break;
}
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -286,6 +287,7 @@ void truncate_inode_pages_range(struct a
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
}
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -327,6 +329,7 @@ unsigned long invalidate_mapping_pages(s
pagevec_init(&pvec, 0);
while (next <= end &&
pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t index;
@@ -354,6 +357,7 @@ unsigned long invalidate_mapping_pages(s
break;
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
@@ -428,6 +432,7 @@ int invalidate_inode_pages2_range(struct
while (next <= end && !wrapped &&
pagevec_lookup(&pvec, mapping, next,
min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
@@ -477,6 +482,7 @@ int invalidate_inode_pages2_range(struct
unlock_page(page);
}
pagevec_release(&pvec);
+ mem_cgroup_uncharge_end();
cond_resched();
}
return ret;
* [RFC][PATCH 4/11] memcg: coalescing charge
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2009-09-18 8:53 ` [RFC][PATCH 3/11] memcg: coalescing uncharge KAMEZAWA Hiroyuki
@ 2009-09-18 8:54 ` KAMEZAWA Hiroyuki
2009-09-18 8:55 ` [RFC][PATCH 5/11] memcg: clean up cancel charge KAMEZAWA Hiroyuki
` (7 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:54 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Applied Nishimura-san's comment.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This is a patch for coalescing access to res_counter at charge time by using a
percpu cache. At charge, memcg charges 64 pages and remembers the surplus in a percpu cache.
Because it's a cache, it is drained/flushed when necessary.
This version uses a public percpu area.
There are 2 benefits to using a public percpu area:
1. The sum of stocked charges in the system is limited by the number of cpus,
not by the number of memcgs. This gives better synchronization.
2. The drain code for flush/cpu-hotplug is very easy (and quick).
The most important point of this patch is that we never touch res_counter
in the fast path. The res_counter is a system-wide shared counter which is modified
very frequently. We should touch it as little as we can to avoid false sharing.
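As a rough illustration of the intended effect (using the CHARGE_SIZE of
64 * PAGE_SIZE defined below): a task that keeps charging in the same memcg
takes the res_counter spinlock only about once per 64 pages instead of once per
page, because the other charges are satisfied from the percpu stock by
consume_stock(). In the steady state that is roughly a 64x reduction in
shared-counter lock traffic on the charge fast path.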
Changelog (latest):
- moved the charge size check before the __GFP_WAIT check to avoid unnecessary
memory allocation failures.
Changelog (old):
- added asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 126 +++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 120 insertions(+), 6 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include "internal.h"
#include <asm/uaccess.h>
@@ -275,6 +276,7 @@ enum charge_type {
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
+static void drain_all_stock_async(void);
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1136,6 +1138,8 @@ static int mem_cgroup_hierarchical_recla
victim = mem_cgroup_select_victim(root_mem);
if (victim == root_mem) {
loop++;
+ if (loop >= 1)
+ drain_all_stock_async();
if (loop >= 2) {
/*
* If we have not been able to reclaim
@@ -1257,6 +1261,102 @@ done:
unlock_page_cgroup(pc);
}
+#define CHARGE_SIZE (64 * PAGE_SIZE)
+struct memcg_stock_pcp {
+ struct mem_cgroup *cached;
+ int charge;
+ struct work_struct work;
+};
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_MUTEX(memcg_drain_mutex);
+
+static bool consume_stock(struct mem_cgroup *mem)
+{
+ struct memcg_stock_pcp *stock;
+ bool ret = true;
+
+ stock = &get_cpu_var(memcg_stock);
+ if (mem == stock->cached && stock->charge)
+ stock->charge -= PAGE_SIZE;
+ else
+ ret = false;
+ put_cpu_var(memcg_stock);
+ return ret;
+}
+
+static void drain_stock(struct memcg_stock_pcp *stock)
+{
+ struct mem_cgroup *old = stock->cached;
+
+ if (stock->charge) {
+ res_counter_uncharge(&old->res, stock->charge);
+ if (do_swap_account)
+ res_counter_uncharge(&old->memsw, stock->charge);
+ }
+ stock->cached = NULL;
+ stock->charge = 0;
+}
+
+static void drain_local_stock(struct work_struct *dummy)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+ drain_stock(stock);
+ put_cpu_var(memcg_stock);
+}
+
+static void refill_stock(struct mem_cgroup *mem, int val)
+{
+ struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
+
+ if (stock->cached != mem) {
+ drain_stock(stock);
+ stock->cached = mem;
+ }
+ stock->charge += val;
+ put_cpu_var(memcg_stock);
+}
+
+static void drain_all_stock_async(void)
+{
+ int cpu;
+ /* Contention means someone tries to flush. */
+ if (!mutex_trylock(&memcg_drain_mutex))
+ return;
+ for_each_online_cpu(cpu) {
+ struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+ if (work_pending(&stock->work))
+ continue;
+ INIT_WORK(&stock->work, drain_local_stock);
+ schedule_work_on(cpu, &stock->work);
+ }
+ mutex_unlock(&memcg_drain_mutex);
+ /* We don't wait for flush_work */
+}
+
+static void drain_all_stock_sync(void)
+{
+ /* called when force_empty is called */
+ mutex_lock(&memcg_drain_mutex);
+ schedule_on_each_cpu(drain_local_stock);
+ mutex_unlock(&memcg_drain_mutex);
+}
+
+static int __cpuinit memcg_stock_cpu_callback(struct notifier_block *nb,
+ unsigned long action,
+ void *hcpu)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpu = (unsigned long)hcpu;
+ struct memcg_stock_pcp *stock;
+
+ if (action != CPU_DEAD)
+ return NOTIFY_OK;
+ stock = &per_cpu(memcg_stock, cpu);
+ drain_stock(stock);
+#endif
+ return NOTIFY_OK;
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -1268,6 +1368,7 @@ static int __mem_cgroup_try_charge(struc
struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct res_counter *fail_res;
+ int csize = CHARGE_SIZE;
if (unlikely(test_thread_flag(TIF_MEMDIE))) {
/* Don't account this! */
@@ -1292,23 +1393,25 @@ static int __mem_cgroup_try_charge(struc
return 0;
VM_BUG_ON(css_is_removed(&mem->css));
+ if (mem_cgroup_is_root(mem))
+ goto done;
while (1) {
int ret = 0;
unsigned long flags = 0;
- if (mem_cgroup_is_root(mem))
- goto done;
- ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
+ if (consume_stock(mem))
+ goto charged;
+
+ ret = res_counter_charge(&mem->res, csize, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
- ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
- &fail_res);
+ ret = res_counter_charge(&mem->memsw, csize, &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
- res_counter_uncharge(&mem->res, PAGE_SIZE);
+ res_counter_uncharge(&mem->res, csize);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
memsw);
@@ -1317,6 +1420,12 @@ static int __mem_cgroup_try_charge(struc
mem_over_limit = mem_cgroup_from_res_counter(fail_res,
res);
+
+ /* reduce request size and retry */
+ if (csize > PAGE_SIZE) {
+ csize = PAGE_SIZE;
+ continue;
+ }
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
@@ -1346,6 +1455,9 @@ static int __mem_cgroup_try_charge(struc
goto nomem;
}
}
+ if (csize > PAGE_SIZE)
+ refill_stock(mem, csize - PAGE_SIZE);
+charged:
/*
* Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
* if they exceeds softlimit.
@@ -2462,6 +2574,7 @@ move_account:
goto out;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
+ drain_all_stock_sync();
ret = 0;
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -3180,6 +3293,7 @@ mem_cgroup_create(struct cgroup_subsys *
root_mem_cgroup = mem;
if (mem_cgroup_soft_limit_tree_init())
goto free_out;
+ hotcpu_notifier(memcg_stock_cpu_callback, 0);
} else {
parent = mem_cgroup_from_cont(cont->parent);
* [RFC][PATCH 5/11] memcg: clean up cancel charge
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2009-09-18 8:54 ` [RFC][PATCH 4/11] memcg: coalescing charge KAMEZAWA Hiroyuki
@ 2009-09-18 8:55 ` KAMEZAWA Hiroyuki
2009-09-18 8:57 ` [RFC][PATCH 6/11] memcg: clean up percpu statistics KAMEZAWA Hiroyuki
` (6 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
From Nishimura-san's set.
==
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
There are some places that call both res_counter_uncharge() and css_put()
to cancel the charge and the refcnt we have got from mem_cgroup_try_charge().
This patch introduces mem_cgroup_cancel_charge() and calls it in those places.
Modifications from Nishimura's version:
- removed 'inline'
- adjusted for a change in res_counter_uncharge.
- added comment
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/memcontrol.c | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -1472,6 +1472,21 @@ nomem:
}
/*
+ * Somemtimes we have to undo a charge we got by try_charge().
+ * This function is for that and do uncharge, put css's refcnt.
+ * gotten by try_charge().
+ */
+static void __mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+{
+ if (!mem_cgroup_is_root(mem)) {
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ if (do_swap_account)
+ res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+ }
+ css_put(&mem->css);
+}
+
+/*
* A helper function to get mem_cgroup from ID. must be called under
* rcu_read_lock(). The caller must check css_is_removed() or some if
* it's concern. (dropping refcnt from swap can be called against removed
@@ -1537,12 +1552,7 @@ static void __mem_cgroup_commit_charge(s
lock_page_cgroup(pc);
if (unlikely(PageCgroupUsed(pc))) {
unlock_page_cgroup(pc);
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
- css_put(&mem->css);
+ __mem_cgroup_cancel_charge(mem);
return;
}
@@ -1707,13 +1717,7 @@ cancel:
put_page(page);
uncharge:
/* drop extra refcnt by try_charge() */
- css_put(&parent->css);
- /* uncharge if move fails */
- if (!mem_cgroup_is_root(parent)) {
- res_counter_uncharge(&parent->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&parent->memsw, PAGE_SIZE);
- }
+ __mem_cgroup_cancel_charge(parent);
return ret;
}
@@ -1929,12 +1933,7 @@ void mem_cgroup_cancel_charge_swapin(str
return;
if (!mem)
return;
- if (!mem_cgroup_is_root(mem)) {
- res_counter_uncharge(&mem->res, PAGE_SIZE);
- if (do_swap_account)
- res_counter_uncharge(&mem->memsw, PAGE_SIZE);
- }
- css_put(&mem->css);
+ __mem_cgroup_cancel_charge(mem);
}
static void
* [RFC][PATCH 6/11] memcg: clean up percpu statistics
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2009-09-18 8:55 ` [RFC][PATCH 5/11] memcg: clean up cancel charge KAMEZAWA Hiroyuki
@ 2009-09-18 8:57 ` KAMEZAWA Hiroyuki
2009-09-18 8:58 ` [RFC][PATCH 7/11] memcg: rename from_cont to from_cgroup KAMEZAWA Hiroyuki
` (5 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:57 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
mem_cgroup has per-cpu statistics counters. When they were implemented,
the users were limited and a very low-level interface was enough.
But these days there are more users. This patch cleans them up
and adds more precise commentary.
The diff seems big, but this patch mostly moves code around:
it moves the percpu stat access functions after the definition of mem_cgroup.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 224 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 125 insertions(+), 99 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -59,7 +59,7 @@ static DEFINE_MUTEX(memcg_tasklist); /*
#define SOFTLIMIT_EVENTS_THRESH (1000)
/*
- * Statistics for memory cgroup.
+ * Statistics for memory cgroup. accounted per cpu.
*/
enum mem_cgroup_stat_index {
/*
@@ -84,48 +84,6 @@ struct mem_cgroup_stat {
struct mem_cgroup_stat_cpu cpustat[0];
};
-static inline void
-__mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat,
- enum mem_cgroup_stat_index idx)
-{
- stat->count[idx] = 0;
-}
-
-static inline s64
-__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
- enum mem_cgroup_stat_index idx)
-{
- return stat->count[idx];
-}
-
-/*
- * For accounting under irq disable, no need for increment preempt count.
- */
-static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
- enum mem_cgroup_stat_index idx, int val)
-{
- stat->count[idx] += val;
-}
-
-static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
- enum mem_cgroup_stat_index idx)
-{
- int cpu;
- s64 ret = 0;
- for_each_possible_cpu(cpu)
- ret += stat->cpustat[cpu].count[idx];
- return ret;
-}
-
-static s64 mem_cgroup_local_usage(struct mem_cgroup_stat *stat)
-{
- s64 ret;
-
- ret = mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_CACHE);
- ret += mem_cgroup_read_stat(stat, MEM_CGROUP_STAT_RSS);
- return ret;
-}
-
/*
* per-zone information in memory controller.
*/
@@ -278,6 +236,101 @@ static void mem_cgroup_put(struct mem_cg
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
static void drain_all_stock_async(void);
+/*
+ * Functions for acceccing cpu local statistics. modification should be
+ * done under preempt disabled. __mem_cgroup_xxx functions are for low level.
+ */
+static inline void
+__mem_cgroup_stat_reset_local(struct mem_cgroup_stat_cpu *stat,
+ enum mem_cgroup_stat_index idx)
+{
+ stat->count[idx] = 0;
+}
+
+static inline void
+mem_cgroup_stat_reset_local(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx)
+{
+ struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = get_cpu();
+
+ cstat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_reset_local(cstat, idx);
+ put_cpu();
+}
+
+static inline s64
+__mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat,
+ enum mem_cgroup_stat_index idx)
+{
+ return stat->count[idx];
+}
+
+static inline s64
+mem_cgroup_stat_read_local(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx)
+{
+ struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = get_cpu();
+ s64 val;
+
+ cstat = &stat->cpustat[cpu];
+ val = __mem_cgroup_stat_read_local(cstat, idx);
+ put_cpu();
+ return val;
+}
+
+static inline void __mem_cgroup_stat_add_local(struct mem_cgroup_stat_cpu *stat,
+ enum mem_cgroup_stat_index idx, int val)
+{
+ stat->count[idx] += val;
+}
+
+static inline void
+mem_cgroup_stat_add_local(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx, int val)
+{
+ struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = get_cpu();
+
+ cstat = &stat->cpustat[cpu];
+ __mem_cgroup_stat_add_local(cstat, idx, val);
+ put_cpu();
+}
+
+/*
+ * A function for reading sum of all percpu statistics.
+ * Will be slow on big machines.
+ */
+static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_stat_index idx)
+{
+ int cpu;
+ s64 ret = 0;
+ struct mem_cgroup_stat *stat = &mem->stat;
+
+ for_each_possible_cpu(cpu)
+ ret += stat->cpustat[cpu].count[idx];
+ return ret;
+}
+/*
+ * When mem_cgroup is used with hierarchy inheritance enabled, cgroup local
+ * memory usage is just shown by sum of percpu statitics. This function returns
+ * cgroup local memory usage even if it's under hierarchy.
+ */
+static s64 mem_cgroup_local_usage(struct mem_cgroup *mem)
+{
+ s64 ret;
+
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
+ ret += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
+ return ret;
+}
+
+
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
@@ -370,18 +423,13 @@ mem_cgroup_remove_exceeded(struct mem_cg
static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
{
bool ret = false;
- int cpu;
s64 val;
- struct mem_cgroup_stat_cpu *cpustat;
- cpu = get_cpu();
- cpustat = &mem->stat.cpustat[cpu];
- val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS);
+ val = mem_cgroup_stat_read_local(mem, MEM_CGROUP_STAT_EVENTS);
if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) {
- __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS);
+ mem_cgroup_stat_reset_local(mem, MEM_CGROUP_STAT_EVENTS);
ret = true;
}
- put_cpu();
return ret;
}
@@ -480,13 +528,7 @@ static void mem_cgroup_swap_statistics(s
bool charge)
{
int val = (charge) ? 1 : -1;
- struct mem_cgroup_stat *stat = &mem->stat;
- struct mem_cgroup_stat_cpu *cpustat;
- int cpu = get_cpu();
-
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SWAPOUT, val);
- put_cpu();
+ mem_cgroup_stat_add_local(mem, MEM_CGROUP_STAT_SWAPOUT, val);
}
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
@@ -495,22 +537,22 @@ static void mem_cgroup_charge_statistics
{
int val = (charge) ? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
- struct mem_cgroup_stat_cpu *cpustat;
+ struct mem_cgroup_stat_cpu *cstat;
int cpu = get_cpu();
-
- cpustat = &stat->cpustat[cpu];
+ /* for fast access, we use open-coded manner */
+ cstat = &stat->cpustat[cpu];
if (PageCgroupCache(pc))
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
+ __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_CACHE, val);
else
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
+ __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_RSS, val);
if (charge)
- __mem_cgroup_stat_add_safe(cpustat,
+ __mem_cgroup_stat_add_local(cstat,
MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
else
- __mem_cgroup_stat_add_safe(cpustat,
+ __mem_cgroup_stat_add_local(cstat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1);
+ __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_EVENTS, 1);
put_cpu();
}
@@ -1163,7 +1205,11 @@ static int mem_cgroup_hierarchical_recla
}
}
}
- if (!mem_cgroup_local_usage(&victim->stat)) {
+ /*
+ * mem->res can includes children cgroup's memory usage.What
+ * we need to check here is local usage.
+ */
+ if (!mem_cgroup_local_usage(victim)) {
/* this cgroup's local usage == 0 */
css_put(&victim->css);
continue;
@@ -1229,9 +1275,6 @@ static void record_last_oom(struct mem_c
void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
{
struct mem_cgroup *mem;
- struct mem_cgroup_stat *stat;
- struct mem_cgroup_stat_cpu *cpustat;
- int cpu;
struct page_cgroup *pc;
if (!page_is_file_cache(page))
@@ -1249,14 +1292,7 @@ void mem_cgroup_update_mapped_file_stat(
if (!PageCgroupUsed(pc))
goto done;
- /*
- * Preemption is already disabled, we don't need get_cpu()
- */
- cpu = smp_processor_id();
- stat = &mem->stat;
- cpustat = &stat->cpustat[cpu];
-
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE, val);
+ mem_cgroup_stat_add_local(mem, MEM_CGROUP_STAT_MAPPED_FILE, val);
done:
unlock_page_cgroup(pc);
}
@@ -1607,9 +1643,6 @@ static int mem_cgroup_move_account(struc
int nid, zid;
int ret = -EBUSY;
struct page *page;
- int cpu;
- struct mem_cgroup_stat *stat;
- struct mem_cgroup_stat_cpu *cpustat;
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
@@ -1634,18 +1667,11 @@ static int mem_cgroup_move_account(struc
page = pc->page;
if (page_is_file_cache(page) && page_mapped(page)) {
- cpu = smp_processor_id();
- /* Update mapped_file data for mem_cgroup "from" */
- stat = &from->stat;
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
- -1);
-
- /* Update mapped_file data for mem_cgroup "to" */
- stat = &to->stat;
- cpustat = &stat->cpustat[cpu];
- __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_MAPPED_FILE,
- 1);
+ /* decrement mapped_file data for mem_cgroup "from" */
+ mem_cgroup_stat_add_local(from,
+ MEM_CGROUP_STAT_MAPPED_FILE, -1);
+ /* increment mapped_file data for mem_cgroup "to" */
+ mem_cgroup_stat_add_local(to, MEM_CGROUP_STAT_MAPPED_FILE, 1);
}
if (do_swap_account && !mem_cgroup_is_root(from))
@@ -2685,7 +2711,7 @@ static int
mem_cgroup_get_idx_stat(struct mem_cgroup *mem, void *data)
{
struct mem_cgroup_idx_data *d = data;
- d->val += mem_cgroup_read_stat(&mem->stat, d->idx);
+ d->val += mem_cgroup_read_stat(mem, d->idx);
return 0;
}
@@ -2890,18 +2916,18 @@ static int mem_cgroup_get_local_stat(str
s64 val;
/* per cpu stat */
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE);
s->stat[MCS_CACHE] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);
s->stat[MCS_RSS] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_MAPPED_FILE);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_MAPPED_FILE);
s->stat[MCS_MAPPED_FILE] += val * PAGE_SIZE;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGIN_COUNT);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGIN_COUNT);
s->stat[MCS_PGPGIN] += val;
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_PGPGOUT_COUNT);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_PGPGOUT_COUNT);
s->stat[MCS_PGPGOUT] += val;
if (do_swap_account) {
- val = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
s->stat[MCS_SWAP] += val * PAGE_SIZE;
}
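
As background for the hunks above: the statistics are kept in one slot per CPU, each CPU updates only its own slot so updates do not contend on a shared counter, and mem_cgroup_read_stat() folds the slots together at read time. A minimal userspace sketch of that scheme; NR_CPUS, stat_item and the struct names are invented for the demo and are not the kernel's definitions.

/*
 * Standalone sketch of per-cpu statistics summed at read time.
 * NR_CPUS, stat_item and the struct names are invented for this demo;
 * the real code keeps one struct mem_cgroup_stat_cpu per possible CPU.
 */
#include <stdio.h>

#define NR_CPUS 4

enum stat_item { STAT_CACHE, STAT_RSS, NR_STAT_ITEMS };

struct percpu_stat {
        long count[NR_STAT_ITEMS];
};

static struct percpu_stat cpustat[NR_CPUS];

/* writer side: each CPU touches only its own slot */
static void stat_add_local(int cpu, enum stat_item idx, long val)
{
        cpustat[cpu].count[idx] += val;
}

/* reader side: fold all per-cpu slots into one total */
static long stat_read(enum stat_item idx)
{
        long total = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                total += cpustat[cpu].count[idx];
        return total;
}

int main(void)
{
        stat_add_local(0, STAT_RSS, 3);
        stat_add_local(2, STAT_RSS, 5);
        stat_add_local(1, STAT_CACHE, 7);

        printf("rss=%ld cache=%ld\n", stat_read(STAT_RSS), stat_read(STAT_CACHE));
        return 0;
}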
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 7/11] memcg: rename from_cont to from_cgroup
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2009-09-18 8:57 ` [RFC][PATCH 6/11] memcg: cleaun up percpu statistics KAMEZAWA Hiroyuki
@ 2009-09-18 8:58 ` KAMEZAWA Hiroyuki
2009-09-18 9:00 ` [RFC][PATCH 8/11]memcg: remove unused macro and adds commentary KAMEZAWA Hiroyuki
` (4 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 8:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Rename mem_cgroup_from_cont() to mem_cgroup_from_cgroup()
and move the functions for accessing a mem_cgroup to the top of the file.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 127 +++++++++++++++++++++++++++++---------------------------
1 file changed, 67 insertions(+), 60 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -237,6 +237,57 @@ static struct mem_cgroup *parent_mem_cgr
static void drain_all_stock_async(void);
/*
+ * Utility functions for accessing mem_cgroup via various objects.
+ */
+#define mem_cgroup_from_res_counter(counter, member) \
+ container_of(counter, struct mem_cgroup, member)
+
+
+static struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont,
+ mem_cgroup_subsys_id), struct mem_cgroup,
+ css);
+}
+
+/* we get task's mem_cgroup from mm->owner, not this task */
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+ /*
+ * mm_update_next_owner() may clear mm->owner to NULL
+ * if it races with swapoff, page migration, etc.
+ * So this can be called with p == NULL.
+ */
+ if (unlikely(!p))
+ return NULL;
+
+ return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
+ struct mem_cgroup, css);
+}
+
+/* get mem_cgroup from mm_struct and increment css->refcnt */
+static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
+{
+ struct mem_cgroup *mem = NULL;
+
+ if (!mm)
+ return NULL;
+ /*
+ * Because we have no locks, mm->owner's may be being moved to other
+ * cgroup. We use css_tryget() here even if this looks
+ * pessimistic (rather than adding locks here).
+ */
+ rcu_read_lock();
+ do {
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!mem))
+ break;
+ } while (!css_tryget(&mem->css));
+ rcu_read_unlock();
+ return mem;
+}
+
+/*
* Functions for acceccing cpu local statistics. modification should be
* done under preempt disabled. __mem_cgroup_xxx functions are for low level.
*/
@@ -571,48 +622,6 @@ static unsigned long mem_cgroup_get_loca
return total;
}
-static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont)
-{
- return container_of(cgroup_subsys_state(cont,
- mem_cgroup_subsys_id), struct mem_cgroup,
- css);
-}
-
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
- /*
- * mm_update_next_owner() may clear mm->owner to NULL
- * if it races with swapoff, page migration, etc.
- * So this can be called with p == NULL.
- */
- if (unlikely(!p))
- return NULL;
-
- return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
- struct mem_cgroup, css);
-}
-
-static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
-{
- struct mem_cgroup *mem = NULL;
-
- if (!mm)
- return NULL;
- /*
- * Because we have no locks, mm->owner's may be being moved to other
- * cgroup. We use css_tryget() here even if this looks
- * pessimistic (rather than adding locks here).
- */
- rcu_read_lock();
- do {
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!mem))
- break;
- } while (!css_tryget(&mem->css));
- rcu_read_unlock();
- return mem;
-}
-
/*
* Call callback function against all cgroup under hierarchy tree.
*/
@@ -992,8 +1001,6 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
-#define mem_cgroup_from_res_counter(counter, member) \
- container_of(counter, struct mem_cgroup, member)
static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
{
@@ -1712,7 +1719,7 @@ static int mem_cgroup_move_parent(struct
return -EINVAL;
- parent = mem_cgroup_from_cont(pcg);
+ parent = mem_cgroup_from_cgroup(pcg);
ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page);
@@ -2660,25 +2667,25 @@ try_to_free:
int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
{
- return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
+ return mem_cgroup_force_empty(mem_cgroup_from_cgroup(cont), true);
}
static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
{
- return mem_cgroup_from_cont(cont)->use_hierarchy;
+ return mem_cgroup_from_cgroup(cont)->use_hierarchy;
}
static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
u64 val)
{
int retval = 0;
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *mem = mem_cgroup_from_cgroup(cont);
struct cgroup *parent = cont->parent;
struct mem_cgroup *parent_mem = NULL;
if (parent)
- parent_mem = mem_cgroup_from_cont(parent);
+ parent_mem = mem_cgroup_from_cgroup(parent);
cgroup_lock();
/*
@@ -2728,7 +2735,7 @@ mem_cgroup_get_recursive_idx_stat(struct
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *mem = mem_cgroup_from_cgroup(cont);
u64 idx_val, val;
int type, name;
@@ -2774,7 +2781,7 @@ static u64 mem_cgroup_read(struct cgroup
static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
const char *buffer)
{
- struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cont);
int type, name;
unsigned long long val;
int ret;
@@ -2831,7 +2838,7 @@ static void memcg_get_hierarchical_limit
while (cgroup->parent) {
cgroup = cgroup->parent;
- memcg = mem_cgroup_from_cont(cgroup);
+ memcg = mem_cgroup_from_cgroup(cgroup);
if (!memcg->use_hierarchy)
break;
tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
@@ -2850,7 +2857,7 @@ static int mem_cgroup_reset(struct cgrou
struct mem_cgroup *mem;
int type, name;
- mem = mem_cgroup_from_cont(cont);
+ mem = mem_cgroup_from_cgroup(cont);
type = MEMFILE_TYPE(event);
name = MEMFILE_ATTR(event);
switch (name) {
@@ -2954,7 +2961,7 @@ mem_cgroup_get_total_stat(struct mem_cgr
static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft,
struct cgroup_map_cb *cb)
{
- struct mem_cgroup *mem_cont = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *mem_cont = mem_cgroup_from_cgroup(cont);
struct mcs_total_stat mystat;
int i;
@@ -3018,7 +3025,7 @@ static int mem_control_stat_show(struct
static u64 mem_cgroup_swappiness_read(struct cgroup *cgrp, struct cftype *cft)
{
- struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
return get_swappiness(memcg);
}
@@ -3026,7 +3033,7 @@ static u64 mem_cgroup_swappiness_read(st
static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
u64 val)
{
- struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
struct mem_cgroup *parent;
if (val > 100)
@@ -3035,7 +3042,7 @@ static int mem_cgroup_swappiness_write(s
if (cgrp->parent == NULL)
return -EINVAL;
- parent = mem_cgroup_from_cont(cgrp->parent);
+ parent = mem_cgroup_from_cgroup(cgrp->parent);
cgroup_lock();
@@ -3321,7 +3328,7 @@ mem_cgroup_create(struct cgroup_subsys *
hotcpu_notifier(memcg_stock_cpu_callback, 0);
} else {
- parent = mem_cgroup_from_cont(cont->parent);
+ parent = mem_cgroup_from_cgroup(cont->parent);
mem->use_hierarchy = parent->use_hierarchy;
}
@@ -3355,7 +3362,7 @@ free_out:
static int mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *mem = mem_cgroup_from_cgroup(cont);
return mem_cgroup_force_empty(mem, false);
}
@@ -3363,7 +3370,7 @@ static int mem_cgroup_pre_destroy(struct
static void mem_cgroup_destroy(struct cgroup_subsys *ss,
struct cgroup *cont)
{
- struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct mem_cgroup *mem = mem_cgroup_from_cgroup(cont);
mem_cgroup_put(mem);
}
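
The accessors gathered at the top of the file (mem_cgroup_from_cgroup(), mem_cgroup_from_task(), mem_cgroup_from_res_counter()) all rely on the container_of() pattern: given a pointer to a member embedded in struct mem_cgroup, recover the enclosing object. A minimal userspace sketch of that pattern; the struct names below are invented stand-ins, only the pointer arithmetic is the point.

/*
 * Standalone illustration of the container_of() pattern behind
 * mem_cgroup_from_cgroup()/mem_cgroup_from_task(). The struct names
 * are invented stand-ins; only the pointer arithmetic is the point.
 */
#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_css {               /* stands in for cgroup_subsys_state */
        int refcnt;
};

struct demo_memcg {             /* stands in for struct mem_cgroup */
        long usage;
        struct demo_css css;    /* embedded member */
};

static struct demo_memcg *memcg_from_css(struct demo_css *css)
{
        /* recover the enclosing object from a pointer to its member */
        return container_of(css, struct demo_memcg, css);
}

int main(void)
{
        struct demo_memcg m = { .usage = 42 };
        struct demo_css *css = &m.css;  /* what the cgroup core hands back */

        printf("usage=%ld same-object=%d\n",
               memcg_from_css(css)->usage, memcg_from_css(css) == &m);
        return 0;
}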
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 8/11]memcg: remove unused macro and adds commentary
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (6 preceding siblings ...)
2009-09-18 8:58 ` [RFC][PATCH 7/11] memcg: rename from_cont to from_cgroup KAMEZAWA Hiroyuki
@ 2009-09-18 9:00 ` KAMEZAWA Hiroyuki
2009-09-18 9:01 ` [RFC][PATCH 9/11]memcg: clean up zonestat funcs KAMEZAWA Hiroyuki
` (3 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 9:00 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
This patch does
- add more commentary for comment-less fields.
- remove unused flags such as MEM_CGROUP_CHARGE_TYPE_FORCE and PCGF_XXX.
- update the comments for charge_type; SWAPOUT and DROP in particular are
not easy charge types to understand.
- move mem_cgroup_is_root() to the head of the file
(after the mem_cgroup_from_xxx functions).
- move MEM_CGROUP_MAX_RECLAIM_LOOPS near the other reclaim macros.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 84 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 28 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -89,12 +89,16 @@ struct mem_cgroup_stat {
*/
struct mem_cgroup_per_zone {
/*
- * spin_lock to protect the per cgroup LRU
+ * LRU fields are guarded by zone->lru_lock.
*/
struct list_head lists[NR_LRU_LISTS];
unsigned long count[NR_LRU_LISTS];
-
+ /*
+ * Reclaim stat is used for recording statistics of LRU behavior.
+ * This is used by vmscan.c under zone->lru_lock
+ */
struct zone_reclaim_stat reclaim_stat;
+ /* for softlimit tree management */
struct rb_node tree_node; /* RB tree node */
unsigned long long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
@@ -166,55 +170,66 @@ struct mem_cgroup {
spinlock_t reclaim_param_lock;
int prev_priority; /* for recording reclaim priority */
+ unsigned int swappiness; /* a vmscan parameter (see vmscan.c) */
/*
* While reclaiming in a hiearchy, we cache the last child we
* reclaimed from.
*/
int last_scanned_child;
+
+ /* true if hierarchical page accounting is used in this memcg. */
+ bool use_hierarchy;
+
/*
- * Should the accounting and control be hierarchical, per subtree?
+ * For recording jiffies of the last OOM under this memcg.
+ * This is used by mem_cgroup_oom_called(), which is called by
+ * pagefault_out_of_memory() to check whether the OOM was system-wide
+ * or memcg-local.
*/
- bool use_hierarchy;
+
unsigned long last_oom_jiffies;
+ /*
+ * Private refcnt. This is mainly used by swap accounting.
+ * Because we don't move the swap account at destroy(), the mem_cgroup
+ * object must stay alive as a zombie until all references from
+ * swap disappear.
+ */
atomic_t refcnt;
- unsigned int swappiness;
-
- /* set when res.limit == memsw.limit */
+ /*
+ * set when res.limit == memsw.limit. If this is true, swapout
+ * will not help reduce the usage.
+ */
bool memsw_is_minimum;
/*
- * statistics. This must be placed at the end of memcg.
+ * per-cpu statistics. This must be placed at the end of memcg.
*/
struct mem_cgroup_stat stat;
};
/*
- * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
+ * Types of charge/uncharge. memcg's behavior depends on these types.
+ * SWAPOUT is for mem+swap accounting. It's used when a page is dropped
+ * from memory but there is still a valid reference in swap. DROP means
+ * a page is removed from the swap cache and no reference from swap remains.
*/
-#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
enum charge_type {
- MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
- MEM_CGROUP_CHARGE_TYPE_MAPPED,
+ MEM_CGROUP_CHARGE_TYPE_CACHE = 0, /* used when charging page cache */
+ MEM_CGROUP_CHARGE_TYPE_MAPPED, /* used when charging anon pages */
MEM_CGROUP_CHARGE_TYPE_SHMEM, /* used by page migration of shmem */
- MEM_CGROUP_CHARGE_TYPE_FORCE, /* used by force_empty */
MEM_CGROUP_CHARGE_TYPE_SWAPOUT, /* for accounting swapcache */
MEM_CGROUP_CHARGE_TYPE_DROP, /* a page was unused swap cache */
NR_CHARGE_TYPE,
};
-/* only for here (for easy reading.) */
-#define PCGF_CACHE (1UL << PCG_CACHE)
-#define PCGF_USED (1UL << PCG_USED)
-#define PCGF_LOCK (1UL << PCG_LOCK)
-/* Not used, but added here for completeness */
-#define PCGF_ACCT (1UL << PCG_ACCT)
-
-/* for encoding cft->private value on file */
+/*
+ * Because mem_cgroup has 2 controls, mem and mem+swap, there are control files
+ * with similar functions. We use the following encoding macros to record a
+ * control file's type in its cft->private value.
+ */
#define _MEM (0)
#define _MEMSWAP (1)
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
@@ -231,6 +246,13 @@ enum charge_type {
#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
+/*
+ * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define MEM_CGROUP_MAX_RECLAIM_LOOPS (100)
+#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS (2)
+
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
@@ -287,6 +309,17 @@ static struct mem_cgroup *try_get_mem_cg
return mem;
}
+
+/*
+ * Because we don't do "charge/uncharge" in the root cgroup, some
+ * special handling is used.
+ */
+static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
+{
+ return (mem == root_mem_cgroup);
+}
+
+
/*
* Functions for acceccing cpu local statistics. modification should be
* done under preempt disabled. __mem_cgroup_xxx functions are for low level.
@@ -657,11 +690,6 @@ static int mem_cgroup_walk_tree(struct m
return ret;
}
-static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
-{
- return (mem == root_mem_cgroup);
-}
-
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
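
The new comment above the _MEM/_MEMSWAP defines documents how a control file's cft->private packs the resource type into the high 16 bits and the attribute into the low 16 bits, so one read/write handler can serve both resources. A standalone sketch of that encode/decode follows: MEMFILE_PRIVATE is copied from the hunk above, while MEMFILE_TYPE/MEMFILE_ATTR are reconstructed here as its obvious inverse (check memcontrol.c for the exact definitions), and the attribute names are only illustrative.

/*
 * Standalone sketch of the cft->private encoding used by memcg's
 * control files: resource type in the high 16 bits, attribute in the
 * low 16 bits. MEMFILE_PRIVATE matches the macro visible in the diff;
 * MEMFILE_TYPE/MEMFILE_ATTR are written as its obvious inverse and the
 * attribute names are illustrative only.
 */
#include <stdio.h>

#define _MEM                    (0)
#define _MEMSWAP                (1)
#define MEMFILE_PRIVATE(x, val) (((x) << 16) | (val))
#define MEMFILE_TYPE(val)       (((val) >> 16) & 0xffff)
#define MEMFILE_ATTR(val)       ((val) & 0xffff)

enum { RES_USAGE, RES_LIMIT, RES_FAILCNT };     /* illustrative attributes */

int main(void)
{
        int priv = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT);

        /* one read/write handler recovers both halves from one integer */
        printf("type=%d (0=mem, 1=memsw) attr=%d\n",
               MEMFILE_TYPE(priv), MEMFILE_ATTR(priv));
        return 0;
}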
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 9/11]memcg: clean up zonestat funcs
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (7 preceding siblings ...)
2009-09-18 9:00 ` [RFC][PATCH 8/11]memcg: remove unused macro and adds commentary KAMEZAWA Hiroyuki
@ 2009-09-18 9:01 ` KAMEZAWA Hiroyuki
2009-09-18 9:04 ` [RFC][PATCH 10/11][mmotm] memcg: clean up percpu and more commentary for soft limit KAMEZAWA Hiroyuki
` (2 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 9:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This patch does
- rename mem_cgroup_get_local_zonestat() to mem_cgroup_get_lru_stat().
I originally named the function "local", but that "local" was ambiguous:
here it means "not considering hierarchy". get_lru_stat() seems a better name.
- move the zone-statistics related functions after the xxx_cgroup_zoneinfo
function.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 68 ++++++++++++++++++++++++++++----------------------------
1 file changed, 35 insertions(+), 33 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -434,6 +434,32 @@ page_cgroup_zoneinfo(struct page_cgroup
return mem_cgroup_zoneinfo(mem, nid, zid);
}
+unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
+ struct zone *zone,
+ enum lru_list lru)
+{
+ int nid = zone->zone_pgdat->node_id;
+ int zid = zone_idx(zone);
+ struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+ return MEM_CGROUP_ZSTAT(mz, lru);
+}
+
+static unsigned long mem_cgroup_get_lru_stat(struct mem_cgroup *mem,
+ enum lru_list idx)
+{
+ int nid, zid;
+ struct mem_cgroup_per_zone *mz;
+ u64 total = 0;
+
+ for_each_online_node(nid)
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ mz = mem_cgroup_zoneinfo(mem, nid, zid);
+ total += MEM_CGROUP_ZSTAT(mz, idx);
+ }
+ return total;
+}
+
static struct mem_cgroup_tree_per_zone *
soft_limit_tree_node_zone(int nid, int zid)
{
@@ -640,20 +666,6 @@ static void mem_cgroup_charge_statistics
put_cpu();
}
-static unsigned long mem_cgroup_get_local_zonestat(struct mem_cgroup *mem,
- enum lru_list idx)
-{
- int nid, zid;
- struct mem_cgroup_per_zone *mz;
- u64 total = 0;
-
- for_each_online_node(nid)
- for (zid = 0; zid < MAX_NR_ZONES; zid++) {
- mz = mem_cgroup_zoneinfo(mem, nid, zid);
- total += MEM_CGROUP_ZSTAT(mz, idx);
- }
- return total;
-}
/*
* Call callback function against all cgroup under hierarchy tree.
@@ -882,8 +894,8 @@ static int calc_inactive_ratio(struct me
unsigned long gb;
unsigned long inactive_ratio;
- inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_ANON);
- active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_ANON);
+ inactive = mem_cgroup_get_lru_stat(memcg, LRU_INACTIVE_ANON);
+ active = mem_cgroup_get_lru_stat(memcg, LRU_ACTIVE_ANON);
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
@@ -922,22 +934,12 @@ int mem_cgroup_inactive_file_is_low(stru
unsigned long active;
unsigned long inactive;
- inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
- active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+ inactive = mem_cgroup_get_lru_stat(memcg, LRU_INACTIVE_FILE);
+ active = mem_cgroup_get_lru_stat(memcg, LRU_ACTIVE_FILE);
return (active > inactive);
}
-unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
- struct zone *zone,
- enum lru_list lru)
-{
- int nid = zone->zone_pgdat->node_id;
- int zid = zone_idx(zone);
- struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
-
- return MEM_CGROUP_ZSTAT(mz, lru);
-}
struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
struct zone *zone)
@@ -2967,15 +2969,15 @@ static int mem_cgroup_get_local_stat(str
}
/* per zone stat */
- val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
+ val = mem_cgroup_get_lru_stat(mem, LRU_INACTIVE_ANON);
s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
- val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_ANON);
+ val = mem_cgroup_get_lru_stat(mem, LRU_ACTIVE_ANON);
s->stat[MCS_ACTIVE_ANON] += val * PAGE_SIZE;
- val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_FILE);
+ val = mem_cgroup_get_lru_stat(mem, LRU_INACTIVE_FILE);
s->stat[MCS_INACTIVE_FILE] += val * PAGE_SIZE;
- val = mem_cgroup_get_local_zonestat(mem, LRU_ACTIVE_FILE);
+ val = mem_cgroup_get_lru_stat(mem, LRU_ACTIVE_FILE);
s->stat[MCS_ACTIVE_FILE] += val * PAGE_SIZE;
- val = mem_cgroup_get_local_zonestat(mem, LRU_UNEVICTABLE);
+ val = mem_cgroup_get_lru_stat(mem, LRU_UNEVICTABLE);
s->stat[MCS_UNEVICTABLE] += val * PAGE_SIZE;
return 0;
}
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 10/11][mmotm] memcg: clean up percpu and more commentary for soft limit
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (8 preceding siblings ...)
2009-09-18 9:01 ` [RFC][PATCH 9/11]memcg: clean up zonestat funcs KAMEZAWA Hiroyuki
@ 2009-09-18 9:04 ` KAMEZAWA Hiroyuki
2009-09-18 9:06 ` [RFC][PATCH 11/11][mmotm] memcg: more commentary and clean up KAMEZAWA Hiroyuki
2009-09-18 10:37 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) Daisuke Nishimura
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 9:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
Yes, this should be separated into 2 patches...
==
This patch does
- add some commentary on the softlimit code.
- move the per-cpu statistics code right after the percpu stat functions.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 161 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 97 insertions(+), 64 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -56,7 +56,7 @@ static int really_do_swap_account __init
#endif
static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
-#define SOFTLIMIT_EVENTS_THRESH (1000)
+
/*
* Statistics for memory cgroup. accounted per cpu.
@@ -118,8 +118,9 @@ struct mem_cgroup_lru_info {
};
/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
+ * Cgroups above their soft limits are maintained in an RB-tree, independent of
+ * their hierarchy representation. This RB-tree is system-wide but maintained
+ * per zone.
*/
struct mem_cgroup_tree_per_zone {
@@ -415,6 +416,70 @@ static s64 mem_cgroup_local_usage(struct
}
+static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
+ bool charge)
+{
+ int val = (charge) ? 1 : -1;
+ mem_cgroup_stat_add_local(mem, MEM_CGROUP_STAT_SWAPOUT, val);
+}
+
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ bool charge)
+{
+ int val = (charge) ? 1 : -1;
+ struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cstat;
+ int cpu = get_cpu();
+ /* for fast access, we use open-coded manner */
+ cstat = &stat->cpustat[cpu];
+ if (PageCgroupCache(pc))
+ __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_CACHE, val);
+ else
+ __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_RSS, val);
+
+ if (charge)
+ __mem_cgroup_stat_add_local(cstat,
+ MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
+ else
+ __mem_cgroup_stat_add_local(cstat,
+ MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
+ __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_EVENTS, 1);
+ put_cpu();
+}
+
+/*
+ * Currently used to update mapped file statistics, but the routine can be
+ * generalized to update other statistics as well.
+ */
+void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
+{
+ struct mem_cgroup *mem;
+ struct page_cgroup *pc;
+
+ if (!page_is_file_cache(page))
+ return;
+
+ pc = lookup_page_cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock_page_cgroup(pc);
+ mem = pc->mem_cgroup;
+ if (!mem)
+ goto done;
+
+ if (!PageCgroupUsed(pc))
+ goto done;
+
+ mem_cgroup_stat_add_local(mem, MEM_CGROUP_STAT_MAPPED_FILE, val);
+done:
+ unlock_page_cgroup(pc);
+}
+
+/*
+ * For per-zone statistics.
+ */
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
@@ -460,6 +525,17 @@ static unsigned long mem_cgroup_get_zone
return total;
}
+/*
+ * The following functions manage the per-zone memcg softlimit RB-tree.
+ * The tree is system-wide but maintained per zone.
+ */
+
+/*
+ * The soft limit uses a per-cpu event counter for its status check instead
+ * of checking the status at every charge/uncharge.
+ */
+#define SOFTLIMIT_EVENTS_THRESH (1000)
+
static struct mem_cgroup_tree_per_zone *
soft_limit_tree_node_zone(int nid, int zid)
{
@@ -472,9 +548,14 @@ soft_limit_tree_from_page(struct page *p
int nid = page_to_nid(page);
int zid = page_zonenum(page);
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
+ return soft_limit_tree_node_zone(nid, zid);
}
+/*
+ * Insert memcg's per-zone struct onto the softlimit RB-tree. The mz being
+ * inserted must not already be on the tree, and the tree lock (mctz->lock)
+ * must be held by the caller.
+ */
static void
__mem_cgroup_insert_exceeded(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
@@ -530,6 +611,10 @@ mem_cgroup_remove_exceeded(struct mem_cg
spin_unlock(&mctz->lock);
}
+/*
+ * Check the per-cpu EVENT COUNTER. If it's over the threshold, we check
+ * how far memory usage exceeds the soft limit and update the tree.
+ */
static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem)
{
bool ret = false;
@@ -543,6 +628,11 @@ static bool mem_cgroup_soft_limit_check(
return ret;
}
+/*
+ * This function updates soft-limit RB-tree by checking "excess" of
+ * memcgs. When hierarchy is used, all ancestors have to be updated, too.
+ */
+
static void mem_cgroup_update_tree(struct mem_cgroup *mem, struct page *page)
{
unsigned long long excess;
@@ -598,6 +688,9 @@ static inline unsigned long mem_cgroup_g
return res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
}
+/*
+ * Check a zone's RB-tree and find the memcg with the largest "excess"
+ */
static struct mem_cgroup_per_zone *
__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
{
@@ -634,38 +727,6 @@ mem_cgroup_largest_soft_limit_node(struc
return mz;
}
-static void mem_cgroup_swap_statistics(struct mem_cgroup *mem,
- bool charge)
-{
- int val = (charge) ? 1 : -1;
- mem_cgroup_stat_add_local(mem, MEM_CGROUP_STAT_SWAPOUT, val);
-}
-
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- bool charge)
-{
- int val = (charge) ? 1 : -1;
- struct mem_cgroup_stat *stat = &mem->stat;
- struct mem_cgroup_stat_cpu *cstat;
- int cpu = get_cpu();
- /* for fast access, we use open-coded manner */
- cstat = &stat->cpustat[cpu];
- if (PageCgroupCache(pc))
- __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_CACHE, val);
- else
- __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_RSS, val);
-
- if (charge)
- __mem_cgroup_stat_add_local(cstat,
- MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
- else
- __mem_cgroup_stat_add_local(cstat,
- MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
- __mem_cgroup_stat_add_local(cstat, MEM_CGROUP_STAT_EVENTS, 1);
- put_cpu();
-}
-
/*
* Call callback function against all cgroup under hierarchy tree.
@@ -1305,34 +1366,6 @@ static void record_last_oom(struct mem_c
mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
}
-/*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
- */
-void mem_cgroup_update_mapped_file_stat(struct page *page, int val)
-{
- struct mem_cgroup *mem;
- struct page_cgroup *pc;
-
- if (!page_is_file_cache(page))
- return;
-
- pc = lookup_page_cgroup(page);
- if (unlikely(!pc))
- return;
-
- lock_page_cgroup(pc);
- mem = pc->mem_cgroup;
- if (!mem)
- goto done;
-
- if (!PageCgroupUsed(pc))
- goto done;
-
- mem_cgroup_stat_add_local(mem, MEM_CGROUP_STAT_MAPPED_FILE, val);
-done:
- unlock_page_cgroup(pc);
-}
#define CHARGE_SIZE (64 * PAGE_SIZE)
struct memcg_stock_pcp {
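
The comments added above describe the soft-limit bookkeeping: every charge/uncharge bumps a per-cpu event counter, and only when that counter crosses SOFTLIMIT_EVENTS_THRESH is the excess recomputed and the RB-tree updated. A standalone sketch of that check-every-N-events idea; a single counter stands in for the per-cpu one, and everything except the threshold name is invented for the demo.

/*
 * Standalone sketch of the "re-check the soft limit only every N
 * events" scheme. The real code keeps the event counter per cpu;
 * a single counter stands in for it here.
 */
#include <stdio.h>
#include <stdbool.h>

#define SOFTLIMIT_EVENTS_THRESH (1000)

struct demo_memcg {
        long events;            /* bumped on every charge/uncharge */
        long usage;
        long soft_limit;
};

/* returns true when it is time to re-evaluate the soft-limit tree */
static bool soft_limit_check(struct demo_memcg *mem)
{
        if (++mem->events < SOFTLIMIT_EVENTS_THRESH)
                return false;
        mem->events = 0;        /* rearm for the next batch of events */
        return true;
}

static void charge_one_page(struct demo_memcg *mem)
{
        mem->usage++;
        if (soft_limit_check(mem) && mem->usage > mem->soft_limit)
                printf("usage %ld exceeds soft limit %ld: update RB-tree\n",
                       mem->usage, mem->soft_limit);
}

int main(void)
{
        struct demo_memcg mem = { .soft_limit = 500 };
        int i;

        for (i = 0; i < 2500; i++)      /* the tree is touched twice, not 2500 times */
                charge_one_page(&mem);
        return 0;
}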
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC][PATCH 11/11][mmotm] memcg: more commentary and clean up
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (9 preceding siblings ...)
2009-09-18 9:04 ` [RFC][PATCH 10/11][mmotm] memcg: clean up percpu and more commentary for soft limit KAMEZAWA Hiroyuki
@ 2009-09-18 9:06 ` KAMEZAWA Hiroyuki
2009-09-18 10:37 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) Daisuke Nishimura
11 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-18 9:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp
This patch itself should be sorted out ;)
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This patch does
- move mem_cgroup_move_lists() before the swap-cache-specific LRU functions.
- move get_swappiness() near the other functions related to vmscan logic.
- group the oom-killer functions.
- add some commentary.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 144 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 85 insertions(+), 59 deletions(-)
Index: mmotm-2.6.31-Sep17/mm/memcontrol.c
===================================================================
--- mmotm-2.6.31-Sep17.orig/mm/memcontrol.c
+++ mmotm-2.6.31-Sep17/mm/memcontrol.c
@@ -853,6 +853,15 @@ void mem_cgroup_add_lru_list(struct page
list_add(&pc->lru, &mz->lists[lru]);
}
+void mem_cgroup_move_lists(struct page *page,
+ enum lru_list from, enum lru_list to)
+{
+ if (mem_cgroup_disabled())
+ return;
+ mem_cgroup_del_lru_list(page, from);
+ mem_cgroup_add_lru_list(page, to);
+}
+
/*
* At handling SwapCache, pc->mem_cgroup may be changed while it's linked to
* lru because the page may.be reused after it's fully uncharged (because of
@@ -889,15 +898,10 @@ static void mem_cgroup_lru_add_after_com
spin_unlock_irqrestore(&zone->lru_lock, flags);
}
-
-void mem_cgroup_move_lists(struct page *page,
- enum lru_list from, enum lru_list to)
-{
- if (mem_cgroup_disabled())
- return;
- mem_cgroup_del_lru_list(page, from);
- mem_cgroup_add_lru_list(page, to);
-}
+/*
+ * Check whether a task is under a mem_cgroup. Because we do hierarchical
+ * accounting, we have to check whether one of its ancestors is "mem" or not.
+ */
int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
{
@@ -920,7 +924,7 @@ int task_in_mem_cgroup(struct task_struc
}
/*
- * prev_priority control...this will be used in memory reclaim path.
+ * Functions for LRU management called by vmscan.c
*/
int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
{
@@ -948,7 +952,13 @@ void mem_cgroup_record_reclaim_priority(
spin_unlock(&mem->reclaim_param_lock);
}
-static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
+/*
+ * The inactive ratio is a parameter for what ratio of pages should be on
+ * the inactive list. It is used by the memory reclaim code (see vmscan.c);
+ * the generic zone's inactive_ratio is calculated in page_alloc.c.
+ */
+static int
+calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages)
{
unsigned long active;
unsigned long inactive;
@@ -971,7 +981,10 @@ static int calc_inactive_ratio(struct me
return inactive_ratio;
}
-
+/*
+ * If inactive_xxx runs short, active_xxx will be scanned and
+ * rotation occurs.
+ */
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
unsigned long active;
@@ -1037,6 +1050,28 @@ mem_cgroup_get_reclaim_stat_from_page(st
return &mz->reclaim_stat;
}
+
+static unsigned int get_swappiness(struct mem_cgroup *memcg)
+{
+ struct cgroup *cgrp = memcg->css.cgroup;
+ unsigned int swappiness;
+
+ /* root ? */
+ if (cgrp->parent == NULL)
+ return vm_swappiness;
+
+ spin_lock(&memcg->reclaim_param_lock);
+ swappiness = memcg->swappiness;
+ spin_unlock(&memcg->reclaim_param_lock);
+
+ return swappiness;
+}
+
+/*
+ * Called by the shrink_xxxx_list functions to grab pages as reclaim targets.
+ * Please see isolate_lru_pages() in mm/vmscan.c.
+ */
+
unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
@@ -1092,7 +1127,7 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
-
+/* check whether we hit the mem->res or mem->memsw hard limit */
static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
{
if (do_swap_account) {
@@ -1104,32 +1139,41 @@ static bool mem_cgroup_check_under_limit
return true;
return false;
}
-
-static unsigned int get_swappiness(struct mem_cgroup *memcg)
+/*
+ * OOM-Killer related stuff.
+ */
+bool mem_cgroup_oom_called(struct task_struct *task)
{
- struct cgroup *cgrp = memcg->css.cgroup;
- unsigned int swappiness;
-
- /* root ? */
- if (cgrp->parent == NULL)
- return vm_swappiness;
-
- spin_lock(&memcg->reclaim_param_lock);
- swappiness = memcg->swappiness;
- spin_unlock(&memcg->reclaim_param_lock);
+ bool ret = false;
+ struct mem_cgroup *mem;
+ struct mm_struct *mm;
- return swappiness;
+ rcu_read_lock();
+ mm = task->mm;
+ if (!mm)
+ mm = &init_mm;
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
+ ret = true;
+ rcu_read_unlock();
+ return ret;
}
-static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
+static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
{
- int *val = data;
- (*val)++;
+ mem->last_oom_jiffies = jiffies;
return 0;
}
+static void record_last_oom(struct mem_cgroup *mem)
+{
+ mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
+}
+
+
/**
- * mem_cgroup_print_mem_info: Called from OOM with tasklist_lock held in read mode.
+ * mem_cgroup_print_mem_info: Called from OOM with tasklist_lock held in
+ * read mode.
* @memcg: The memory cgroup that went over limit
* @p: Task that is going to be killed
*
@@ -1195,6 +1239,14 @@ done:
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
}
+
+
+static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
+{
+ int *val = data;
+ (*val)++;
+ return 0;
+}
/*
* This function returns the number of memcg under hierarchy tree. Returns
* 1(self count) if no children.
@@ -1338,35 +1390,9 @@ static int mem_cgroup_hierarchical_recla
return total;
}
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
- bool ret = false;
- struct mem_cgroup *mem;
- struct mm_struct *mm;
-
- rcu_read_lock();
- mm = task->mm;
- if (!mm)
- mm = &init_mm;
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
- ret = true;
- rcu_read_unlock();
- return ret;
-}
-
-static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
-{
- mem->last_oom_jiffies = jiffies;
- return 0;
-}
-
-static void record_last_oom(struct mem_cgroup *mem)
-{
- mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
-}
-
-
+/*
+ * For batched charging.
+ */
#define CHARGE_SIZE (64 * PAGE_SIZE)
struct memcg_stock_pcp {
struct mem_cgroup *cached;
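
mem_cgroup_oom_called(), regrouped above, reports an OOM as recent when less than HZ/10 has passed since last_oom_jiffies, using time_before() so the comparison stays correct across jiffies wraparound. A standalone sketch of that check follows; the time_before() definition mirrors the kernel's signed-difference trick, and HZ plus the sample tick values are made up for the demo.

/*
 * Standalone sketch of the jiffies comparison in mem_cgroup_oom_called():
 * "did the last OOM happen within the last HZ/10 ticks?", written so the
 * comparison stays correct when the tick counter wraps around.
 * time_before() mirrors the kernel's signed-difference trick; HZ and the
 * sample values are made up for the demo.
 */
#include <stdio.h>
#include <stdbool.h>

#define HZ 1000

/* true if a is before b, even across counter wraparound */
#define time_before(a, b) ((long)((a) - (b)) < 0)

static bool oom_called_recently(unsigned long now, unsigned long last_oom)
{
        return time_before(now, last_oom + HZ / 10);
}

int main(void)
{
        unsigned long last = 5000;

        printf("%d\n", oom_called_recently(5050, last)); /* 1: 50 ticks ago */
        printf("%d\n", oom_called_recently(5200, last)); /* 0: 200 ticks ago */

        /* wraparound: last OOM happened just before the counter wrapped */
        last = (unsigned long)-20;
        printf("%d\n", oom_called_recently(30, last));   /* 1: only 50 ticks ago */
        return 0;
}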
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18)
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
` (10 preceding siblings ...)
2009-09-18 9:06 ` [RFC][PATCH 11/11][mmotm] memcg: more commentary and clean up KAMEZAWA Hiroyuki
@ 2009-09-18 10:37 ` Daisuke Nishimura
11 siblings, 0 replies; 24+ messages in thread
From: Daisuke Nishimura @ 2009-09-18 10:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
nishimura@mxp.nes.nec.co.jp, d-nishimura
On Fri, 18 Sep 2009 17:47:57 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Posting just for dumping my stack, plz see if you have time.
> (will repost, this set is not for any merge)
>
> Because my office is closed until next Thursday, my RTT will be long for a while.
>
> Patches are mainly in 3 parts.
> - soft-limit modification (1,2)
> - coalescing chages (3,4)
> - cleanups. (5-11)
>
> These days, I feel I have to make memcontrol.c cleaner.
> Some comments are old and the placement of functions is random.
>
> Patches are still messy but plz see the applied result if you are interested.
>
> 1. fix up softlimit's uncharge path
> 2. fix up softlimit's charge path
> 3. coalescing uncharge path
> 4. coalescing charge path
> 5. memcg_charge_cancel ....from Nishimura's set. this is very nice.
Thank you for including this one.
I'll leave this patch to you.
> 6. clean up percpu statistics of memcg.
> 7. clean up mem_cgroup_from_xxxx functions.
> 8. adds commentary and remove unused macros.
> 9. clean up for mem_cgroup's per-zone stat
> 10. adds commentary for soft-limit and moves functions for per-cpu
> 11. misc. commentary and function replacement...not sorted out well.
>
> Patches 6-11 sound like bad news for Nishimura-san, but I guess
> you won't have any heavy hunks to deal with...
>
Don't worry, do it as you like :)
I've read through these patches briefly, and I don't think it will be so hard
to re-base my patches on them. And they are good cleanups, IMHO.
Thanks,
Daisuke Nishimura.
^ permalink raw reply [flat|nested] 24+ messages in thread
Thread overview: 24+ messages
2009-09-09 8:39 [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 KAMEZAWA Hiroyuki
2009-09-09 8:41 ` [RFC][PATCH 1/4][mmotm] memcg: soft limit clean up KAMEZAWA Hiroyuki
[not found] ` <661de9470909090410t160454a2k658c980b92d11612@mail.gmail.com>
2009-09-10 0:10 ` KAMEZAWA Hiroyuki
2009-09-09 8:41 ` [RFC][PATCH 2/4][mmotm] clean up charge path of softlimit KAMEZAWA Hiroyuki
2009-09-09 8:44 ` [RFC][PATCH 3/4][mmotm] memcg: batched uncharge KAMEZAWA Hiroyuki
2009-09-09 8:45 ` [RFC][PATCH 4/4][mmotm] memcg: coalescing charge KAMEZAWA Hiroyuki
2009-09-12 4:58 ` Daisuke Nishimura
2009-09-15 0:09 ` KAMEZAWA Hiroyuki
2009-09-09 20:30 ` [RFC][PATCH 0/4][mmotm] memcg: reduce lock contention v3 Balbir Singh
2009-09-10 0:20 ` KAMEZAWA Hiroyuki
2009-09-10 5:18 ` Balbir Singh
2009-09-18 8:47 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) KAMEZAWA Hiroyuki
2009-09-18 8:50 ` [RFC][PATCH 1/11] memcg: clean up softlimit uncharge KAMEZAWA Hiroyuki
2009-09-18 8:52 ` [RFC][PATCH 2/11]memcg: reduce res_counter_soft_limit_excess KAMEZAWA Hiroyuki
2009-09-18 8:53 ` [RFC][PATCH 3/11] memcg: coalescing uncharge KAMEZAWA Hiroyuki
2009-09-18 8:54 ` [RFC][PATCH 4/11] memcg: coalescing charge KAMEZAWA Hiroyuki
2009-09-18 8:55 ` [RFC][PATCH 5/11] memcg: clean up cancel charge KAMEZAWA Hiroyuki
2009-09-18 8:57 ` [RFC][PATCH 6/11] memcg: cleaun up percpu statistics KAMEZAWA Hiroyuki
2009-09-18 8:58 ` [RFC][PATCH 7/11] memcg: rename from_cont to from_cgroup KAMEZAWA Hiroyuki
2009-09-18 9:00 ` [RFC][PATCH 8/11]memcg: remove unused macro and adds commentary KAMEZAWA Hiroyuki
2009-09-18 9:01 ` [RFC][PATCH 9/11]memcg: clean up zonestat funcs KAMEZAWA Hiroyuki
2009-09-18 9:04 ` [RFC][PATCH 10/11][mmotm] memcg: clean up percpu and more commentary for soft limit KAMEZAWA Hiroyuki
2009-09-18 9:06 ` [RFC][PATCH 11/11][mmotm] memcg: more commentary and clean up KAMEZAWA Hiroyuki
2009-09-18 10:37 ` [RFC][PATCH 0/11][mmotm] memcg: patch dump (Sep/18) Daisuke Nishimura