* [PATCH 1/4] [BUGFIX] enhance charge_statistics function for fixing issues
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:06 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
mem_cgroup_charge_statistics() was designed for charging a single page, but
now we have transparent hugepages. To fix problems (in the following patches),
the function must be changed to take the number of pages as an argument.
The new function takes the following arguments:
- the type of page rather than 'pc'
- the number of pages being accounted
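For illustration, the new signature is meant to be used like this once THP
call sites pass real page counts (a sketch; in this patch all callers still
pass 1 or -1, and hugepage counts only arrive later in the series):

	/* charge one anonymous page */
	mem_cgroup_charge_statistics(mem, false, 1);
	/* uncharge a 2MB hugepage, i.e. 512 base pages */
	mem_cgroup_charge_statistics(mem, false, -512);

The sign encodes charge vs. uncharge; the magnitude is the number of pages.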
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -600,23 +600,23 @@ static void mem_cgroup_swap_statistics(s
}
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- bool charge)
+ bool file,
+ int pages)
{
- int val = (charge) ? 1 : -1;
-
preempt_disable();
- if (PageCgroupCache(pc))
- __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], val);
+ if (file)
+ __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], pages);
else
- __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
+ __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], pages);
- if (charge)
+ /* pagein of a big page is an event. So, ignore page size */
+ if (pages > 0)
__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
else
__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
+
+ __this_cpu_add(mem->stat->count[MEM_CGROUP_EVENTS], pages);
preempt_enable();
}
@@ -2092,6 +2092,7 @@ static void ____mem_cgroup_commit_charge
struct page_cgroup *pc,
enum charge_type ctype)
{
+ bool file = false;
pc->mem_cgroup = mem;
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
@@ -2106,6 +2107,7 @@ static void ____mem_cgroup_commit_charge
case MEM_CGROUP_CHARGE_TYPE_SHMEM:
SetPageCgroupCache(pc);
SetPageCgroupUsed(pc);
+ file = true;
break;
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
ClearPageCgroupCache(pc);
@@ -2115,7 +2117,7 @@ static void ____mem_cgroup_commit_charge
break;
}
- mem_cgroup_charge_statistics(mem, pc, true);
+ mem_cgroup_charge_statistics(mem, file, 1);
}
static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
@@ -2186,14 +2188,14 @@ static void __mem_cgroup_move_account(st
__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
preempt_enable();
}
- mem_cgroup_charge_statistics(from, pc, false);
+ mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
mem_cgroup_cancel_charge(from, PAGE_SIZE);
/* caller should have done css_get */
pc->mem_cgroup = to;
- mem_cgroup_charge_statistics(to, pc, true);
+ mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
/*
* We charges against "to" which may not have any tasks. Then, "to"
* can be under rmdir(). But in current implementation, caller of
@@ -2551,6 +2553,7 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
int page_size = PAGE_SIZE;
+ bool file = false;
if (mem_cgroup_disabled())
return NULL;
@@ -2578,6 +2581,9 @@ __mem_cgroup_uncharge_common(struct page
if (!PageCgroupUsed(pc))
goto unlock_out;
+ if (PageCgroupCache(pc))
+ file = true;
+
switch (ctype) {
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
case MEM_CGROUP_CHARGE_TYPE_DROP:
@@ -2597,7 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
}
for (i = 0; i < count; i++)
- mem_cgroup_charge_statistics(mem, pc + i, false);
+ mem_cgroup_charge_statistics(mem, file, -1);
ClearPageCgroupUsed(pc);
/*
* Re: [PATCH 1/4] [BUGFIX] enhance charge_statistics function for fixing issues
From: Johannes Weiner @ 2011-01-14 11:51 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:06:44PM +0900, KAMEZAWA Hiroyuki wrote:
> mem_cgroup_charge_statistics() was designed for charging a single page, but
> now we have transparent hugepages. To fix problems (in the following patches),
> the function must be changed to take the number of pages as an argument.
>
> The new function takes the following arguments:
> - the type of page rather than 'pc'
> - the number of pages being accounted
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I agree with the patch in general; below are only a few nitpicks.
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -600,23 +600,23 @@ static void mem_cgroup_swap_statistics(s
> }
>
> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> - struct page_cgroup *pc,
> - bool charge)
> + bool file,
> + int pages)
I think 'nr_pages' would be a better name. This makes me think of a
'struct page *[]'.
> {
> - int val = (charge) ? 1 : -1;
> -
> preempt_disable();
>
> - if (PageCgroupCache(pc))
> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], val);
> + if (file)
> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], pages);
> else
> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], pages);
>
> - if (charge)
> + /* pagein of a big page is an event. So, ignore page size */
> + if (pages > 0)
> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
> else
> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
> - __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
> +
> + __this_cpu_add(mem->stat->count[MEM_CGROUP_EVENTS], pages);
>
> preempt_enable();
> }
> @@ -2092,6 +2092,7 @@ static void ____mem_cgroup_commit_charge
> struct page_cgroup *pc,
> enum charge_type ctype)
> {
> + bool file = false;
> pc->mem_cgroup = mem;
> /*
> * We access a page_cgroup asynchronously without lock_page_cgroup().
> @@ -2106,6 +2107,7 @@ static void ____mem_cgroup_commit_charge
> case MEM_CGROUP_CHARGE_TYPE_SHMEM:
> SetPageCgroupCache(pc);
> SetPageCgroupUsed(pc);
> + file = true;
> break;
> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> ClearPageCgroupCache(pc);
> @@ -2115,7 +2117,7 @@ static void ____mem_cgroup_commit_charge
> break;
> }
>
> - mem_cgroup_charge_statistics(mem, pc, true);
> + mem_cgroup_charge_statistics(mem, file, 1);
The extra local variable is a bit awkward, since there are already
several sources of this information (ctype and pc->flags).
Could you keep it like the other sites, just pass PageCgroupCache()
here as well?
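I.e. something like (a sketch of the suggestion):

	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), 1);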
> @@ -2186,14 +2188,14 @@ static void __mem_cgroup_move_account(st
> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> preempt_enable();
> }
> - mem_cgroup_charge_statistics(from, pc, false);
> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
> if (uncharge)
> /* This is not "cancel", but cancel_charge does all we need. */
> mem_cgroup_cancel_charge(from, PAGE_SIZE);
>
> /* caller should have done css_get */
> pc->mem_cgroup = to;
> - mem_cgroup_charge_statistics(to, pc, true);
> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
> /*
> * We charges against "to" which may not have any tasks. Then, "to"
> * can be under rmdir(). But in current implementation, caller of
> @@ -2551,6 +2553,7 @@ __mem_cgroup_uncharge_common(struct page
> struct page_cgroup *pc;
> struct mem_cgroup *mem = NULL;
> int page_size = PAGE_SIZE;
> + bool file = false;
>
> if (mem_cgroup_disabled())
> return NULL;
> @@ -2578,6 +2581,9 @@ __mem_cgroup_uncharge_common(struct page
> if (!PageCgroupUsed(pc))
> goto unlock_out;
>
> + if (PageCgroupCache(pc))
> + file = true;
> +
> switch (ctype) {
> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> case MEM_CGROUP_CHARGE_TYPE_DROP:
> @@ -2597,7 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
> }
>
> for (i = 0; i < count; i++)
> - mem_cgroup_charge_statistics(mem, pc + i, false);
> + mem_cgroup_charge_statistics(mem, file, -1);
I see you get rid of this loop in the next patch, anyway. Can you
just use PageCgroupCache() instead of the extra variable?
* Re: [PATCH 1/4] [BUGFIX] enhance charge_statistics function for fixing issues
From: Hiroyuki Kamezawa @ 2011-01-14 12:07 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:06:44PM +0900, KAMEZAWA Hiroyuki wrote:
>> mem_cgroup_charge_statistics() was designed for charging a single page, but
>> now we have transparent hugepages. To fix problems (in the following patches),
>> the function must be changed to take the number of pages as an argument.
>>
>> The new function takes the following arguments:
>> - the type of page rather than 'pc'
>> - the number of pages being accounted
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> I agree with the patch in general; below are only a few nitpicks.
>
Thanks. I think the details should be updated, too.
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>> @@ -600,23 +600,23 @@ static void mem_cgroup_swap_statistics(s
>> }
>>
>> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>> - struct page_cgroup *pc,
>> - bool charge)
>> + bool file,
>> + int pages)
>
> I think 'nr_pages' would be a better name. This makes me think of a
> 'struct page *[]'.
>
ok.
>> {
>> - int val = (charge) ? 1 : -1;
>> -
>> preempt_disable();
>>
>> - if (PageCgroupCache(pc))
>> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], val);
>> + if (file)
>> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], pages);
>> else
>> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
>> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], pages);
>>
>> - if (charge)
>> + /* pagein of a big page is an event. So, ignore page size */
>> + if (pages > 0)
>> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
>> else
>> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
>> - __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
>> +
>> + __this_cpu_add(mem->stat->count[MEM_CGROUP_EVENTS], pages);
>>
>> preempt_enable();
>> }
>> @@ -2092,6 +2092,7 @@ static void ____mem_cgroup_commit_charge
>> struct page_cgroup *pc,
>> enum charge_type ctype)
>> {
>> + bool file = false;
>> pc->mem_cgroup = mem;
>> /*
>> * We access a page_cgroup asynchronously without lock_page_cgroup().
>> @@ -2106,6 +2107,7 @@ static void ____mem_cgroup_commit_charge
>> case MEM_CGROUP_CHARGE_TYPE_SHMEM:
>> SetPageCgroupCache(pc);
>> SetPageCgroupUsed(pc);
>> + file = true;
>> break;
>> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
>> ClearPageCgroupCache(pc);
>> @@ -2115,7 +2117,7 @@ static void ____mem_cgroup_commit_charge
>> break;
>> }
>>
>> - mem_cgroup_charge_statistics(mem, pc, true);
>> + mem_cgroup_charge_statistics(mem, file, 1);
>
> The extra local variable is a bit awkward, since there are already
> several sources of this information (ctype and pc->flags).
>
> Could you keep it like the other sites, just pass PageCgroupCache()
> here as well?
>
Ok.
>> @@ -2186,14 +2188,14 @@ static void __mem_cgroup_move_account(st
>> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> preempt_enable();
>> }
>> - mem_cgroup_charge_statistics(from, pc, false);
>> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
>> if (uncharge)
>> /* This is not "cancel", but cancel_charge does all we need. */
>> mem_cgroup_cancel_charge(from, PAGE_SIZE);
>>
>> /* caller should have done css_get */
>> pc->mem_cgroup = to;
>> - mem_cgroup_charge_statistics(to, pc, true);
>> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
>> /*
>> * We charges against "to" which may not have any tasks. Then, "to"
>> * can be under rmdir(). But in current implementation, caller of
>> @@ -2551,6 +2553,7 @@ __mem_cgroup_uncharge_common(struct page
>> struct page_cgroup *pc;
>> struct mem_cgroup *mem = NULL;
>> int page_size = PAGE_SIZE;
>> + bool file = false;
>>
>> if (mem_cgroup_disabled())
>> return NULL;
>> @@ -2578,6 +2581,9 @@ __mem_cgroup_uncharge_common(struct page
>> if (!PageCgroupUsed(pc))
>> goto unlock_out;
>>
>> + if (PageCgroupCache(pc))
>> + file = true;
>> +
>> switch (ctype) {
>> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
>> case MEM_CGROUP_CHARGE_TYPE_DROP:
>> @@ -2597,7 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
>> }
>>
>> for (i = 0; i < count; i++)
>> - mem_cgroup_charge_statistics(mem, pc + i, false);
>> + mem_cgroup_charge_statistics(mem, file, -1);
>
> I see you get rid of this loop in the next patch, anyway. Can you
> just use PageCgroupCache() instead of the extra variable?
will do.
Thanks,
-Kame
* [PATCH 2/4] [BUGFIX] dont set USED bit on tail pages
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:09 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, under THP:
at charge:
- the PageCgroupUsed bit is set on every page_cgroup of a hugepage,
  i.e. on all 512 base pages.
at uncharge:
- the PageCgroupUsed bit is unset only on the head page.
So the tail pages remain with the "Used" bit set.
This patch fixes this: the Used bit is now set only on the head page,
and Used bits for tail pages will be set at split time if necessary.
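To illustrate the imbalance (a pseudo-code sketch, assuming a 2MB hugepage
backed by 512 base pages):

	/* at charge, before this patch: */
	for (i = 0; i < 512; i++)
		SetPageCgroupUsed(pc + i);	/* all 512 marked Used */
	/* at uncharge: */
	ClearPageCgroupUsed(pc);		/* only pc[0] is cleared */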
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 4 ++
mm/huge_memory.c | 2 +
mm/memcontrol.c | 80 ++++++++++++++++++++++-----------------------
3 files changed, 46 insertions(+), 40 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2084,15 +2084,28 @@ struct mem_cgroup *try_get_mem_cgroup_fr
return mem;
}
-/*
- * commit a charge got by __mem_cgroup_try_charge() and makes page_cgroup to be
- * USED state. If already USED, uncharge and return.
- */
-static void ____mem_cgroup_commit_charge(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- enum charge_type ctype)
+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ enum charge_type ctype,
+ int page_size)
{
- bool file = false;
+ int count = page_size >> PAGE_SHIFT;
+
+ /* try_charge() can return NULL to *memcg, taking care of it. */
+ if (!mem)
+ return;
+
+ lock_page_cgroup(pc);
+ if (unlikely(PageCgroupUsed(pc))) {
+ unlock_page_cgroup(pc);
+ mem_cgroup_cancel_charge(mem, page_size);
+ return;
+ }
+
+ /*
+ * we don't need page_cgroup_lock about tail pages, because they are not
+ * accessed by any other context at this point.
+ */
pc->mem_cgroup = mem;
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
@@ -2107,7 +2120,6 @@ static void ____mem_cgroup_commit_charge
case MEM_CGROUP_CHARGE_TYPE_SHMEM:
SetPageCgroupCache(pc);
SetPageCgroupUsed(pc);
- file = true;
break;
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
ClearPageCgroupCache(pc);
@@ -2117,34 +2129,7 @@ static void ____mem_cgroup_commit_charge
break;
}
- mem_cgroup_charge_statistics(mem, file, 1);
-}
-
-static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- enum charge_type ctype,
- int page_size)
-{
- int i;
- int count = page_size >> PAGE_SHIFT;
-
- /* try_charge() can return NULL to *memcg, taking care of it. */
- if (!mem)
- return;
-
- lock_page_cgroup(pc);
- if (unlikely(PageCgroupUsed(pc))) {
- unlock_page_cgroup(pc);
- mem_cgroup_cancel_charge(mem, page_size);
- return;
- }
-
- /*
- * we don't need page_cgroup_lock about tail pages, becase they are not
- * accessed by any other context at this point.
- */
- for (i = 0; i < count; i++)
- ____mem_cgroup_commit_charge(mem, pc + i, ctype);
+ mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), count);
unlock_page_cgroup(pc);
/*
@@ -2154,6 +2139,23 @@ static void __mem_cgroup_commit_charge(s
*/
memcg_check_events(mem, pc->page);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Because tail pages are not mared as "used", set it. We're under
+ * compund_lock and don't need to take care of races.
+ * Statistics are updated properly at charging. We just mark Used bits.
+ */
+void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
+{
+ struct page_cgroup *hpc = lookup_page_cgroup(head);
+ struct page_cgroup *tpc = lookup_page_cgroup(tail);
+
+ tpc->mem_cgroup = hpc->mem_cgroup;
+ smp_wmb(); /* see __commit_charge() */
+ SetPageCgroupUsed(tpc);
+ VM_BUG_ON(PageCgroupCache(hpc));
+}
+#endif
/**
* __mem_cgroup_move_account - move account of the page
@@ -2548,7 +2550,6 @@ direct_uncharge:
static struct mem_cgroup *
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
- int i;
int count;
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
@@ -2602,8 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- for (i = 0; i < count; i++)
- mem_cgroup_charge_statistics(mem, file, -1);
+ mem_cgroup_charge_statistics(mem, file, -count);
ClearPageCgroupUsed(pc);
/*
Index: mmotm-0107/include/linux/memcontrol.h
===================================================================
--- mmotm-0107.orig/include/linux/memcontrol.h
+++ mmotm-0107/include/linux/memcontrol.h
@@ -146,6 +146,10 @@ unsigned long mem_cgroup_soft_limit_recl
gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
+#endif
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
Index: mmotm-0107/mm/huge_memory.c
===================================================================
--- mmotm-0107.orig/mm/huge_memory.c
+++ mmotm-0107/mm/huge_memory.c
@@ -1203,6 +1203,8 @@ static void __split_huge_page_refcount(s
BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));
+ mem_cgroup_split_huge_fixup(page, page_tail);
+
lru_add_page_tail(zone, page, page_tail);
}
* Re: [PATCH 2/4] [BUGFIX] dont set USED bit on tail pages
From: Johannes Weiner @ 2011-01-14 12:09 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:09:09PM +0900, KAMEZAWA Hiroyuki wrote:
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -2154,6 +2139,23 @@ static void __mem_cgroup_commit_charge(s
> */
> memcg_check_events(mem, pc->page);
> }
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Because tail pages are not mared as "used", set it. We're under
marked
> + * compund_lock and don't need to take care of races.
> + * Statistics are updated properly at charging. We just mark Used bits.
> + */
> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
> +{
> + struct page_cgroup *hpc = lookup_page_cgroup(head);
> + struct page_cgroup *tpc = lookup_page_cgroup(tail);
I have trouble reading the code fluently with those names as they are
just very similar random letter sequences. Could you rename them so
that they're better to discriminate? headpc and tailpc perhaps?
> + tpc->mem_cgroup = hpc->mem_cgroup;
> + smp_wmb(); /* see __commit_charge() */
> + SetPageCgroupUsed(tpc);
> + VM_BUG_ON(PageCgroupCache(hpc));
Right now, this would be a bug due to other circumstances, but this
function does not require the page to be anon to function correctly,
does it? I don't think we should encode a made up dependency here.
> @@ -2602,8 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - for (i = 0; i < count; i++)
> - mem_cgroup_charge_statistics(mem, file, -1);
> + mem_cgroup_charge_statistics(mem, file, -count);
Pass PageCgroupCache(pc) instead, ditch the `file' variable?
> ClearPageCgroupUsed(pc);
> /*
> Index: mmotm-0107/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0107.orig/include/linux/memcontrol.h
> +++ mmotm-0107/include/linux/memcontrol.h
> @@ -146,6 +146,10 @@ unsigned long mem_cgroup_soft_limit_recl
> gfp_t gfp_mask);
> u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> +#endif
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> Index: mmotm-0107/mm/huge_memory.c
> ===================================================================
> --- mmotm-0107.orig/mm/huge_memory.c
> +++ mmotm-0107/mm/huge_memory.c
> @@ -1203,6 +1203,8 @@ static void __split_huge_page_refcount(s
> BUG_ON(!PageDirty(page_tail));
> BUG_ON(!PageSwapBacked(page_tail));
>
> + mem_cgroup_split_huge_fixup(page, page_tail);
You need to provide a dummy for non-memcg configurations.
* Re: [PATCH 2/4] [BUGFIX] dont set USED bit on tail pages
From: Hiroyuki Kamezawa @ 2011-01-14 12:23 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
Thank you for the review.
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:09:09PM +0900, KAMEZAWA Hiroyuki wrote:
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>
>> @@ -2154,6 +2139,23 @@ static void __mem_cgroup_commit_charge(s
>> */
>> memcg_check_events(mem, pc->page);
>> }
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +/*
>> + * Because tail pages are not mared as "used", set it. We're under
>
> marked
>
will fix.
>> + * compund_lock and don't need to take care of races.
>> + * Statistics are updated properly at charging. We just mark Used bits.
>> + */
>> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
>> +{
>> + struct page_cgroup *hpc = lookup_page_cgroup(head);
>> + struct page_cgroup *tpc = lookup_page_cgroup(tail);
>
> I have trouble reading the code fluently with those names as they are
> just very similar random letter sequences. Could you rename them so
> that they're better to discriminate? headpc and tailpc perhaps?
>
ok. I'll use headpc,tailpc.
>> + tpc->mem_cgroup = hpc->mem_cgroup;
>> + smp_wmb(); /* see __commit_charge() */
>> + SetPageCgroupUsed(tpc);
>> + VM_BUG_ON(PageCgroupCache(hpc));
>
> Right now, this would be a bug due to other circumstances, but this
> function does not require the page to be anon to function correctly,
> does it?
No.
> I don't think we should encode a made up dependency here.
>
Ok, I'll remove BUG_ON.
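Folding in both points, the helper could end up roughly like this in the
next version (a sketch, not the actual v2 patch):

void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
{
	struct page_cgroup *headpc = lookup_page_cgroup(head);
	struct page_cgroup *tailpc = lookup_page_cgroup(tail);

	/* copy the owning cgroup from the head to the tail page_cgroup */
	tailpc->mem_cgroup = headpc->mem_cgroup;
	smp_wmb();	/* see __commit_charge() */
	SetPageCgroupUsed(tailpc);
}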
>> @@ -2602,8 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
>> break;
>> }
>>
>> - for (i = 0; i < count; i++)
>> - mem_cgroup_charge_statistics(mem, file, -1);
>> + mem_cgroup_charge_statistics(mem, file, -count);
>
> Pass PageCgroupCache(pc) instead, ditch the `file' variable?
>
will do.
>> ClearPageCgroupUsed(pc);
>> /*
>> Index: mmotm-0107/include/linux/memcontrol.h
>> ===================================================================
>> --- mmotm-0107.orig/include/linux/memcontrol.h
>> +++ mmotm-0107/include/linux/memcontrol.h
>> @@ -146,6 +146,10 @@ unsigned long mem_cgroup_soft_limit_recl
>> gfp_t gfp_mask);
>> u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
>> +#endif
>> +
>> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>> struct mem_cgroup;
>>
>> Index: mmotm-0107/mm/huge_memory.c
>> ===================================================================
>> --- mmotm-0107.orig/mm/huge_memory.c
>> +++ mmotm-0107/mm/huge_memory.c
>> @@ -1203,6 +1203,8 @@ static void __split_huge_page_refcount(s
>> BUG_ON(!PageDirty(page_tail));
>> BUG_ON(!PageSwapBacked(page_tail));
>>
>> + mem_cgroup_split_huge_fixup(page, page_tail);
>
> You need to provide a dummy for non-memcg configurations.
Ahhh, yes. I'll add one.
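Such a stub would presumably follow the usual pattern on the
!CONFIG_CGROUP_MEM_RES_CTLR side of memcontrol.h (a sketch):

static inline void mem_cgroup_split_huge_fixup(struct page *head,
					       struct page *tail)
{
}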
Thanks,
-Kame
* [PATCH 3/4] [BUGFIX] fix memcgroup LRU stat with THP
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:10 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The memory cgroup's LRU statistics should take the size of pages into
account, because Transparent Hugepage inserts a hugepage into the LRU and
the zone counter is updated based on the size of the page.
If this number is wrong, memory reclaim will not work well.
Note: only the head page of a THP hugepage is linked into the LRU.
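(For scale: a 2MB hugepage with 4KB base pages has compound_order(page) == 9,
so each head-page LRU operation must move the counter by 1 << 9 == 512.)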
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -815,7 +815,10 @@ void mem_cgroup_del_lru_list(struct page
* removed from global LRU.
*/
mz = page_cgroup_zoneinfo(pc);
- MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ if (!PageTransHuge(page))
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ else
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
if (mem_cgroup_is_root(pc->mem_cgroup))
return;
VM_BUG_ON(list_empty(&pc->lru));
@@ -866,7 +869,10 @@ void mem_cgroup_add_lru_list(struct page
return;
mz = page_cgroup_zoneinfo(pc);
- MEM_CGROUP_ZSTAT(mz, lru) += 1;
+ if (!PageTransHuge(page))
+ MEM_CGROUP_ZSTAT(mz, lru) += 1;
+ else
+ MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
SetPageCgroupAcctLRU(pc);
if (mem_cgroup_is_root(pc->mem_cgroup))
return;
* Re: [PATCH 3/4] [BUGFIX] fix memcgroup LRU stat with THP
From: Johannes Weiner @ 2011-01-14 12:14 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:10:42PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> The memory cgroup's LRU statistics should take the size of pages into
> account, because Transparent Hugepage inserts a hugepage into the LRU and
> the zone counter is updated based on the size of the page.
>
> If this number is wrong, memory reclaim will not work well.
>
> Note: only the head page of a THP hugepage is linked into the LRU.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -815,7 +815,10 @@ void mem_cgroup_del_lru_list(struct page
> * removed from global LRU.
> */
> mz = page_cgroup_zoneinfo(pc);
> - MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> + if (!PageTransHuge(page))
> + MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> + else
> + MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
compound_order() returns 0 for !PG_head pages, so that should do the
right thing without checking PageTransHuge(), right?
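For reference, compound_order() at the time reads roughly like this (from
include/linux/mm.h; quoted from memory here, so treat it as a sketch):

static inline int compound_order(struct page *page)
{
	if (!PageHead(page))
		return 0;
	/* the order is stashed in the first tail page */
	return (unsigned long)page[1].lru.prev;
}

So for an order-0 page the update becomes '-= 1 << 0', i.e. '-= 1', and the
PageTransHuge() branch can simply be dropped.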
* Re: [PATCH 3/4] [BUGFIX] fix memcgroup LRU stat with THP
From: Hiroyuki Kamezawa @ 2011-01-14 12:24 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:10:42PM +0900, KAMEZAWA Hiroyuki wrote:
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> The memory cgroup's LRU statistics should take the size of pages into
>> account, because Transparent Hugepage inserts a hugepage into the LRU and
>> the zone counter is updated based on the size of the page.
>>
>> If this number is wrong, memory reclaim will not work well.
>>
>> Note: only the head page of a THP hugepage is linked into the LRU.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> mm/memcontrol.c | 10 ++++++++--
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> Index: mmotm-0107/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>> @@ -815,7 +815,10 @@ void mem_cgroup_del_lru_list(struct page
>> * removed from global LRU.
>> */
>> mz = page_cgroup_zoneinfo(pc);
>> - MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>> + if (!PageTransHuge(page))
>> + MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>> + else
>> + MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
>
> compound_order() returns 0 for !PG_head pages, so that should do the
> right thing without checking PageTransHuge(), right?
You're right. Then I can remove this 'if'. Thank you.
Hmm, I'll check the other places again.
Regards,
-Kame
* [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:15 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
Now, when THP is enabled, memcg's rmdir() function is broken
because move_account() does not support THP pages.
This will cause an accounting leak or an -EBUSY failure at rmdir().
This patch fixes the issue by supporting move_account() of THP pages.
Accounting information will then be moved to the parent at rmdir().
How to test:
79 mount -t cgroup none /cgroup/memory/ -o memory
80 mkdir /cgroup/A/
81 mkdir /cgroup/memory/A
82 mkdir /cgroup/memory/A/B
83 cgexec -g memory:A/B ./malloc 128 &
84 grep anon /cgroup/memory/A/B/memory.stat
85 grep rss /cgroup/memory/A/B/memory.stat
86 echo 1728 > /cgroup/memory/A/tasks
87 grep rss /cgroup/memory/A/memory.stat
88 rmdir /cgroup/memory/A/B/
89 grep rss /cgroup/memory/A/memory.stat
- Create a 2-level directory hierarchy and exec a task that calls malloc(big chunk).
- Move the task somewhere (to its parent cgroup in the above).
- rmdir /A/B
- Check that memory.stat of /A/B is moved to /A after rmdir, and confirm
  that the RSS/LRU information includes the usage that was charged against /A/B.
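(The './malloc' helper used above is not included in the mail. A minimal
stand-in, assuming it simply allocates the given number of MB of anonymous
memory, touches it so THP can back it, and stays alive so memory.stat can
be inspected, could look like:)

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 128;
	char *buf = malloc(mb << 20);

	if (!buf)
		return 1;
	memset(buf, 0, mb << 20);	/* fault the pages in */
	pause();			/* keep the charge alive */
	return 0;
}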
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2154,6 +2154,10 @@ void mem_cgroup_split_huge_fixup(struct
smp_wmb(); /* see __commit_charge() */
SetPageCgroupUsed(tpc);
VM_BUG_ON(PageCgroupCache(hpc));
+ /*
+ * Note: if dirty ratio etc..are supported,
+ * other flags may need to be copied.
+ */
}
#endif
@@ -2175,8 +2179,11 @@ void mem_cgroup_split_huge_fixup(struct
*/
static void __mem_cgroup_move_account(struct page_cgroup *pc,
- struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
+ struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
+ int charge_size)
{
+ int pagenum = charge_size >> PAGE_SHIFT;
+
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!page_is_cgroup_locked(pc));
@@ -2190,14 +2197,14 @@ static void __mem_cgroup_move_account(st
__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
preempt_enable();
}
- mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
+ mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -pagenum);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
- mem_cgroup_cancel_charge(from, PAGE_SIZE);
+ mem_cgroup_cancel_charge(from, charge_size);
/* caller should have done css_get */
pc->mem_cgroup = to;
- mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
+ mem_cgroup_charge_statistics(to, PageCgroupCache(pc), pagenum);
/*
* We charges against "to" which may not have any tasks. Then, "to"
* can be under rmdir(). But in current implementation, caller of
@@ -2212,7 +2219,8 @@ static void __mem_cgroup_move_account(st
* __mem_cgroup_move_account()
*/
static int mem_cgroup_move_account(struct page_cgroup *pc,
- struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
+ struct mem_cgroup *from, struct mem_cgroup *to,
+ bool uncharge, int charge_size)
{
int ret = -EINVAL;
unsigned long flags;
@@ -2220,7 +2228,7 @@ static int mem_cgroup_move_account(struc
lock_page_cgroup(pc);
if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
move_lock_page_cgroup(pc, &flags);
- __mem_cgroup_move_account(pc, from, to, uncharge);
+ __mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
move_unlock_page_cgroup(pc, &flags);
ret = 0;
}
@@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
struct cgroup *cg = child->css.cgroup;
struct cgroup *pcg = cg->parent;
struct mem_cgroup *parent;
+ int charge_size = PAGE_SIZE;
int ret;
/* Is ROOT ? */
@@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
goto out;
if (isolate_lru_page(page))
goto put;
+ /* The page is isolated from LRU and we have no race with splitting */
+ if (PageTransHuge(page))
+ charge_size = PAGE_SIZE << compound_order(page);
parent = mem_cgroup_from_cont(pcg);
ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
- PAGE_SIZE);
+ charge_size);
if (ret || !parent)
goto put_back;
- ret = mem_cgroup_move_account(pc, child, parent, true);
+ ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
if (ret)
- mem_cgroup_cancel_charge(parent, PAGE_SIZE);
+ mem_cgroup_cancel_charge(parent, charge_size);
put_back:
putback_lru_page(page);
put:
@@ -4850,7 +4862,7 @@ retry:
goto put;
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(pc,
- mc.from, mc.to, false)) {
+ mc.from, mc.to, false, PAGE_SIZE)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: Johannes Weiner @ 2011-01-14 12:21 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
>
> Now, when THP is enabled, memcg's rmdir() function is broken
> because move_account() does not support THP pages.
>
> This will cause an accounting leak or an -EBUSY failure at rmdir().
> This patch fixes the issue by supporting move_account() of THP pages.
>
> Accounting information will then be moved to the parent at rmdir().
>
> How to test:
> 79 mount -t cgroup none /cgroup/memory/ -o memory
> 80 mkdir /cgroup/A/
> 81 mkdir /cgroup/memory/A
> 82 mkdir /cgroup/memory/A/B
> 83 cgexec -g memory:A/B ./malloc 128 &
> 84 grep anon /cgroup/memory/A/B/memory.stat
> 85 grep rss /cgroup/memory/A/B/memory.stat
> 86 echo 1728 > /cgroup/memory/A/tasks
> 87 grep rss /cgroup/memory/A/memory.stat
> 88 rmdir /cgroup/memory/A/B/
> 89 grep rss /cgroup/memory/A/memory.stat
>
> - Create a 2-level directory hierarchy and exec a task that calls malloc(big chunk).
> - Move the task somewhere (to its parent cgroup in the above).
> - rmdir /A/B
> - Check that memory.stat of /A/B is moved to /A after rmdir, and confirm
>   that the RSS/LRU information includes the usage that was charged against /A/B.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -2154,6 +2154,10 @@ void mem_cgroup_split_huge_fixup(struct
> smp_wmb(); /* see __commit_charge() */
> SetPageCgroupUsed(tpc);
> VM_BUG_ON(PageCgroupCache(hpc));
> + /*
> + * Note: if dirty ratio etc..are supported,
> + * other flags may need to be copied.
> + */
That's a good comment, but it should be in the patch that introduces
this function and is a bit unrelated in this one.
> }
> #endif
>
> @@ -2175,8 +2179,11 @@ void mem_cgroup_split_huge_fixup(struct
> */
>
> static void __mem_cgroup_move_account(struct page_cgroup *pc,
> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> + struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
> + int charge_size)
> {
> + int pagenum = charge_size >> PAGE_SHIFT;
nr_pages?
> +
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> VM_BUG_ON(!page_is_cgroup_locked(pc));
> @@ -2190,14 +2197,14 @@ static void __mem_cgroup_move_account(st
> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> preempt_enable();
> }
> - mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -pagenum);
> if (uncharge)
> /* This is not "cancel", but cancel_charge does all we need. */
> - mem_cgroup_cancel_charge(from, PAGE_SIZE);
> + mem_cgroup_cancel_charge(from, charge_size);
>
> /* caller should have done css_get */
> pc->mem_cgroup = to;
> - mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), pagenum);
> /*
> * We charges against "to" which may not have any tasks. Then, "to"
> * can be under rmdir(). But in current implementation, caller of
> @@ -2212,7 +2219,8 @@ static void __mem_cgroup_move_account(st
> * __mem_cgroup_move_account()
> */
> static int mem_cgroup_move_account(struct page_cgroup *pc,
> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> + struct mem_cgroup *from, struct mem_cgroup *to,
> + bool uncharge, int charge_size)
> {
> int ret = -EINVAL;
> unsigned long flags;
> @@ -2220,7 +2228,7 @@ static int mem_cgroup_move_account(struc
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> move_lock_page_cgroup(pc, &flags);
> - __mem_cgroup_move_account(pc, from, to, uncharge);
> + __mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
> move_unlock_page_cgroup(pc, &flags);
> ret = 0;
> }
> @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
> struct cgroup *cg = child->css.cgroup;
> struct cgroup *pcg = cg->parent;
> struct mem_cgroup *parent;
> + int charge_size = PAGE_SIZE;
> int ret;
>
> /* Is ROOT ? */
> @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
> goto out;
> if (isolate_lru_page(page))
> goto put;
> + /* The page is isolated from LRU and we have no race with splitting */
> + if (PageTransHuge(page))
> + charge_size = PAGE_SIZE << compound_order(page);
The same as in the previous patch: compound_order() implicitly
handles order-0 pages and should do the right thing without an extra
check.
The comment is valuable, though!
Nitpicks aside:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: Hiroyuki Kamezawa @ 2011-01-14 12:28 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
>>
>> Now, when THP is enabled, memcg's rmdir() function is broken
>> because move_account() does not support THP pages.
>>
>> This will cause an accounting leak or an -EBUSY failure at rmdir().
>> This patch fixes the issue by supporting move_account() of THP pages.
>>
>> Accounting information will then be moved to the parent at rmdir().
>>
>> How to test:
>> 79 mount -t cgroup none /cgroup/memory/ -o memory
>> 80 mkdir /cgroup/A/
>> 81 mkdir /cgroup/memory/A
>> 82 mkdir /cgroup/memory/A/B
>> 83 cgexec -g memory:A/B ./malloc 128 &
>> 84 grep anon /cgroup/memory/A/B/memory.stat
>> 85 grep rss /cgroup/memory/A/B/memory.stat
>> 86 echo 1728 > /cgroup/memory/A/tasks
>> 87 grep rss /cgroup/memory/A/memory.stat
>> 88 rmdir /cgroup/memory/A/B/
>> 89 grep rss /cgroup/memory/A/memory.stat
>>
>> - Create a 2-level directory hierarchy and exec a task that calls malloc(big chunk).
>> - Move the task somewhere (to its parent cgroup in the above).
>> - rmdir /A/B
>> - Check that memory.stat of /A/B is moved to /A after rmdir, and confirm
>>   that the RSS/LRU information includes the usage that was charged against /A/B.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> mm/memcontrol.c | 32 ++++++++++++++++++++++----------
>> 1 file changed, 22 insertions(+), 10 deletions(-)
>>
>> Index: mmotm-0107/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>> @@ -2154,6 +2154,10 @@ void mem_cgroup_split_huge_fixup(struct
>> smp_wmb(); /* see __commit_charge() */
>> SetPageCgroupUsed(tpc);
>> VM_BUG_ON(PageCgroupCache(hpc));
>> + /*
>> + * Note: if dirty ratio etc..are supported,
>> + * other flags may need to be copied.
>> + */
>
> That's a good comment, but it should be in the patch that introduces
> this function and is a bit unrelated in this one.
>
Ok. I'll remove this. This is an alarm for Greg ;)
>> }
>> #endif
>>
>> @@ -2175,8 +2179,11 @@ void mem_cgroup_split_huge_fixup(struct
>> */
>>
>> static void __mem_cgroup_move_account(struct page_cgroup *pc,
>> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>> + struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
>> + int charge_size)
>> {
>> + int pagenum = charge_size >> PAGE_SHIFT;
>
> nr_pages?
>
Ok. replace pagenum <-> nr_pages.
>> +
>> VM_BUG_ON(from == to);
>> VM_BUG_ON(PageLRU(pc->page));
>> VM_BUG_ON(!page_is_cgroup_locked(pc));
>> @@ -2190,14 +2197,14 @@ static void __mem_cgroup_move_account(st
>> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> preempt_enable();
>> }
>> - mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
>> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -pagenum);
>> if (uncharge)
>> /* This is not "cancel", but cancel_charge does all we need. */
>> - mem_cgroup_cancel_charge(from, PAGE_SIZE);
>> + mem_cgroup_cancel_charge(from, charge_size);
>>
>> /* caller should have done css_get */
>> pc->mem_cgroup = to;
>> - mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
>> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), pagenum);
>> /*
>> * We charges against "to" which may not have any tasks. Then, "to"
>> * can be under rmdir(). But in current implementation, caller of
>> @@ -2212,7 +2219,8 @@ static void __mem_cgroup_move_account(st
>> * __mem_cgroup_move_account()
>> */
>> static int mem_cgroup_move_account(struct page_cgroup *pc,
>> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>> + struct mem_cgroup *from, struct mem_cgroup *to,
>> + bool uncharge, int charge_size)
>> {
>> int ret = -EINVAL;
>> unsigned long flags;
>> @@ -2220,7 +2228,7 @@ static int mem_cgroup_move_account(struc
>> lock_page_cgroup(pc);
>> if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
>> move_lock_page_cgroup(pc, &flags);
>> - __mem_cgroup_move_account(pc, from, to, uncharge);
>> + __mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
>> move_unlock_page_cgroup(pc, &flags);
>> ret = 0;
>> }
>> @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
>> struct cgroup *cg = child->css.cgroup;
>> struct cgroup *pcg = cg->parent;
>> struct mem_cgroup *parent;
>> + int charge_size = PAGE_SIZE;
>> int ret;
>>
>> /* Is ROOT ? */
>> @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
>> goto out;
>> if (isolate_lru_page(page))
>> goto put;
>> + /* The page is isolated from LRU and we have no race with splitting */
>> + if (PageTransHuge(page))
>> + charge_size = PAGE_SIZE << compound_order(page);
>
> The same as in the previous patch: compound_order() implicitly
> handles order-0 pages and should do the right thing without an extra
> check.
>
Sure.
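I.e. simply (a sketch):

	charge_size = PAGE_SIZE << compound_order(page);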
> The comment is valuable, though!
>
> Nitpicks aside:
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thank you for the quick review!
An updated version will be posted next week, after some more testing.
Regards,
-Kame
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: Daisuke Nishimura @ 2011-01-17 0:15 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, Greg Thelen, hannes, aarcange,
akpm@linux-foundation.org, Daisuke Nishimura
Hi, thank you for your great work!
I've not read this series in detail, but one quick comment on move_parent.
> @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
> struct cgroup *cg = child->css.cgroup;
> struct cgroup *pcg = cg->parent;
> struct mem_cgroup *parent;
> + int charge_size = PAGE_SIZE;
> int ret;
>
> /* Is ROOT ? */
> @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
> goto out;
> if (isolate_lru_page(page))
> goto put;
> + /* The page is isolated from LRU and we have no race with splitting */
> + if (PageTransHuge(page))
> + charge_size = PAGE_SIZE << compound_order(page);
>
> parent = mem_cgroup_from_cont(pcg);
> ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> - PAGE_SIZE);
> + charge_size);
> if (ret || !parent)
> goto put_back;
>
> - ret = mem_cgroup_move_account(pc, child, parent, true);
> + ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
> if (ret)
> - mem_cgroup_cancel_charge(parent, PAGE_SIZE);
> + mem_cgroup_cancel_charge(parent, charge_size);
> put_back:
> putback_lru_page(page);
> put:
I think there is a possibility that the page is split after the "if (PageTransHuge(page))" check.
In RHEL6, this part looks like:
1598 if (PageTransHuge(page))
1599 page_size = PAGE_SIZE << compound_order(page);
1600
1601 ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
1602 page_size);
1603 if (ret || !parent)
1604 return ret;
1605
1606 if (!get_page_unless_zero(page)) {
1607 ret = -EBUSY;
1608 goto uncharge;
1609 }
1610
1611 ret = isolate_lru_page(page);
1612
1613 if (ret)
1614 goto cancel;
1615
1616 compound_lock_irqsave(page, &flags);
1617 ret = mem_cgroup_move_account(pc, child, parent, page_size);
1618 compound_unlock_irqrestore(page, flags);
1619
In fact, I found a res_counter underflow bug around here, and I've already sent
a patch to Red Hat.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
In mem_cgroup_move_parent(), the page can be split by another context after we
check PageTransHuge() and before we take the compound_lock of the page.
This means a race can happen like:
__split_huge_page_refcount()            mem_cgroup_move_parent()
---------------------------------------------------------------------------
                                        if (PageTransHuge())
                                          -> true
                                          -> set "page_size" to huge
                                             page size.
                                        __mem_cgroup_try_charge()
                                          -> charge "page_size" to the
                                             parent.
compound_lock()
mem_cgroup_split_hugepage_commit()
  -> commit all the tail pages to the
     "current" (i.e. child) cgroup.
     iow, pc->mem_cgroup of tail pages
     point to the child.
ClearPageCompound()
compound_unlock()
                                        compound_lock()
                                        mem_cgroup_move_account()
                                          -> make pc->mem_cgroup of the
                                             head page point to the parent.
                                          -> uncharge "page_size" from
                                             the child.
                                        compound_unlock()
This can cause at least 2 problems.
1. Tail pages are linked to the LRU of the child, even though their usages
   (res_counter) have already been uncharged from the child. This causes a
   res_counter underflow when removing the child directory.
2. The usage of the parent is increased by the huge page size when moving
   the charge of the head page, but it will be decreased only by the normal
   page size when the head page is uncharged later, because the page is not
   PageTransHuge() anymore. This means the parent doesn't have enough pages
   on its LRU to decrease the usage to 0, so it cannot be rmdir'ed.
This patch fixes the problem by re-checking PageTransHuge() under the
compound_lock.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
diff -uprN linux-2.6.32.x86_64.org/mm/memcontrol.c linux-2.6.32.x86_64/mm/memcontrol.c
--- linux-2.6.32.x86_64.org/mm/memcontrol.c 2010-07-15 16:44:57.000000000 +0900
+++ linux-2.6.32.x86_64/mm/memcontrol.c 2010-07-15 17:34:12.000000000 +0900
@@ -1608,6 +1608,17 @@ static int mem_cgroup_move_parent(struct
goto cancel;
compound_lock_irqsave(page, &flags);
+ /* re-check under compound_lock because the page might be split */
+ if (unlikely(page_size != PAGE_SIZE && !PageTransHuge(page))) {
+ unsigned long extra = page_size - PAGE_SIZE;
+ /* uncharge extra charges from parent */
+ if (!mem_cgroup_is_root(parent)) {
+ res_counter_uncharge(&parent->res, extra);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, extra);
+ }
+ page_size = PAGE_SIZE;
+ }
ret = mem_cgroup_move_account(pc, child, parent, page_size);
compound_unlock_irqrestore(page, flags);
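Mapped onto the mmotm-based version in this series, the same re-check might
look roughly like the following (a sketch that assumes mem_cgroup_move_parent()
grows a compound_lock section in the next version; the actual follow-up may
differ):

	compound_lock_irqsave(page, &flags);
	/* re-check: the page may have been split since PageTransHuge() */
	if (unlikely(charge_size != PAGE_SIZE && !PageTransHuge(page))) {
		/* give back the extra charge taken for the tail pages */
		mem_cgroup_cancel_charge(parent, charge_size - PAGE_SIZE);
		charge_size = PAGE_SIZE;
	}
	ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
	compound_unlock_irqrestore(page, flags);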
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: KAMEZAWA Hiroyuki @ 2011-01-17 0:25 UTC
To: Daisuke Nishimura
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, Greg Thelen, hannes, aarcange,
akpm@linux-foundation.org
On Mon, 17 Jan 2011 09:15:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> Hi, thank you for your great work!
>
> I've not read this series in detail, but one quick comment on move_parent.
>
> > @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
> > struct cgroup *cg = child->css.cgroup;
> > struct cgroup *pcg = cg->parent;
> > struct mem_cgroup *parent;
> > + int charge_size = PAGE_SIZE;
> > int ret;
> >
> > /* Is ROOT ? */
> > @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
> > goto out;
> > if (isolate_lru_page(page))
> > goto put;
> > + /* The page is isolated from LRU and we have no race with splitting */
> > + if (PageTransHuge(page))
> > + charge_size = PAGE_SIZE << compound_order(page);
> >
> > parent = mem_cgroup_from_cont(pcg);
> > ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> > - PAGE_SIZE);
> > + charge_size);
> > if (ret || !parent)
> > goto put_back;
> >
> > - ret = mem_cgroup_move_account(pc, child, parent, true);
> > + ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
> > if (ret)
> > - mem_cgroup_cancel_charge(parent, PAGE_SIZE);
> > + mem_cgroup_cancel_charge(parent, charge_size);
> > put_back:
> > putback_lru_page(page);
> > put:
> I think there is a possibility that the page is split after the "if (PageTransHuge(page))" check.
>
> In RHEL6, this part looks like:
>
> 1598 if (PageTransHuge(page))
> 1599 page_size = PAGE_SIZE << compound_order(page);
> 1600
> 1601 ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
> 1602 page_size);
> 1603 if (ret || !parent)
> 1604 return ret;
> 1605
> 1606 if (!get_page_unless_zero(page)) {
> 1607 ret = -EBUSY;
> 1608 goto uncharge;
> 1609 }
> 1610
> 1611 ret = isolate_lru_page(page);
> 1612
> 1613 if (ret)
> 1614 goto cancel;
> 1615
> 1616 compound_lock_irqsave(page, &flags);
> 1617 ret = mem_cgroup_move_account(pc, child, parent, page_size);
> 1618 compound_unlock_irqrestore(page, flags);
> 1619
>
> In fact, I found a res_counter underflow bug around here, and I've already sent
> a patch to Red Hat.
>
Okay, I'll take care of that in the next version.
Thanks,
-Kame