* [PATCH 1/4] [BUGFIX] enhance charge_statistics function for fixing issues
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:06 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
mem_cgroup_charge_statistics() was designed for charging a single page, but
now we have transparent hugepages. To fix problems (in the following patches),
the function must be changed to take the number of pages as an argument.
The new function takes the following arguments:
- the type of page rather than 'pc'
- the number of pages being accounted
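For illustration, the new signature is meant to be used like this once THP
call sites pass real page counts (a sketch; in this patch all callers still
pass 1 or -1, and hugepage counts only arrive later in the series):

	/* charge one anonymous page */
	mem_cgroup_charge_statistics(mem, false, 1);
	/* uncharge a 2MB hugepage, i.e. 512 base pages */
	mem_cgroup_charge_statistics(mem, false, -512);

The sign encodes charge vs. uncharge; the magnitude is the number of pages.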
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -600,23 +600,23 @@ static void mem_cgroup_swap_statistics(s
}
static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- bool charge)
+ bool file,
+ int pages)
{
- int val = (charge) ? 1 : -1;
-
preempt_disable();
- if (PageCgroupCache(pc))
- __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], val);
+ if (file)
+ __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], pages);
else
- __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
+ __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], pages);
- if (charge)
+ /* pagein of a big page is an event. So, ignore page size */
+ if (pages > 0)
__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
else
__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
+
+ __this_cpu_add(mem->stat->count[MEM_CGROUP_EVENTS], pages);
preempt_enable();
}
@@ -2092,6 +2092,7 @@ static void ____mem_cgroup_commit_charge
struct page_cgroup *pc,
enum charge_type ctype)
{
+ bool file = false;
pc->mem_cgroup = mem;
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
@@ -2106,6 +2107,7 @@ static void ____mem_cgroup_commit_charge
case MEM_CGROUP_CHARGE_TYPE_SHMEM:
SetPageCgroupCache(pc);
SetPageCgroupUsed(pc);
+ file = true;
break;
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
ClearPageCgroupCache(pc);
@@ -2115,7 +2117,7 @@ static void ____mem_cgroup_commit_charge
break;
}
- mem_cgroup_charge_statistics(mem, pc, true);
+ mem_cgroup_charge_statistics(mem, file, 1);
}
static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
@@ -2186,14 +2188,14 @@ static void __mem_cgroup_move_account(st
__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
preempt_enable();
}
- mem_cgroup_charge_statistics(from, pc, false);
+ mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
mem_cgroup_cancel_charge(from, PAGE_SIZE);
/* caller should have done css_get */
pc->mem_cgroup = to;
- mem_cgroup_charge_statistics(to, pc, true);
+ mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
/*
* We charges against "to" which may not have any tasks. Then, "to"
* can be under rmdir(). But in current implementation, caller of
@@ -2551,6 +2553,7 @@ __mem_cgroup_uncharge_common(struct page
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
int page_size = PAGE_SIZE;
+ bool file = false;
if (mem_cgroup_disabled())
return NULL;
@@ -2578,6 +2581,9 @@ __mem_cgroup_uncharge_common(struct page
if (!PageCgroupUsed(pc))
goto unlock_out;
+ if (PageCgroupCache(pc))
+ file = true;
+
switch (ctype) {
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
case MEM_CGROUP_CHARGE_TYPE_DROP:
@@ -2597,7 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
}
for (i = 0; i < count; i++)
- mem_cgroup_charge_statistics(mem, pc + i, false);
+ mem_cgroup_charge_statistics(mem, file, -1);
ClearPageCgroupUsed(pc);
/*
* Re: [PATCH 1/4] [BUGFIX] enhance charge_statistics function for fixing issues
From: Johannes Weiner @ 2011-01-14 11:51 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:06:44PM +0900, KAMEZAWA Hiroyuki wrote:
> mem_cgroup_charge_statistics() was designed for charging a single page, but
> now we have transparent hugepages. To fix problems (in the following patches),
> the function must be changed to take the number of pages as an argument.
>
> The new function takes the following arguments:
> - the type of page rather than 'pc'
> - the number of pages being accounted
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I agree with the patch in general; below are only a few nitpicks.
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -600,23 +600,23 @@ static void mem_cgroup_swap_statistics(s
> }
>
> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
> - struct page_cgroup *pc,
> - bool charge)
> + bool file,
> + int pages)
I think 'nr_pages' would be a better name. This makes me think of a
'struct page *[]'.
> {
> - int val = (charge) ? 1 : -1;
> -
> preempt_disable();
>
> - if (PageCgroupCache(pc))
> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], val);
> + if (file)
> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], pages);
> else
> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], pages);
>
> - if (charge)
> + /* pagein of a big page is an event. So, ignore page size */
> + if (pages > 0)
> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
> else
> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
> - __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
> +
> + __this_cpu_add(mem->stat->count[MEM_CGROUP_EVENTS], pages);
>
> preempt_enable();
> }
> @@ -2092,6 +2092,7 @@ static void ____mem_cgroup_commit_charge
> struct page_cgroup *pc,
> enum charge_type ctype)
> {
> + bool file = false;
> pc->mem_cgroup = mem;
> /*
> * We access a page_cgroup asynchronously without lock_page_cgroup().
> @@ -2106,6 +2107,7 @@ static void ____mem_cgroup_commit_charge
> case MEM_CGROUP_CHARGE_TYPE_SHMEM:
> SetPageCgroupCache(pc);
> SetPageCgroupUsed(pc);
> + file = true;
> break;
> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> ClearPageCgroupCache(pc);
> @@ -2115,7 +2117,7 @@ static void ____mem_cgroup_commit_charge
> break;
> }
>
> - mem_cgroup_charge_statistics(mem, pc, true);
> + mem_cgroup_charge_statistics(mem, file, 1);
The extra local variable is a bit awkward, since there are already
several sources of this information (ctype and pc->flags).
Could you keep it like the other sites, just pass PageCgroupCache()
here as well?
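I.e. something like (a sketch of the suggestion):

	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), 1);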
> @@ -2186,14 +2188,14 @@ static void __mem_cgroup_move_account(st
> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> preempt_enable();
> }
> - mem_cgroup_charge_statistics(from, pc, false);
> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
> if (uncharge)
> /* This is not "cancel", but cancel_charge does all we need. */
> mem_cgroup_cancel_charge(from, PAGE_SIZE);
>
> /* caller should have done css_get */
> pc->mem_cgroup = to;
> - mem_cgroup_charge_statistics(to, pc, true);
> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
> /*
> * We charges against "to" which may not have any tasks. Then, "to"
> * can be under rmdir(). But in current implementation, caller of
> @@ -2551,6 +2553,7 @@ __mem_cgroup_uncharge_common(struct page
> struct page_cgroup *pc;
> struct mem_cgroup *mem = NULL;
> int page_size = PAGE_SIZE;
> + bool file = false;
>
> if (mem_cgroup_disabled())
> return NULL;
> @@ -2578,6 +2581,9 @@ __mem_cgroup_uncharge_common(struct page
> if (!PageCgroupUsed(pc))
> goto unlock_out;
>
> + if (PageCgroupCache(pc))
> + file = true;
> +
> switch (ctype) {
> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
> case MEM_CGROUP_CHARGE_TYPE_DROP:
> @@ -2597,7 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
> }
>
> for (i = 0; i < count; i++)
> - mem_cgroup_charge_statistics(mem, pc + i, false);
> + mem_cgroup_charge_statistics(mem, file, -1);
I see you get rid of this loop in the next patch, anyway. Can you
just use PageCgroupCache() instead of the extra variable?
* Re: [PATCH 1/4] [BUGFIX] enhance charge_statistics function for fixing issues
From: Hiroyuki Kamezawa @ 2011-01-14 12:07 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:06:44PM +0900, KAMEZAWA Hiroyuki wrote:
>> mem_cgroup_charge_statistics() was designed for charging a single page, but
>> now we have transparent hugepages. To fix problems (in the following patches),
>> the function must be changed to take the number of pages as an argument.
>>
>> The new function takes the following arguments:
>> - the type of page rather than 'pc'
>> - the number of pages being accounted
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> I agree with the patch in general; below are only a few nitpicks.
>
Thanks. I think the details should be updated, too.
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>> @@ -600,23 +600,23 @@ static void mem_cgroup_swap_statistics(s
>> }
>>
>> static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
>> - struct page_cgroup *pc,
>> - bool charge)
>> + bool file,
>> + int pages)
>
> I think 'nr_pages' would be a better name. This makes me think of a
> 'struct page *[]'.
>
ok.
>> {
>> - int val = (charge) ? 1 : -1;
>> -
>> preempt_disable();
>>
>> - if (PageCgroupCache(pc))
>> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], val);
>> + if (file)
>> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_CACHE], pages);
>> else
>> - __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], val);
>> + __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_RSS], pages);
>>
>> - if (charge)
>> + /* pagein of a big page is an event. So, ignore page size */
>> + if (pages > 0)
>> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGIN_COUNT]);
>> else
>> __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_PGPGOUT_COUNT]);
>> - __this_cpu_inc(mem->stat->count[MEM_CGROUP_EVENTS]);
>> +
>> + __this_cpu_add(mem->stat->count[MEM_CGROUP_EVENTS], pages);
>>
>> preempt_enable();
>> }
>> @@ -2092,6 +2092,7 @@ static void ____mem_cgroup_commit_charge
>> struct page_cgroup *pc,
>> enum charge_type ctype)
>> {
>> + bool file = false;
>> pc->mem_cgroup = mem;
>> /*
>> * We access a page_cgroup asynchronously without lock_page_cgroup().
>> @@ -2106,6 +2107,7 @@ static void ____mem_cgroup_commit_charge
>> case MEM_CGROUP_CHARGE_TYPE_SHMEM:
>> SetPageCgroupCache(pc);
>> SetPageCgroupUsed(pc);
>> + file = true;
>> break;
>> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
>> ClearPageCgroupCache(pc);
>> @@ -2115,7 +2117,7 @@ static void ____mem_cgroup_commit_charge
>> break;
>> }
>>
>> - mem_cgroup_charge_statistics(mem, pc, true);
>> + mem_cgroup_charge_statistics(mem, file, 1);
>
> The extra local variable is a bit awkward, since there are already
> several sources of this information (ctype and pc->flags).
>
> Could you keep it like the other sites, just pass PageCgroupCache()
> here as well?
>
Ok.
>> @@ -2186,14 +2188,14 @@ static void __mem_cgroup_move_account(st
>> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> preempt_enable();
>> }
>> - mem_cgroup_charge_statistics(from, pc, false);
>> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
>> if (uncharge)
>> /* This is not "cancel", but cancel_charge does all we need. */
>> mem_cgroup_cancel_charge(from, PAGE_SIZE);
>>
>> /* caller should have done css_get */
>> pc->mem_cgroup = to;
>> - mem_cgroup_charge_statistics(to, pc, true);
>> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
>> /*
>> * We charges against "to" which may not have any tasks. Then, "to"
>> * can be under rmdir(). But in current implementation, caller of
>> @@ -2551,6 +2553,7 @@ __mem_cgroup_uncharge_common(struct page
>> struct page_cgroup *pc;
>> struct mem_cgroup *mem = NULL;
>> int page_size = PAGE_SIZE;
>> + bool file = false;
>>
>> if (mem_cgroup_disabled())
>> return NULL;
>> @@ -2578,6 +2581,9 @@ __mem_cgroup_uncharge_common(struct page
>> if (!PageCgroupUsed(pc))
>> goto unlock_out;
>>
>> + if (PageCgroupCache(pc))
>> + file = true;
>> +
>> switch (ctype) {
>> case MEM_CGROUP_CHARGE_TYPE_MAPPED:
>> case MEM_CGROUP_CHARGE_TYPE_DROP:
>> @@ -2597,7 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
>> }
>>
>> for (i = 0; i < count; i++)
>> - mem_cgroup_charge_statistics(mem, pc + i, false);
>> + mem_cgroup_charge_statistics(mem, file, -1);
>
> I see you get rid of this loop in the next patch, anyway. Can you
> just use PageCgroupCache() instead of the extra variable?
will do.
Thanks,
-Kame
* [PATCH 2/4] [BUGFIX] dont set USED bit on tail pages
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:09 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, under THP:
at charge:
- the PageCgroupUsed bit is set on every page_cgroup of a hugepage,
  i.e. on all 512 base pages.
at uncharge:
- the PageCgroupUsed bit is unset only on the head page.
So the tail pages remain with the "Used" bit set.
This patch fixes this: the Used bit is now set only on the head page,
and Used bits for tail pages will be set at split time if necessary.
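To illustrate the imbalance (a pseudo-code sketch, assuming a 2MB hugepage
backed by 512 base pages):

	/* at charge, before this patch: */
	for (i = 0; i < 512; i++)
		SetPageCgroupUsed(pc + i);	/* all 512 marked Used */
	/* at uncharge: */
	ClearPageCgroupUsed(pc);		/* only pc[0] is cleared */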
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 4 ++
mm/huge_memory.c | 2 +
mm/memcontrol.c | 80 ++++++++++++++++++++++-----------------------
3 files changed, 46 insertions(+), 40 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2084,15 +2084,28 @@ struct mem_cgroup *try_get_mem_cgroup_fr
return mem;
}
-/*
- * commit a charge got by __mem_cgroup_try_charge() and makes page_cgroup to be
- * USED state. If already USED, uncharge and return.
- */
-static void ____mem_cgroup_commit_charge(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- enum charge_type ctype)
+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+ struct page_cgroup *pc,
+ enum charge_type ctype,
+ int page_size)
{
- bool file = false;
+ int count = page_size >> PAGE_SHIFT;
+
+ /* try_charge() can return NULL to *memcg, taking care of it. */
+ if (!mem)
+ return;
+
+ lock_page_cgroup(pc);
+ if (unlikely(PageCgroupUsed(pc))) {
+ unlock_page_cgroup(pc);
+ mem_cgroup_cancel_charge(mem, page_size);
+ return;
+ }
+
+ /*
+ * we don't need page_cgroup_lock about tail pages, because they are not
+ * accessed by any other context at this point.
+ */
pc->mem_cgroup = mem;
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
@@ -2107,7 +2120,6 @@ static void ____mem_cgroup_commit_charge
case MEM_CGROUP_CHARGE_TYPE_SHMEM:
SetPageCgroupCache(pc);
SetPageCgroupUsed(pc);
- file = true;
break;
case MEM_CGROUP_CHARGE_TYPE_MAPPED:
ClearPageCgroupCache(pc);
@@ -2117,34 +2129,7 @@ static void ____mem_cgroup_commit_charge
break;
}
- mem_cgroup_charge_statistics(mem, file, 1);
-}
-
-static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
- struct page_cgroup *pc,
- enum charge_type ctype,
- int page_size)
-{
- int i;
- int count = page_size >> PAGE_SHIFT;
-
- /* try_charge() can return NULL to *memcg, taking care of it. */
- if (!mem)
- return;
-
- lock_page_cgroup(pc);
- if (unlikely(PageCgroupUsed(pc))) {
- unlock_page_cgroup(pc);
- mem_cgroup_cancel_charge(mem, page_size);
- return;
- }
-
- /*
- * we don't need page_cgroup_lock about tail pages, becase they are not
- * accessed by any other context at this point.
- */
- for (i = 0; i < count; i++)
- ____mem_cgroup_commit_charge(mem, pc + i, ctype);
+ mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), count);
unlock_page_cgroup(pc);
/*
@@ -2154,6 +2139,23 @@ static void __mem_cgroup_commit_charge(s
*/
memcg_check_events(mem, pc->page);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Because tail pages are not mared as "used", set it. We're under
+ * compund_lock and don't need to take care of races.
+ * Statistics are updated properly at charging. We just mark Used bits.
+ */
+void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
+{
+ struct page_cgroup *hpc = lookup_page_cgroup(head);
+ struct page_cgroup *tpc = lookup_page_cgroup(tail);
+
+ tpc->mem_cgroup = hpc->mem_cgroup;
+ smp_wmb(); /* see __commit_charge() */
+ SetPageCgroupUsed(tpc);
+ VM_BUG_ON(PageCgroupCache(hpc));
+}
+#endif
/**
* __mem_cgroup_move_account - move account of the page
@@ -2548,7 +2550,6 @@ direct_uncharge:
static struct mem_cgroup *
__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
{
- int i;
int count;
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
@@ -2602,8 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
break;
}
- for (i = 0; i < count; i++)
- mem_cgroup_charge_statistics(mem, file, -1);
+ mem_cgroup_charge_statistics(mem, file, -count);
ClearPageCgroupUsed(pc);
/*
Index: mmotm-0107/include/linux/memcontrol.h
===================================================================
--- mmotm-0107.orig/include/linux/memcontrol.h
+++ mmotm-0107/include/linux/memcontrol.h
@@ -146,6 +146,10 @@ unsigned long mem_cgroup_soft_limit_recl
gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
+#endif
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
Index: mmotm-0107/mm/huge_memory.c
===================================================================
--- mmotm-0107.orig/mm/huge_memory.c
+++ mmotm-0107/mm/huge_memory.c
@@ -1203,6 +1203,8 @@ static void __split_huge_page_refcount(s
BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));
+ mem_cgroup_split_huge_fixup(page, page_tail);
+
lru_add_page_tail(zone, page, page_tail);
}
* Re: [PATCH 2/4] [BUGFIX] dont set USED bit on tail pages
From: Johannes Weiner @ 2011-01-14 12:09 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:09:09PM +0900, KAMEZAWA Hiroyuki wrote:
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -2154,6 +2139,23 @@ static void __mem_cgroup_commit_charge(s
> */
> memcg_check_events(mem, pc->page);
> }
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Because tail pages are not mared as "used", set it. We're under
marked
> + * compund_lock and don't need to take care of races.
> + * Statistics are updated properly at charging. We just mark Used bits.
> + */
> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
> +{
> + struct page_cgroup *hpc = lookup_page_cgroup(head);
> + struct page_cgroup *tpc = lookup_page_cgroup(tail);
I have trouble reading the code fluently with those names as they are
just very similar random letter sequences. Could you rename them so
that they're better to discriminate? headpc and tailpc perhaps?
> + tpc->mem_cgroup = hpc->mem_cgroup;
> + smp_wmb(); /* see __commit_charge() */
> + SetPageCgroupUsed(tpc);
> + VM_BUG_ON(PageCgroupCache(hpc));
Right now, this would be a bug due to other circumstances, but this
function does not require the page to be anon to function correctly,
does it? I don't think we should encode a made up dependency here.
> @@ -2602,8 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
> break;
> }
>
> - for (i = 0; i < count; i++)
> - mem_cgroup_charge_statistics(mem, file, -1);
> + mem_cgroup_charge_statistics(mem, file, -count);
Pass PageCgroupCache(pc) instead, ditch the `file' variable?
> ClearPageCgroupUsed(pc);
> /*
> Index: mmotm-0107/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0107.orig/include/linux/memcontrol.h
> +++ mmotm-0107/include/linux/memcontrol.h
> @@ -146,6 +146,10 @@ unsigned long mem_cgroup_soft_limit_recl
> gfp_t gfp_mask);
> u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> +#endif
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> Index: mmotm-0107/mm/huge_memory.c
> ===================================================================
> --- mmotm-0107.orig/mm/huge_memory.c
> +++ mmotm-0107/mm/huge_memory.c
> @@ -1203,6 +1203,8 @@ static void __split_huge_page_refcount(s
> BUG_ON(!PageDirty(page_tail));
> BUG_ON(!PageSwapBacked(page_tail));
>
> + mem_cgroup_split_huge_fixup(page, page_tail);
You need to provide a dummy for non-memcg configurations.
* Re: [PATCH 2/4] [BUGFIX] dont set USED bit on tail pages
From: Hiroyuki Kamezawa @ 2011-01-14 12:23 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
Thank you for the review.
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:09:09PM +0900, KAMEZAWA Hiroyuki wrote:
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>
>> @@ -2154,6 +2139,23 @@ static void __mem_cgroup_commit_charge(s
>> */
>> memcg_check_events(mem, pc->page);
>> }
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +/*
>> + * Because tail pages are not mared as "used", set it. We're under
>
> marked
>
will fix.
>> + * compund_lock and don't need to take care of races.
>> + * Statistics are updated properly at charging. We just mark Used bits.
>> + */
>> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
>> +{
>> + struct page_cgroup *hpc = lookup_page_cgroup(head);
>> + struct page_cgroup *tpc = lookup_page_cgroup(tail);
>
> I have trouble reading the code fluently with those names as they are
> just very similar random letter sequences. Could you rename them so
> that they're better to discriminate? headpc and tailpc perhaps?
>
ok. I'll use headpc,tailpc.
>> + tpc->mem_cgroup = hpc->mem_cgroup;
>> + smp_wmb(); /* see __commit_charge() */
>> + SetPageCgroupUsed(tpc);
>> + VM_BUG_ON(PageCgroupCache(hpc));
>
> Right now, this would be a bug due to other circumstances, but this
> function does not require the page to be anon to function correctly,
> does it?
No.
> I don't think we should encode a made up dependency here.
>
Ok, I'll remove BUG_ON.
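Folding in both points, the helper could end up roughly like this in the
next version (a sketch, not the actual v2 patch):

void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
{
	struct page_cgroup *headpc = lookup_page_cgroup(head);
	struct page_cgroup *tailpc = lookup_page_cgroup(tail);

	/* copy the owning cgroup from the head to the tail page_cgroup */
	tailpc->mem_cgroup = headpc->mem_cgroup;
	smp_wmb();	/* see __commit_charge() */
	SetPageCgroupUsed(tailpc);
}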
>> @@ -2602,8 +2603,7 @@ __mem_cgroup_uncharge_common(struct page
>> break;
>> }
>>
>> - for (i = 0; i < count; i++)
>> - mem_cgroup_charge_statistics(mem, file, -1);
>> + mem_cgroup_charge_statistics(mem, file, -count);
>
> Pass PageCgroupCache(pc) instead, ditch the `file' variable?
>
will do.
>> ClearPageCgroupUsed(pc);
>> /*
>> Index: mmotm-0107/include/linux/memcontrol.h
>> ===================================================================
>> --- mmotm-0107.orig/include/linux/memcontrol.h
>> +++ mmotm-0107/include/linux/memcontrol.h
>> @@ -146,6 +146,10 @@ unsigned long mem_cgroup_soft_limit_recl
>> gfp_t gfp_mask);
>> u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
>> +#endif
>> +
>> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>> struct mem_cgroup;
>>
>> Index: mmotm-0107/mm/huge_memory.c
>> ===================================================================
>> --- mmotm-0107.orig/mm/huge_memory.c
>> +++ mmotm-0107/mm/huge_memory.c
>> @@ -1203,6 +1203,8 @@ static void __split_huge_page_refcount(s
>> BUG_ON(!PageDirty(page_tail));
>> BUG_ON(!PageSwapBacked(page_tail));
>>
>> + mem_cgroup_split_huge_fixup(page, page_tail);
>
> You need to provide a dummy for non-memcg configurations.
Ahhh, yes. I'll add one.
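Such a stub would presumably follow the usual pattern on the
!CONFIG_CGROUP_MEM_RES_CTLR side of memcontrol.h (a sketch):

static inline void mem_cgroup_split_huge_fixup(struct page *head,
					       struct page *tail)
{
}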
Thanks,
-Kame
* [PATCH 3/4] [BUGFIX] fix memcgroup LRU stat with THP
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:10 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The memory cgroup's LRU statistics should take the size of pages into
account, because Transparent Hugepage inserts a hugepage into the LRU and
the zone counter is updated based on the size of the page.
If this number is wrong, memory reclaim will not work well.
Note: only the head page of a THP hugepage is linked into the LRU.
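(For scale: a 2MB hugepage with 4KB base pages has compound_order(page) == 9,
so each head-page LRU operation must move the counter by 1 << 9 == 512.)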
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -815,7 +815,10 @@ void mem_cgroup_del_lru_list(struct page
* removed from global LRU.
*/
mz = page_cgroup_zoneinfo(pc);
- MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ if (!PageTransHuge(page))
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1;
+ else
+ MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
if (mem_cgroup_is_root(pc->mem_cgroup))
return;
VM_BUG_ON(list_empty(&pc->lru));
@@ -866,7 +869,10 @@ void mem_cgroup_add_lru_list(struct page
return;
mz = page_cgroup_zoneinfo(pc);
- MEM_CGROUP_ZSTAT(mz, lru) += 1;
+ if (!PageTransHuge(page))
+ MEM_CGROUP_ZSTAT(mz, lru) += 1;
+ else
+ MEM_CGROUP_ZSTAT(mz, lru) += 1 << compound_order(page);
SetPageCgroupAcctLRU(pc);
if (mem_cgroup_is_root(pc->mem_cgroup))
return;
* Re: [PATCH 3/4] [BUGFIX] fix memcgroup LRU stat with THP
From: Johannes Weiner @ 2011-01-14 12:14 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:10:42PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> The memory cgroup's LRU statistics should take the size of pages into
> account, because Transparent Hugepage inserts a hugepage into the LRU and
> the zone counter is updated based on the size of the page.
>
> If this number is wrong, memory reclaim will not work well.
>
> Note: only the head page of a THP hugepage is linked into the LRU.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -815,7 +815,10 @@ void mem_cgroup_del_lru_list(struct page
> * removed from global LRU.
> */
> mz = page_cgroup_zoneinfo(pc);
> - MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> + if (!PageTransHuge(page))
> + MEM_CGROUP_ZSTAT(mz, lru) -= 1;
> + else
> + MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
compound_order() returns 0 for !PG_head pages, so that should do the
right thing without checking PageTransHuge(), right?
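For reference, compound_order() at the time reads roughly like this (from
include/linux/mm.h; quoted from memory here, so treat it as a sketch):

static inline int compound_order(struct page *page)
{
	if (!PageHead(page))
		return 0;
	/* the order is stashed in the first tail page */
	return (unsigned long)page[1].lru.prev;
}

So for an order-0 page the update becomes '-= 1 << 0', i.e. '-= 1', and the
PageTransHuge() branch can simply be dropped.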
* Re: [PATCH 3/4] [BUGFIX] fix memcgroup LRU stat with THP
From: Hiroyuki Kamezawa @ 2011-01-14 12:24 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:10:42PM +0900, KAMEZAWA Hiroyuki wrote:
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> The memory cgroup's LRU statistics should take the size of pages into
>> account, because Transparent Hugepage inserts a hugepage into the LRU and
>> the zone counter is updated based on the size of the page.
>>
>> If this number is wrong, memory reclaim will not work well.
>>
>> Note: only the head page of a THP hugepage is linked into the LRU.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> mm/memcontrol.c | 10 ++++++++--
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> Index: mmotm-0107/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>> @@ -815,7 +815,10 @@ void mem_cgroup_del_lru_list(struct page
>> * removed from global LRU.
>> */
>> mz = page_cgroup_zoneinfo(pc);
>> - MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>> + if (!PageTransHuge(page))
>> + MEM_CGROUP_ZSTAT(mz, lru) -= 1;
>> + else
>> + MEM_CGROUP_ZSTAT(mz, lru) -= 1 << compound_order(page);
>
> compound_order() returns 0 for !PG_head pages, so that should do the
> right thing without checking PageTransHuge(), right?
You're right. Then I can remove this 'if'. Thank you.
Hmm, I'll check the other places again.
Regards,
-Kame
* [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: KAMEZAWA Hiroyuki @ 2011-01-14 10:15 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, hannes, aarcange, akpm@linux-foundation.org
Now, when THP is enabled, memcg's rmdir() function is broken
because move_account() does not support THP pages.
This will cause an accounting leak or an -EBUSY failure at rmdir().
This patch fixes the issue by supporting move_account() of THP pages.
Accounting information will then be moved to the parent at rmdir().
How to test:
79 mount -t cgroup none /cgroup/memory/ -o memory
80 mkdir /cgroup/A/
81 mkdir /cgroup/memory/A
82 mkdir /cgroup/memory/A/B
83 cgexec -g memory:A/B ./malloc 128 &
84 grep anon /cgroup/memory/A/B/memory.stat
85 grep rss /cgroup/memory/A/B/memory.stat
86 echo 1728 > /cgroup/memory/A/tasks
87 grep rss /cgroup/memory/A/memory.stat
88 rmdir /cgroup/memory/A/B/
89 grep rss /cgroup/memory/A/memory.stat
- Create a 2-level directory hierarchy and exec a task that calls malloc(big chunk).
- Move the task somewhere (to its parent cgroup in the above).
- rmdir /A/B
- Check that memory.stat of /A/B is moved to /A after rmdir, and confirm
  that the RSS/LRU information includes the usage that was charged against /A/B.
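(The './malloc' helper used above is not included in the mail. A minimal
stand-in, assuming it simply allocates the given number of MB of anonymous
memory, touches it so THP can back it, and stays alive so memory.stat can
be inspected, could look like:)

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 128;
	char *buf = malloc(mb << 20);

	if (!buf)
		return 1;
	memset(buf, 0, mb << 20);	/* fault the pages in */
	pause();			/* keep the charge alive */
	return 0;
}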
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2154,6 +2154,10 @@ void mem_cgroup_split_huge_fixup(struct
smp_wmb(); /* see __commit_charge() */
SetPageCgroupUsed(tpc);
VM_BUG_ON(PageCgroupCache(hpc));
+ /*
+ * Note: if dirty ratio etc..are supported,
+ * other flags may need to be copied.
+ */
}
#endif
@@ -2175,8 +2179,11 @@ void mem_cgroup_split_huge_fixup(struct
*/
static void __mem_cgroup_move_account(struct page_cgroup *pc,
- struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
+ struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
+ int charge_size)
{
+ int pagenum = charge_size >> PAGE_SHIFT;
+
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!page_is_cgroup_locked(pc));
@@ -2190,14 +2197,14 @@ static void __mem_cgroup_move_account(st
__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
preempt_enable();
}
- mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
+ mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -pagenum);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
- mem_cgroup_cancel_charge(from, PAGE_SIZE);
+ mem_cgroup_cancel_charge(from, charge_size);
/* caller should have done css_get */
pc->mem_cgroup = to;
- mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
+ mem_cgroup_charge_statistics(to, PageCgroupCache(pc), pagenum);
/*
* We charges against "to" which may not have any tasks. Then, "to"
* can be under rmdir(). But in current implementation, caller of
@@ -2212,7 +2219,8 @@ static void __mem_cgroup_move_account(st
* __mem_cgroup_move_account()
*/
static int mem_cgroup_move_account(struct page_cgroup *pc,
- struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
+ struct mem_cgroup *from, struct mem_cgroup *to,
+ bool uncharge, int charge_size)
{
int ret = -EINVAL;
unsigned long flags;
@@ -2220,7 +2228,7 @@ static int mem_cgroup_move_account(struc
lock_page_cgroup(pc);
if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
move_lock_page_cgroup(pc, &flags);
- __mem_cgroup_move_account(pc, from, to, uncharge);
+ __mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
move_unlock_page_cgroup(pc, &flags);
ret = 0;
}
@@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
struct cgroup *cg = child->css.cgroup;
struct cgroup *pcg = cg->parent;
struct mem_cgroup *parent;
+ int charge_size = PAGE_SIZE;
int ret;
/* Is ROOT ? */
@@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
goto out;
if (isolate_lru_page(page))
goto put;
+ /* The page is isolated from LRU and we have no race with splitting */
+ if (PageTransHuge(page))
+ charge_size = PAGE_SIZE << compound_order(page);
parent = mem_cgroup_from_cont(pcg);
ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
- PAGE_SIZE);
+ charge_size);
if (ret || !parent)
goto put_back;
- ret = mem_cgroup_move_account(pc, child, parent, true);
+ ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
if (ret)
- mem_cgroup_cancel_charge(parent, PAGE_SIZE);
+ mem_cgroup_cancel_charge(parent, charge_size);
put_back:
putback_lru_page(page);
put:
@@ -4850,7 +4862,7 @@ retry:
goto put;
pc = lookup_page_cgroup(page);
if (!mem_cgroup_move_account(pc,
- mc.from, mc.to, false)) {
+ mc.from, mc.to, false, PAGE_SIZE)) {
mc.precharge--;
/* we uncharge from mc.from later. */
mc.moved_charge++;
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: Johannes Weiner @ 2011-01-14 12:21 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
Greg Thelen, aarcange, akpm@linux-foundation.org
On Fri, Jan 14, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
>
> Now, when THP is enabled, memcg's rmdir() function is broken
> because move_account() does not support THP pages.
>
> This will cause an accounting leak or an -EBUSY failure at rmdir().
> This patch fixes the issue by supporting move_account() of THP pages.
>
> Accounting information will then be moved to the parent at rmdir().
>
> How to test:
> 79 mount -t cgroup none /cgroup/memory/ -o memory
> 80 mkdir /cgroup/A/
> 81 mkdir /cgroup/memory/A
> 82 mkdir /cgroup/memory/A/B
> 83 cgexec -g memory:A/B ./malloc 128 &
> 84 grep anon /cgroup/memory/A/B/memory.stat
> 85 grep rss /cgroup/memory/A/B/memory.stat
> 86 echo 1728 > /cgroup/memory/A/tasks
> 87 grep rss /cgroup/memory/A/memory.stat
> 88 rmdir /cgroup/memory/A/B/
> 89 grep rss /cgroup/memory/A/memory.stat
>
> - Create a 2-level directory hierarchy and exec a task that calls malloc(big chunk).
> - Move the task somewhere (to its parent cgroup in the above).
> - rmdir /A/B
> - Check that memory.stat of /A/B is moved to /A after rmdir, and confirm
>   that the RSS/LRU information includes the usage that was charged against /A/B.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -2154,6 +2154,10 @@ void mem_cgroup_split_huge_fixup(struct
> smp_wmb(); /* see __commit_charge() */
> SetPageCgroupUsed(tpc);
> VM_BUG_ON(PageCgroupCache(hpc));
> + /*
> + * Note: if dirty ratio etc..are supported,
> + * other flags may need to be copied.
> + */
That's a good comment, but it should be in the patch that introduces
this function and is a bit unrelated in this one.
> }
> #endif
>
> @@ -2175,8 +2179,11 @@ void mem_cgroup_split_huge_fixup(struct
> */
>
> static void __mem_cgroup_move_account(struct page_cgroup *pc,
> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> + struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
> + int charge_size)
> {
> + int pagenum = charge_size >> PAGE_SHIFT;
nr_pages?
> +
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> VM_BUG_ON(!page_is_cgroup_locked(pc));
> @@ -2190,14 +2197,14 @@ static void __mem_cgroup_move_account(st
> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> preempt_enable();
> }
> - mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -pagenum);
> if (uncharge)
> /* This is not "cancel", but cancel_charge does all we need. */
> - mem_cgroup_cancel_charge(from, PAGE_SIZE);
> + mem_cgroup_cancel_charge(from, charge_size);
>
> /* caller should have done css_get */
> pc->mem_cgroup = to;
> - mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), pagenum);
> /*
> * We charges against "to" which may not have any tasks. Then, "to"
> * can be under rmdir(). But in current implementation, caller of
> @@ -2212,7 +2219,8 @@ static void __mem_cgroup_move_account(st
> * __mem_cgroup_move_account()
> */
> static int mem_cgroup_move_account(struct page_cgroup *pc,
> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> + struct mem_cgroup *from, struct mem_cgroup *to,
> + bool uncharge, int charge_size)
> {
> int ret = -EINVAL;
> unsigned long flags;
> @@ -2220,7 +2228,7 @@ static int mem_cgroup_move_account(struc
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> move_lock_page_cgroup(pc, &flags);
> - __mem_cgroup_move_account(pc, from, to, uncharge);
> + __mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
> move_unlock_page_cgroup(pc, &flags);
> ret = 0;
> }
> @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
> struct cgroup *cg = child->css.cgroup;
> struct cgroup *pcg = cg->parent;
> struct mem_cgroup *parent;
> + int charge_size = PAGE_SIZE;
> int ret;
>
> /* Is ROOT ? */
> @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
> goto out;
> if (isolate_lru_page(page))
> goto put;
> + /* The page is isolated from LRU and we have no race with splitting */
> + if (PageTransHuge(page))
> + charge_size = PAGE_SIZE << compound_order(page);
The same as in the previous patch: compound_order() implicitly
handles order-0 pages and should do the right thing without an extra
check.
The comment is valuable, though!
Nitpicks aside:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: Hiroyuki Kamezawa @ 2011-01-14 12:28 UTC
To: Johannes Weiner
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
balbir@linux.vnet.ibm.com, Greg Thelen, aarcange,
akpm@linux-foundation.org
2011/1/14 Johannes Weiner <hannes@cmpxchg.org>:
> On Fri, Jan 14, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
>>
>> Now, when THP is enabled, memcg's rmdir() function is broken
>> because move_account() does not support THP pages.
>>
>> This will cause an accounting leak or an -EBUSY failure at rmdir().
>> This patch fixes the issue by supporting move_account() of THP pages.
>>
>> Accounting information will then be moved to the parent at rmdir().
>>
>> How to test:
>> 79 mount -t cgroup none /cgroup/memory/ -o memory
>> 80 mkdir /cgroup/A/
>> 81 mkdir /cgroup/memory/A
>> 82 mkdir /cgroup/memory/A/B
>> 83 cgexec -g memory:A/B ./malloc 128 &
>> 84 grep anon /cgroup/memory/A/B/memory.stat
>> 85 grep rss /cgroup/memory/A/B/memory.stat
>> 86 echo 1728 > /cgroup/memory/A/tasks
>> 87 grep rss /cgroup/memory/A/memory.stat
>> 88 rmdir /cgroup/memory/A/B/
>> 89 grep rss /cgroup/memory/A/memory.stat
>>
>> - Create a 2-level directory hierarchy and exec a task that calls malloc(big chunk).
>> - Move the task somewhere (to its parent cgroup in the above).
>> - rmdir /A/B
>> - Check that memory.stat of /A/B is moved to /A after rmdir, and confirm
>>   that the RSS/LRU information includes the usage that was charged against /A/B.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>> mm/memcontrol.c | 32 ++++++++++++++++++++++----------
>> 1 file changed, 22 insertions(+), 10 deletions(-)
>>
>> Index: mmotm-0107/mm/memcontrol.c
>> ===================================================================
>> --- mmotm-0107.orig/mm/memcontrol.c
>> +++ mmotm-0107/mm/memcontrol.c
>> @@ -2154,6 +2154,10 @@ void mem_cgroup_split_huge_fixup(struct
>> smp_wmb(); /* see __commit_charge() */
>> SetPageCgroupUsed(tpc);
>> VM_BUG_ON(PageCgroupCache(hpc));
>> + /*
>> + * Note: if dirty ratio etc..are supported,
>> + * other flags may need to be copied.
>> + */
>
> That's a good comment, but it should be in the patch that introduces
> this function and is a bit unrelated in this one.
>
Ok. I'll remove this. This is an alarm for Greg ;)
>> }
>> #endif
>>
>> @@ -2175,8 +2179,11 @@ void mem_cgroup_split_huge_fixup(struct
>> */
>>
>> static void __mem_cgroup_move_account(struct page_cgroup *pc,
>> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>> + struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
>> + int charge_size)
>> {
>> + int pagenum = charge_size >> PAGE_SHIFT;
>
> nr_pages?
>
Ok. replace pagenum <-> nr_pages.
>> +
>> VM_BUG_ON(from == to);
>> VM_BUG_ON(PageLRU(pc->page));
>> VM_BUG_ON(!page_is_cgroup_locked(pc));
>> @@ -2190,14 +2197,14 @@ static void __mem_cgroup_move_account(st
>> __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> preempt_enable();
>> }
>> - mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -1);
>> + mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -pagenum);
>> if (uncharge)
>> /* This is not "cancel", but cancel_charge does all we need. */
>> - mem_cgroup_cancel_charge(from, PAGE_SIZE);
>> + mem_cgroup_cancel_charge(from, charge_size);
>>
>> /* caller should have done css_get */
>> pc->mem_cgroup = to;
>> - mem_cgroup_charge_statistics(to, PageCgroupCache(pc), 1);
>> + mem_cgroup_charge_statistics(to, PageCgroupCache(pc), pagenum);
>> /*
>> * We charges against "to" which may not have any tasks. Then, "to"
>> * can be under rmdir(). But in current implementation, caller of
>> @@ -2212,7 +2219,8 @@ static void __mem_cgroup_move_account(st
>> * __mem_cgroup_move_account()
>> */
>> static int mem_cgroup_move_account(struct page_cgroup *pc,
>> - struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>> + struct mem_cgroup *from, struct mem_cgroup *to,
>> + bool uncharge, int charge_size)
>> {
>> int ret = -EINVAL;
>> unsigned long flags;
>> @@ -2220,7 +2228,7 @@ static int mem_cgroup_move_account(struc
>> lock_page_cgroup(pc);
>> if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
>> move_lock_page_cgroup(pc, &flags);
>> - __mem_cgroup_move_account(pc, from, to, uncharge);
>> + __mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
>> move_unlock_page_cgroup(pc, &flags);
>> ret = 0;
>> }
>> @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
>> struct cgroup *cg = child->css.cgroup;
>> struct cgroup *pcg = cg->parent;
>> struct mem_cgroup *parent;
>> + int charge_size = PAGE_SIZE;
>> int ret;
>>
>> /* Is ROOT ? */
>> @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
>> goto out;
>> if (isolate_lru_page(page))
>> goto put;
>> + /* The page is isolated from LRU and we have no race with splitting */
>> + if (PageTransHuge(page))
>> + charge_size = PAGE_SIZE << compound_order(page);
>
> The same as in the previous patch: compound_order() implicitly
> handles order-0 pages and should do the right thing without an extra
> check.
>
Sure.
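I.e. simply (a sketch):

	charge_size = PAGE_SIZE << compound_order(page);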
> The comment is valuable, though!
>
> Nitpicks aside:
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thank you for the quick review!
An updated version will be posted next week, after some more testing.
Regards,
-Kame
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: Daisuke Nishimura @ 2011-01-17 0:15 UTC
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, Greg Thelen, hannes, aarcange,
akpm@linux-foundation.org, Daisuke Nishimura
Hi, thank you for your great work!
I've not read this series in detail, but one quick comment on move_parent.
> @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
> struct cgroup *cg = child->css.cgroup;
> struct cgroup *pcg = cg->parent;
> struct mem_cgroup *parent;
> + int charge_size = PAGE_SIZE;
> int ret;
>
> /* Is ROOT ? */
> @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
> goto out;
> if (isolate_lru_page(page))
> goto put;
> + /* The page is isolated from LRU and we have no race with splitting */
> + if (PageTransHuge(page))
> + charge_size = PAGE_SIZE << compound_order(page);
>
> parent = mem_cgroup_from_cont(pcg);
> ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> - PAGE_SIZE);
> + charge_size);
> if (ret || !parent)
> goto put_back;
>
> - ret = mem_cgroup_move_account(pc, child, parent, true);
> + ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
> if (ret)
> - mem_cgroup_cancel_charge(parent, PAGE_SIZE);
> + mem_cgroup_cancel_charge(parent, charge_size);
> put_back:
> putback_lru_page(page);
> put:
I think there is a possibility that the page is split after the "if (PageTransHuge(page))" check.
In RHEL6, this part looks like:
1598 if (PageTransHuge(page))
1599 page_size = PAGE_SIZE << compound_order(page);
1600
1601 ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
1602 page_size);
1603 if (ret || !parent)
1604 return ret;
1605
1606 if (!get_page_unless_zero(page)) {
1607 ret = -EBUSY;
1608 goto uncharge;
1609 }
1610
1611 ret = isolate_lru_page(page);
1612
1613 if (ret)
1614 goto cancel;
1615
1616 compound_lock_irqsave(page, &flags);
1617 ret = mem_cgroup_move_account(pc, child, parent, page_size);
1618 compound_unlock_irqrestore(page, flags);
1619
In fact, I found a res_counter underflow bug around here, and I've already sent
a patch to Red Hat.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
In mem_cgroup_move_parent(), the page can be split by another context after we
check PageTransHuge() and before we take the compound_lock of the page.
This means a race can happen like:
__split_huge_page_refcount()            mem_cgroup_move_parent()
---------------------------------------------------------------------------
                                        if (PageTransHuge())
                                          -> true
                                          -> set "page_size" to huge
                                             page size.
                                        __mem_cgroup_try_charge()
                                          -> charge "page_size" to the
                                             parent.
compound_lock()
mem_cgroup_split_hugepage_commit()
  -> commit all the tail pages to the
     "current" (i.e. child) cgroup.
     iow, pc->mem_cgroup of tail pages
     point to the child.
ClearPageCompound()
compound_unlock()
                                        compound_lock()
                                        mem_cgroup_move_account()
                                          -> make pc->mem_cgroup of the
                                             head page point to the parent.
                                          -> uncharge "page_size" from
                                             the child.
                                        compound_unlock()
This can cause at least 2 problems.
1. Tail pages are linked to the LRU of the child, even though their usages
   (res_counter) have already been uncharged from the child. This causes a
   res_counter underflow when removing the child directory.
2. The usage of the parent is increased by the huge page size when moving
   the charge of the head page, but it will be decreased only by the normal
   page size when the head page is uncharged later, because the page is not
   PageTransHuge() anymore. This means the parent doesn't have enough pages
   on its LRU to decrease the usage to 0, so it cannot be rmdir'ed.
This patch fixes the problem by re-checking PageTransHuge() under the
compound_lock.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
diff -uprN linux-2.6.32.x86_64.org/mm/memcontrol.c linux-2.6.32.x86_64/mm/memcontrol.c
--- linux-2.6.32.x86_64.org/mm/memcontrol.c 2010-07-15 16:44:57.000000000 +0900
+++ linux-2.6.32.x86_64/mm/memcontrol.c 2010-07-15 17:34:12.000000000 +0900
@@ -1608,6 +1608,17 @@ static int mem_cgroup_move_parent(struct
goto cancel;
compound_lock_irqsave(page, &flags);
+ /* re-check under compound_lock because the page might be split */
+ if (unlikely(page_size != PAGE_SIZE && !PageTransHuge(page))) {
+ unsigned long extra = page_size - PAGE_SIZE;
+ /* uncharge extra charges from parent */
+ if (!mem_cgroup_is_root(parent)) {
+ res_counter_uncharge(&parent->res, extra);
+ if (do_swap_account)
+ res_counter_uncharge(&parent->memsw, extra);
+ }
+ page_size = PAGE_SIZE;
+ }
ret = mem_cgroup_move_account(pc, child, parent, page_size);
compound_unlock_irqrestore(page, flags);
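Mapped onto the mmotm-based version in this series, the same re-check might
look roughly like the following (a sketch that assumes mem_cgroup_move_parent()
grows a compound_lock section in the next version; the actual follow-up may
differ):

	compound_lock_irqsave(page, &flags);
	/* re-check: the page may have been split since PageTransHuge() */
	if (unlikely(charge_size != PAGE_SIZE && !PageTransHuge(page))) {
		/* give back the extra charge taken for the tail pages */
		mem_cgroup_cancel_charge(parent, charge_size - PAGE_SIZE);
		charge_size = PAGE_SIZE;
	}
	ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
	compound_unlock_irqrestore(page, flags);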
* Re: [PATCH 4/4] [BUGFIX] fix account leak at force_empty, rmdir with THP
From: KAMEZAWA Hiroyuki @ 2011-01-17 0:25 UTC
To: Daisuke Nishimura
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
balbir@linux.vnet.ibm.com, Greg Thelen, hannes, aarcange,
akpm@linux-foundation.org
On Mon, 17 Jan 2011 09:15:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> Hi, thank you for your great work!
>
> I've not read this series in detail, but one quick comment on move_parent.
>
> > @@ -2245,6 +2253,7 @@ static int mem_cgroup_move_parent(struct
> > struct cgroup *cg = child->css.cgroup;
> > struct cgroup *pcg = cg->parent;
> > struct mem_cgroup *parent;
> > + int charge_size = PAGE_SIZE;
> > int ret;
> >
> > /* Is ROOT ? */
> > @@ -2256,16 +2265,19 @@ static int mem_cgroup_move_parent(struct
> > goto out;
> > if (isolate_lru_page(page))
> > goto put;
> > + /* The page is isolated from LRU and we have no race with splitting */
> > + if (PageTransHuge(page))
> > + charge_size = PAGE_SIZE << compound_order(page);
> >
> > parent = mem_cgroup_from_cont(pcg);
> > ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> > - PAGE_SIZE);
> > + charge_size);
> > if (ret || !parent)
> > goto put_back;
> >
> > - ret = mem_cgroup_move_account(pc, child, parent, true);
> > + ret = mem_cgroup_move_account(pc, child, parent, true, charge_size);
> > if (ret)
> > - mem_cgroup_cancel_charge(parent, PAGE_SIZE);
> > + mem_cgroup_cancel_charge(parent, charge_size);
> > put_back:
> > putback_lru_page(page);
> > put:
> I think there is a possibility that the page is split after the "if (PageTransHuge(page))" check.
>
> In RHEL6, this part looks like:
>
> 1598 if (PageTransHuge(page))
> 1599 page_size = PAGE_SIZE << compound_order(page);
> 1600
> 1601 ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, page,
> 1602 page_size);
> 1603 if (ret || !parent)
> 1604 return ret;
> 1605
> 1606 if (!get_page_unless_zero(page)) {
> 1607 ret = -EBUSY;
> 1608 goto uncharge;
> 1609 }
> 1610
> 1611 ret = isolate_lru_page(page);
> 1612
> 1613 if (ret)
> 1614 goto cancel;
> 1615
> 1616 compound_lock_irqsave(page, &flags);
> 1617 ret = mem_cgroup_move_account(pc, child, parent, page_size);
> 1618 compound_unlock_irqrestore(page, flags);
> 1619
>
> In fact, I found a res_counter underflow bug around here, and I've already sent
> a patch to Red Hat.
>
Okay, I'll take care of that in the next version.
Thanks,
-Kame