* [RFC][PATCH] memcg: avoid THP split in task migration
@ 2012-02-28 21:12 Naoya Horiguchi
From: Naoya Horiguchi @ 2012-02-28 21:12 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Andrea Arcangeli, KAMEZAWA Hiroyuki,
Daisuke Nishimura, Hillf Danton, linux-kernel, Naoya Horiguchi
Currently we can't do task migration among memory cgroups without splitting
THP, which means processes heavily using THP experience large overhead in
task migration. This patch introduces the code for moving the charge of THP
and makes THP more valuable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <dhillf@gmail.com>
---
mm/memcontrol.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 70 insertions(+), 6 deletions(-)
diff --git linux-next-20120228.orig/mm/memcontrol.c linux-next-20120228/mm/memcontrol.c
index c83aeb5..e97c041 100644
--- linux-next-20120228.orig/mm/memcontrol.c
+++ linux-next-20120228/mm/memcontrol.c
@@ -5211,6 +5211,42 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma,
 	return ret;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * We don't consider swapping or file mapped pages because THP does not
+ * support them for now.
+ */
+static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t pmd, union mc_target *target)
+{
+	struct page *page = NULL;
+	struct page_cgroup *pc;
+	int ret = 0;
+
+	if (pmd_present(pmd))
+		page = pmd_page(pmd);
+	if (!page)
+		return 0;
+	VM_BUG_ON(!PageHead(page));
+	get_page(page);
+	pc = lookup_page_cgroup(page);
+	if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+		ret = MC_TARGET_PAGE;
+		if (target)
+			target->page = page;
+	}
+	if (!ret || !target)
+		put_page(page);
+	return ret;
+}
+#else
+static inline int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t pmd, union mc_target *target)
+{
+	return 0;
+}
+#endif
+
 static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 					unsigned long addr, unsigned long end,
 					struct mm_walk *walk)
@@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	split_huge_page_pmd(walk->mm, pmd);
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
+			mc.precharge += HPAGE_PMD_NR;
+		spin_unlock(&walk->mm->page_table_lock);
+		cond_resched();
+		return 0;
+	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
@@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int type;
+	union mc_target target;
+	struct page *page;
+	struct page_cgroup *pc;
+
+	if (pmd_trans_huge_lock(pmd, vma) == 1) {
+		if (!mc.precharge)
+			return 0;
+		type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
+		if (type == MC_TARGET_PAGE) {
+			page = target.page;
+			if (!isolate_lru_page(page)) {
+				pc = lookup_page_cgroup(page);
+				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
+							     pc, mc.from, mc.to,
+							     false)) {
+					mc.precharge -= HPAGE_PMD_NR;
+					mc.moved_charge += HPAGE_PMD_NR;
+				}
+				putback_lru_page(page);
+			}
+			put_page(page);
+		}
+		spin_unlock(&walk->mm->page_table_lock);
+		cond_resched();
+		return 0;
+	}
 
-	split_huge_page_pmd(walk->mm, pmd);
 retry:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
-		union mc_target target;
-		int type;
-		struct page *page;
-		struct page_cgroup *pc;
 		swp_entry_t ent;
 
 		if (!mc.precharge)
--
1.7.7.6
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: KAMEZAWA Hiroyuki @ 2012-02-29 0:28 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Andrea Arcangeli, Daisuke Nishimura,
Hillf Danton, linux-kernel
On Tue, 28 Feb 2012 16:12:32 -0500
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> Currently we can't do task migration among memory cgroups without THP split,
> which means processes heavily using THP experience large overhead in task
> migration. This patch introduce the code for moving charge of THP and makes
> THP more valuable.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Hillf Danton <dhillf@gmail.com>
Thank you!
A comment below.
> ---
> mm/memcontrol.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 files changed, 70 insertions(+), 6 deletions(-)
>
> diff --git linux-next-20120228.orig/mm/memcontrol.c linux-next-20120228/mm/memcontrol.c
> index c83aeb5..e97c041 100644
> --- linux-next-20120228.orig/mm/memcontrol.c
> +++ linux-next-20120228/mm/memcontrol.c
> @@ -5211,6 +5211,42 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma,
> return ret;
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * We don't consider swapping or file mapped pages because THP does not
> + * support them for now.
> + */
> +static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t pmd, union mc_target *target)
> +{
> + struct page *page = NULL;
> + struct page_cgroup *pc;
> + int ret = 0;
> +
> + if (pmd_present(pmd))
> + page = pmd_page(pmd);
> + if (!page)
> + return 0;
> + VM_BUG_ON(!PageHead(page));
> + get_page(page);
> + pc = lookup_page_cgroup(page);
> + if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> + ret = MC_TARGET_PAGE;
> + if (target)
> + target->page = page;
> + }
> + if (!ret || !target)
> + put_page(page);
> + return ret;
> +}
> +#else
> +static inline int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t pmd, union mc_target *target)
> +{
> + return 0;
> +}
> +#endif
> +
> static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> unsigned long addr, unsigned long end,
> struct mm_walk *walk)
> @@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> pte_t *pte;
> spinlock_t *ptl;
>
> - split_huge_page_pmd(walk->mm, pmd);
> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> + if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
> + mc.precharge += HPAGE_PMD_NR;
> + spin_unlock(&walk->mm->page_table_lock);
> + cond_resched();
> + return 0;
> + }
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE)
> @@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> struct vm_area_struct *vma = walk->private;
> pte_t *pte;
> spinlock_t *ptl;
> + int type;
> + union mc_target target;
> + struct page *page;
> + struct page_cgroup *pc;
> +
> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> + if (!mc.precharge)
> + return 0;
> + type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
> + if (type == MC_TARGET_PAGE) {
> + page = target.page;
> + if (!isolate_lru_page(page)) {
> + pc = lookup_page_cgroup(page);
Here is a difficult point. Please see mem_cgroup_split_huge_fixup(). It
updates the memcg status of the split pages under lru_lock and compound_lock,
but not under mm->page_table_lock.
Looking into split_huge_page()
split_huge_page()                     # take anon_vma lock
  __split_huge_page()
    __split_huge_page_refcount()      # take lru_lock, compound_lock.
      mem_cgroup_split_huge_fixup()
    __split_huge_page_map()           # take page table lock.
I'm not fully sure, but IIUC pmd_trans_huge_lock() just guarantees that the
huge page "map" never goes away. To prevent the page splitting itself,
compound_lock() is required, I think.
So, the lock here should be
	page = target.page;
	isolate_lru_page(page);
	flags = compound_lock_irqsave(page);
> + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> + pc, mc.from, mc.to,
> + false)) {
> + mc.precharge -= HPAGE_PMD_NR;
> + mc.moved_charge += HPAGE_PMD_NR;
> + }
Here, PageTransHuge() is checked in mem_cgroup_move_account(), and if
!PageTransHuge() the function returns -EBUSY.
I'm not sure, but it's probably not worth retrying (add a FIXME comment for
later!).
	compound_unlock_irqrestore(page, flags);
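Putting it together, the sequence I have in mind would look roughly like
this (a rough sketch reusing the names already in the patch, not tested):

	if (type == MC_TARGET_PAGE) {
		unsigned long flags;

		page = target.page;
		if (!isolate_lru_page(page)) {
			pc = lookup_page_cgroup(page);
			/* block split_huge_page_refcount() while moving the charge */
			flags = compound_lock_irqsave(page);
			if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, pc,
						     mc.from, mc.to, false)) {
				mc.precharge -= HPAGE_PMD_NR;
				mc.moved_charge += HPAGE_PMD_NR;
			}
			compound_unlock_irqrestore(page, flags);
			putback_lru_page(page);
		}
		put_page(page);
	}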
I may miss something, please check carefully, again.
Thanks,
-Kame
> + putback_lru_page(page);
> + }
> + put_page(page);
> + }
> + spin_unlock(&walk->mm->page_table_lock);
> + cond_resched();
> + return 0;
> + }
>
> - split_huge_page_pmd(walk->mm, pmd);
> retry:
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; addr += PAGE_SIZE) {
> pte_t ptent = *(pte++);
> - union mc_target target;
> - int type;
> - struct page *page;
> - struct page_cgroup *pc;
> swp_entry_t ent;
>
> if (!mc.precharge)
> --
> 1.7.7.6
>
>
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: Naoya Horiguchi @ 2012-02-29 9:50 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Andrea Arcangeli,
Daisuke Nishimura, Hillf Danton, linux-kernel
Hi,
On Wed, Feb 29, 2012 at 09:28:59AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 28 Feb 2012 16:12:32 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
...
> > @@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> > struct vm_area_struct *vma = walk->private;
> > pte_t *pte;
> > spinlock_t *ptl;
> > + int type;
> > + union mc_target target;
> > + struct page *page;
> > + struct page_cgroup *pc;
> > +
> > + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> > + if (!mc.precharge)
> > + return 0;
> > + type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
> > + if (type == MC_TARGET_PAGE) {
> > + page = target.page;
> > + if (!isolate_lru_page(page)) {
> > + pc = lookup_page_cgroup(page);
>
> Here is a diffuclut point. Please see mem_cgroup_split_huge_fixup(). It splits
> updates memcg's status of splitted pages under lru_lock and compound_lock
> but not under mm->page_table_lock.
OK, I will rethink the locking.
mem_cgroup_move_account() also states that the caller should hold compound_lock(),
so I should follow that.
> Looking into split_huge_page()
>
> split_huge_page() # take anon_vma lock
> __split_huge_page()
> __split_huge_page_refcount() # take lru_lock, compound_lock.
> mem_cgroup_split_huge_fixup()
> __split_huge_page_map() # take page table lock.
I'm afraid this callchain is not correct.
Page table lock seems to be taken before we enter the main split work.
split_huge_page
  take anon_vma lock
  __split_huge_page
    __split_huge_page_splitting
      lock page_table_lock      <--- *1
      page_check_address_pmd
      unlock page_table_lock
    __split_huge_page_refcount
      lock lru_lock
      compound_lock
      mem_cgroup_split_huge_fixup
      compound_unlock
      unlock lru_lock
    __split_huge_page_map
      lock page_table_lock
      ... some work
      unlock page_table_lock
  unlock anon_vma lock
> I'm not fully sure but IIUC, pmd_trans_huge_lock() just guarantees a huge page "map"
> never goes out. To avoid page splitting itself, compound_lock() is required, I think.
>
> So, the lock here should be
>
> page = target.page;
> isolate_lru_page(page);
> flags = compound_lock_irqsave(page);
I think the race between task migration and thp split does not happen,
for two reasons:
- when we enter the if-block, there is no concurrent thp splitting
  (note that pmd_trans_huge_lock() returns 1 only if the thp is not
  under splitting,)
- if another thread runs into split_huge_page() just after we entered
  this if-block, that thread waits for the page table lock to be released
  in __split_huge_page_splitting() (shown as *1 above.) At that point,
  the thp has not been split yet.
But I think it's OK to add compound_lock to meet the requirement of
mem_cgroup_move_account().
>
>
> > + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> > + pc, mc.from, mc.to,
> > + false)) {
> > + mc.precharge -= HPAGE_PMD_NR;
> > + mc.moved_charge += HPAGE_PMD_NR;
> > + }
>
> Here is PageTransHuge() is checked in mem_cgroup_move_account() and if !PageTransHuge(),
> the function returns -EBUSY.
If the above explanation is correct, PageTransHuge() should always be
true here, so BUG_ON(!PageTransHuge()) looks suitable to me.
> I'm not sure but....it's not worth to retry (but add a comment as FIXME later!)
I agree.
For regular size pages, retrying means that we ran out of mc.precharge
before addr reached end.
But mem_cgroup_move_charge_pte_range() handles a whole pmd in a single call,
and for thp addr reaches end after only one call of mem_cgroup_move_account().
So it makes no sense to retry.
> compound_unlock_irqrestore(page);
>
> I may miss something, please check carefully, again.
OK.
Thanks,
Naoya
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: Hillf Danton @ 2012-02-29 14:40 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Andrea Arcangeli, Daisuke Nishimura,
linux-kernel, KAMEZAWA Hiroyuki
On Wed, Feb 29, 2012 at 8:28 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Tue, 28 Feb 2012 16:12:32 -0500
> Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
>
>> Currently we can't do task migration among memory cgroups without THP split,
>> which means processes heavily using THP experience large overhead in task
>> migration. This patch introduce the code for moving charge of THP and makes
>> THP more valuable.
>>
>> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Hillf Danton <dhillf@gmail.com>
>
>
> Thank you!
++hd;
>
> A comment below.
>
>> ---
>> mm/memcontrol.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>> 1 files changed, 70 insertions(+), 6 deletions(-)
>>
>> diff --git linux-next-20120228.orig/mm/memcontrol.c linux-next-20120228/mm/memcontrol.c
>> index c83aeb5..e97c041 100644
>> --- linux-next-20120228.orig/mm/memcontrol.c
>> +++ linux-next-20120228/mm/memcontrol.c
>> @@ -5211,6 +5211,42 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma,
>> return ret;
>> }
>>
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +/*
>> + * We don't consider swapping or file mapped pages because THP does not
>> + * support them for now.
>> + */
>> +static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
static int is_target_thp_for_mc(struct vm_area_struct *vma,
or
static int is_target_pmd_for_mc(struct vm_area_struct *vma,
sounds better?
>> + unsigned long addr, pmd_t pmd, union mc_target *target)
>> +{
>> + struct page *page = NULL;
>> + struct page_cgroup *pc;
>> + int ret = 0;
>> +
>> + if (pmd_present(pmd))
>> + page = pmd_page(pmd);
>> + if (!page)
>> + return 0;
>> + VM_BUG_ON(!PageHead(page));
With a huge and stable pmd, the above operations on page could be
compacted into one line?
page = pmd_page(pmd);
>> + get_page(page);
>> + pc = lookup_page_cgroup(page);
>> + if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
>> + ret = MC_TARGET_PAGE;
>> + if (target)
It looks like get_page() is needed only after checking target?
>> + target->page = page;
>> + }
>> + if (!ret || !target)
>> + put_page(page);
>> + return ret;
>> +}
>> +#else
>> +static inline int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
>> + unsigned long addr, pmd_t pmd, union mc_target *target)
>> +{
>> + return 0;
>> +}
>> +#endif
>> +
>> static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>> unsigned long addr, unsigned long end,
>> struct mm_walk *walk)
>> @@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>> pte_t *pte;
>> spinlock_t *ptl;
>>
>> - split_huge_page_pmd(walk->mm, pmd);
>> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
>> + if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
looks clearer
>> + mc.precharge += HPAGE_PMD_NR;
As HPAGE_PMD_NR is used directly, I guess the compiler complains when THP
is disabled. If so, please clean up huge_mm.h with s/BUG()/BUILD_BUG()/ and
with both HPAGE_PMD_ORDER and HPAGE_PMD_NR also defined,
to make it a bit easier for others.
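I mean something roughly like the following in include/linux/huge_mm.h (just
a sketch; the exact surrounding context may differ): define the two macros
outside the #ifdef so !THP builds can see them, and turn the !THP stubs into
build-time errors that only trigger if the dead code is not eliminated.

/* visible regardless of CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })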
>> + spin_unlock(&walk->mm->page_table_lock);
spin_unlock(&vma->mm->page_table_lock);
looks clearer
>> + cond_resched();
>> + return 0;
>> + }
>>
>> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> for (; addr != end; pte++, addr += PAGE_SIZE)
>> @@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>> struct vm_area_struct *vma = walk->private;
>> pte_t *pte;
>> spinlock_t *ptl;
>> + int type;
>> + union mc_target target;
>> + struct page *page;
>> + struct page_cgroup *pc;
>> +
>> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
>> + if (!mc.precharge)
>> + return 0;
Bang, return without page table lock released.
>> + type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
>> + if (type == MC_TARGET_PAGE) {
>> + page = target.page;
>> + if (!isolate_lru_page(page)) {
>> + pc = lookup_page_cgroup(page);
>
> Here is a diffuclut point. Please see mem_cgroup_split_huge_fixup(). It splits
Hard and hard point IMO.
> updates memcg's status of splitted pages under lru_lock and compound_lock
> but not under mm->page_table_lock.
>
> Looking into split_huge_page()
>
> split_huge_page() # take anon_vma lock
> __split_huge_page()
> __split_huge_page_refcount() # take lru_lock, compound_lock.
> mem_cgroup_split_huge_fixup()
> __split_huge_page_map() # take page table lock.
>
[copied from Naoya-san's reply]
> I'm afraid this callchain is not correct.
s/correct/complete/
> Page table lock seems to be taken before we enter the main split work.
>
> split_huge_page
> take anon_vma lock
> __split_huge_page
> __split_huge_page_splitting
> lock page_table_lock <--- *1
> page_check_address_pmd
> unlock page_table_lock
Yeah, splitters are blocked.
Plus, judging from the *ugly*-documented lock function below (another
cleanup needed), the embedded mmap_sem check also blocks splitters.
That said, could we simply wait and see the results of test cases?
-hd
/* mmap_sem must be held on entry */
static inline int pmd_trans_huge_lock(pmd_t *pmd,
		struct vm_area_struct *vma)
{
	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
	if (pmd_trans_huge(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);
	else
		return 0;
}
> __split_huge_page_refcount
> lock lru_lock
> compound_lock
> mem_cgroup_split_huge_fixup
> compound_unlock
> unlock lru_lock
> __split_huge_page_map
> lock page_table_lock
> ... some work
> unlock page_table_lock
> unlock anon_vma lock
>
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: Naoya Horiguchi @ 2012-02-29 19:39 UTC (permalink / raw)
To: Hillf Danton
Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Andrea Arcangeli,
Daisuke Nishimura, linux-kernel, KAMEZAWA Hiroyuki
On Wed, Feb 29, 2012 at 10:40:09PM +0800, Hillf Danton wrote:
> On Wed, Feb 29, 2012 at 8:28 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Tue, 28 Feb 2012 16:12:32 -0500
> > Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> >
> >> Currently we can't do task migration among memory cgroups without THP split,
> >> which means processes heavily using THP experience large overhead in task
> >> migration. This patch introduce the code for moving charge of THP and makes
> >> THP more valuable.
> >>
> >> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >> Cc: Hillf Danton <dhillf@gmail.com>
> >
> >
> > Thank you!
>
> ++hd;
Thank you, too.
> >
> > A comment below.
> >
> >> ---
> >> mm/memcontrol.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> >> 1 files changed, 70 insertions(+), 6 deletions(-)
> >>
> >> diff --git linux-next-20120228.orig/mm/memcontrol.c linux-next-20120228/mm/memcontrol.c
> >> index c83aeb5..e97c041 100644
> >> --- linux-next-20120228.orig/mm/memcontrol.c
> >> +++ linux-next-20120228/mm/memcontrol.c
> >> @@ -5211,6 +5211,42 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma,
> >> return ret;
> >> }
> >>
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +/*
> >> + * We don't consider swapping or file mapped pages because THP does not
> >> + * support them for now.
> >> + */
> >> +static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
>
> static int is_target_thp_for_mc(struct vm_area_struct *vma,
> or
> static int is_target_pmd_for_mc(struct vm_area_struct *vma,
> sounds better?
OK, I'll take the former.
It's better than the original name because it avoids a potential
name conflict when we implement a hugetlbfs variant in the future.
> >> + unsigned long addr, pmd_t pmd, union mc_target *target)
> >> +{
> >> + struct page *page = NULL;
> >> + struct page_cgroup *pc;
> >> + int ret = 0;
> >> +
> >> + if (pmd_present(pmd))
> >> + page = pmd_page(pmd);
> >> + if (!page)
> >> + return 0;
> >> + VM_BUG_ON(!PageHead(page));
>
> With a huge and stable pmd, the above operations on page could be
> compacted into one line?
>
> page = pmd_page(pmd);
It's possible under the assumption that a thp pmd always has the present
bit set and points to the head page. I guess this assumption holds at
least for now, but I'm not sure. Is it true without any exception?
I think that if we miss something and the assumption turns out to be false,
VM_BUG_ON() can be helpful for figuring out what happened in a future bugfix.
But anyway, we should add a comment about the assumption if we do this
optimization.
> >> + get_page(page);
> >> + pc = lookup_page_cgroup(page);
> >> + if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> >> + ret = MC_TARGET_PAGE;
> >> + if (target)
>
> After checking target, looks only get_page() needed?
Do you mean something like this (combined with the above optimization):
static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
		unsigned long addr, pmd_t pmd, union mc_target *target)
{
	struct page *page = NULL;
	struct page_cgroup *pc;
	int ret = 0;

	/*
	 * Here we skip the pmd_present() check and the NULL page check,
	 * assuming that a thp pmd always has the present bit set and
	 * points to the head page.
	 */
	page = pmd_page(pmd);
	VM_BUG_ON(!page || !PageHead(page));
	pc = lookup_page_cgroup(page);
	if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
		ret = MC_TARGET_PAGE;
		if (target) {
			get_page(page);
			target->page = page;
		}
	}
	return ret;
}
?
I think this is possible because lookup_page_cgroup() does not depend
on refcount.
> >> + target->page = page;
> >> + }
> >> + if (!ret || !target)
> >> + put_page(page);
> >> + return ret;
> >> +}
> >> +#else
> >> +static inline int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
> >> + unsigned long addr, pmd_t pmd, union mc_target *target)
> >> +{
> >> + return 0;
> >> +}
> >> +#endif
> >> +
> >> static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> >> unsigned long addr, unsigned long end,
> >> struct mm_walk *walk)
> >> @@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> >> pte_t *pte;
> >> spinlock_t *ptl;
> >>
> >> - split_huge_page_pmd(walk->mm, pmd);
> >> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> >> + if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
>
> if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
> looks clearer
I agree.
> >> + mc.precharge += HPAGE_PMD_NR;
>
> As HPAGE_PMD_NR is directly used, compiler beeps if THP disabled, I guess.
Yes, HPAGE_PMD_NR needs to be defined for !CONFIG_TRANSPARENT_HUGEPAGE.
> If yes, please cleanup huge_mm.h with s/BUG()/BUILD_BUG()/ and with
> both HPAGE_PMD_ORDER and HPAGE_PMD_NR also defined,
> to easy others a bit.
Thanks, I applied it.
> >> + spin_unlock(&walk->mm->page_table_lock);
>
> spin_unlock(&vma->mm->page_table_lock);
> looks clearer
OK.
> >> + cond_resched();
> >> + return 0;
> >> + }
> >>
> >> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >> for (; addr != end; pte++, addr += PAGE_SIZE)
> >> @@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >> struct vm_area_struct *vma = walk->private;
> >> pte_t *pte;
> >> spinlock_t *ptl;
> >> + int type;
> >> + union mc_target target;
> >> + struct page *page;
> >> + struct page_cgroup *pc;
> >> +
> >> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> >> + if (!mc.precharge)
> >> + return 0;
>
> Bang, return without page table lock released.
Sorry, I missed it.
> >> + type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
> >> + if (type == MC_TARGET_PAGE) {
> >> + page = target.page;
> >> + if (!isolate_lru_page(page)) {
> >> + pc = lookup_page_cgroup(page);
> >
> > Here is a diffuclut point. Please see mem_cgroup_split_huge_fixup(). It splits
>
> Hard and hard point IMO.
Yes.
> > updates memcg's status of splitted pages under lru_lock and compound_lock
> > but not under mm->page_table_lock.
> >
> > Looking into split_huge_page()
> >
> > split_huge_page() # take anon_vma lock
> > __split_huge_page()
> > __split_huge_page_refcount() # take lru_lock, compound_lock.
> > mem_cgroup_split_huge_fixup()
> > __split_huge_page_map() # take page table lock.
> >
> [copied from Naoya-san's reply]
>
> > I'm afraid this callchain is not correct.
>
> s/correct/complete/
>
> > Page table lock seems to be taken before we enter the main split work.
> >
> > split_huge_page
> > take anon_vma lock
> > __split_huge_page
> > __split_huge_page_splitting
> > lock page_table_lock <--- *1
> > page_check_address_pmd
> > unlock page_table_lock
>
> Yeah, splitters are blocked.
> Plus from the *ugly* documented lock function(another
> cleanup needed), the embedded mmap_sem also blocks splitters.
Although I haven't checked all callers of split_huge_page() thoroughly,
in my quick check most callers hold mmap_sem before kicking off the split.
And if all callers hold mmap_sem, we can expect mmap_sem to avoid the race.
I'll take some time to examine the mmap_sem dependency of all callers in
detail.
As for the documentation cleanup, I am not sure what it should cover.
I'm sorry if I can't keep up with you, but are you suggesting that
we should document not only the locking rule around pmd_trans_huge_lock(),
but also update the whole set of locking rules in the mm subsystem, like
those described in Documentation/vm/locking, or something else?
> That said, could we simply wait and see results of test cases?
OK.
Thank you for your valuable reviews.
Naoya
> -hd
>
> /* mmap_sem must be held on entry */
> static inline int pmd_trans_huge_lock(pmd_t *pmd,
> struct vm_area_struct *vma)
> {
> VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
> if (pmd_trans_huge(*pmd))
> return __pmd_trans_huge_lock(pmd, vma);
> else
> return 0;
> }
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: Andrea Arcangeli @ 2012-03-01 2:33 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, KAMEZAWA Hiroyuki, Daisuke Nishimura,
Hillf Danton, linux-kernel
Hi Naoya,
On Tue, Feb 28, 2012 at 04:12:32PM -0500, Naoya Horiguchi wrote:
> Currently we can't do task migration among memory cgroups without THP split,
> which means processes heavily using THP experience large overhead in task
> migration. This patch introduce the code for moving charge of THP and makes
> THP more valuable.
Nice.
> diff --git linux-next-20120228.orig/mm/memcontrol.c linux-next-20120228/mm/memcontrol.c
> index c83aeb5..e97c041 100644
> --- linux-next-20120228.orig/mm/memcontrol.c
> +++ linux-next-20120228/mm/memcontrol.c
> @@ -5211,6 +5211,42 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma,
> return ret;
> }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * We don't consider swapping or file mapped pages because THP does not
> + * support them for now.
> + */
> +static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
> + unsigned long addr, pmd_t pmd, union mc_target *target)
> +{
> + struct page *page = NULL;
> + struct page_cgroup *pc;
> + int ret = 0;
> +
> + if (pmd_present(pmd))
> + page = pmd_page(pmd);
> + if (!page)
> + return 0;
It can't be present and null at the same time.
No need to check pmd_present if you already checked pmd_trans_huge. In
fact checking pmd_present is a bug: for a little while the pmd won't be
present if it's marked as splitting. (That short clearing of pmd_present
during pmd splitting is to deal with a vendor CPU erratum without
having to flush the smp TLB twice.)
Following Kame's suggestion is correct, an unconditional pmd_page is
correct here:
page = pmd_page(pmd);
We might actually decide to change pmd_present to return true if
pmd_trans_splitting is set to avoid the risk of using an erratic
pmd_present on a pmd_trans_huge pmd, but it's not really necessary if
you never check pmd_present when a pmd is (or can be) a
pmd_trans_huge.
The safe check for pmd is only pmd_none, never pmd_present (as in
__pte_alloc/pte_alloc_map/...).
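Concretely, the opening of the helper could then become just this (a sketch
using the names in your patch; whether to keep the VM_BUG_ON is up to you):

	/* a trans-huge pmd is always present and maps a head page */
	page = pmd_page(pmd);
	VM_BUG_ON(!PageHead(page));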
> + VM_BUG_ON(!PageHead(page));
> + get_page(page);
The other review mentioned we can do get_page only when it succeeds, but I
think we can drop the whole get_page and simplify it further; see the
end.
> @@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> pte_t *pte;
> spinlock_t *ptl;
>
> - split_huge_page_pmd(walk->mm, pmd);
> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> + if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
> + mc.precharge += HPAGE_PMD_NR;
Your use of HPAGE_PMD_NR looks fine; that path will be eliminated at
build time if THP is off. This is the nice way to write code that is
already optimal for THP=off without adding special cases or #ifdefs.
The other review suggests turning HPAGE_PMD_NR into a BUILD_BUG; that
sounds like a good idea too, but in this (correct) usage of HPAGE_PMD_NR it
wouldn't make a difference, because the whole branch is correctly eliminated
at build time. In short, changing it to BUILD_BUG will simply make sure
the whole pmd_trans_huge_lock == 1 branch is eliminated at build
time. It looks like a good change too, but it's orthogonal, so I'd leave it
for a separate patch.
> + spin_unlock(&walk->mm->page_table_lock);
Agree with other review, vma looks cleaner.
> + cond_resched();
> + return 0;
> + }
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE)
> @@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> struct vm_area_struct *vma = walk->private;
> pte_t *pte;
> spinlock_t *ptl;
> + int type;
> + union mc_target target;
> + struct page *page;
> + struct page_cgroup *pc;
> +
> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> + if (!mc.precharge)
> + return 0;
Agree with Hillf.
> + type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
> + if (type == MC_TARGET_PAGE) {
> + page = target.page;
> + if (!isolate_lru_page(page)) {
> + pc = lookup_page_cgroup(page);
> + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> + pc, mc.from, mc.to,
> + false)) {
> + mc.precharge -= HPAGE_PMD_NR;
> + mc.moved_charge += HPAGE_PMD_NR;
> + }
Like you mentioned, a race with split_huge_page_refcount (and hence
mem_cgroup_split_huge_fixup) is not possible because of
pmd_trans_huge_lock succeeding.
However the mmap_sem checked by pmd_trans_huge_lock is there just
because we deal with pmds and so pagetables (and we aren't doing a
lockless lookup like gup_fast). But it's not true that a concurrent
split_huge_page would be prevented by the mmap_sem. The swapout
path will still split hugepages under you even if you hold the
mmap_sem (even in write mode).
The mmap_sem (either read or write) only prevents a concurrent
collapsing/creation of hugepages (but that's irrelevant here). It
won't stop split_huge_page.
So - back to our issue - you're safe against split_huge_page not
running here thanks to the pmd_trans_huge_lock.
There's one tricky locking bit here, that is isolate_lru_page, that
takes the lru_lock.
So the lock order is the page_table_lock first and the lru_lock
second, and so there must not be another place that takes the lru_lock
first and the page_table_lock second. In general it's a good idea to
exercise the locking code with lockdep (prove locking) enabled, just in case.
> + putback_lru_page(page);
> + }
> + put_page(page);
I wonder if you need a get_page at all in is_target_huge_pmd_for_mc if
you drop the above put_page instead. How can this page go away from
under us, if we've been holding the page_table_lock the whole time?
You can probably drop both get_page above and put_page above.
> + }
> + spin_unlock(&walk->mm->page_table_lock);
> + cond_resched();
> + return 0;
> + }
>
> - split_huge_page_pmd(walk->mm, pmd);
> retry:
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; addr += PAGE_SIZE) {
> pte_t ptent = *(pte++);
> - union mc_target target;
> - int type;
> - struct page *page;
> - struct page_cgroup *pc;
> swp_entry_t ent;
>
> if (!mc.precharge)
I read the other two great reviews done so far in parallel with the
code, and I ended up replying here to the code as I was reading it;
I hope it wasn't too confusing.
Thanks!
Andrea
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: Naoya Horiguchi @ 2012-03-01 20:22 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Naoya Horiguchi, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
Daisuke Nishimura, Hillf Danton, linux-kernel
Hi Andrea,
On Thu, Mar 01, 2012 at 03:33:14AM +0100, Andrea Arcangeli wrote:
> Hi Naoya,
>
> On Tue, Feb 28, 2012 at 04:12:32PM -0500, Naoya Horiguchi wrote:
> > Currently we can't do task migration among memory cgroups without THP split,
> > which means processes heavily using THP experience large overhead in task
> > migration. This patch introduce the code for moving charge of THP and makes
> > THP more valuable.
>
> Nice.
>
> > diff --git linux-next-20120228.orig/mm/memcontrol.c linux-next-20120228/mm/memcontrol.c
> > index c83aeb5..e97c041 100644
> > --- linux-next-20120228.orig/mm/memcontrol.c
> > +++ linux-next-20120228/mm/memcontrol.c
> > @@ -5211,6 +5211,42 @@ static int is_target_pte_for_mc(struct vm_area_struct *vma,
> > return ret;
> > }
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +/*
> > + * We don't consider swapping or file mapped pages because THP does not
> > + * support them for now.
> > + */
> > +static int is_target_huge_pmd_for_mc(struct vm_area_struct *vma,
> > + unsigned long addr, pmd_t pmd, union mc_target *target)
> > +{
> > + struct page *page = NULL;
> > + struct page_cgroup *pc;
> > + int ret = 0;
> > +
> > + if (pmd_present(pmd))
> > + page = pmd_page(pmd);
> > + if (!page)
> > + return 0;
>
> It can't be present and null at the same time.
>
> No need to check pmd_present if you already checked pmd_trans_huge. In
> fact checking pmd_present is a bug. For a little time the pmd won't be
> present if it's set as splitting. (that short clearing of pmd_present
> during pmd splitting is to deal with a vendor CPU errata without
> having to flush the smp TLB twice)
I understand. This is a little-known optimization.
> Following Kame's suggestion is correct, an unconditional pmd_page is
> correct here:
>
> page = pmd_page(pmd);
I agree. I'll take up the suggestion.
> We might actually decide to change pmd_present to return true if
> pmd_trans_splitting is set to avoid the risk of using an erratic
> pmd_present on a pmd_trans_huge pmd, but it's not really necessary if
> you never check pmd_present when a pmd is (or can be) a
> pmd_trans_huge.
OK.
> The safe check for pmd is only pmd_none, never pmd_present (as in
> __pte_alloc/pte_alloc_map/...).
>
> > + VM_BUG_ON(!PageHead(page));
> > + get_page(page);
>
> Other review mentioned we can do get_page only when it succeeds, but I
> think we can drop the whole get_page and simplify it further see the
> end.
See below.
> > @@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> > pte_t *pte;
> > spinlock_t *ptl;
> >
> > - split_huge_page_pmd(walk->mm, pmd);
> > + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> > + if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
> > + mc.precharge += HPAGE_PMD_NR;
>
> Your use of HPAGE_PMD_NR looks fine, that path will be eliminated at
> build time if THP is off. This is the nice way to write code that is
> already optimal for THP=off without making special cases or #ifdefs.
>
> Other review suggests changing HPAGE_PMD_NR as BUILD_BUG, that sounds
> good idea too, but in this (correct) usage of HPAGE_PMD_NR it wouldn't
> make a difference because of the whole branch is correctly eliminated
> at build time. In short changing it to BUILD_BUG will simply make sure
> the whole pmd_trans_huge_lock == 1 branch is eliminated at build
> time. It looks good change too but it's orthogonal so I'd leave it for
> a separate patch.
In my trial, without changing HPAGE_PMD_NR to BUILD_BUG the build did not
pass with !CONFIG_TRANSPARENT_HUGEPAGE, as Hillf said.
Evaluating HPAGE_PMD_NR seems to happen before the whole
pmd_trans_huge_lock == 1 branch is eliminated, so I think this change is
necessary. And I agree to add this change in a separate patch.
> > + spin_unlock(&walk->mm->page_table_lock);
>
> Agree with other review, vma looks cleaner.
I fixed it.
> > + cond_resched();
> > + return 0;
> > + }
> >
> > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > for (; addr != end; pte++, addr += PAGE_SIZE)
> > @@ -5378,16 +5420,38 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> > struct vm_area_struct *vma = walk->private;
> > pte_t *pte;
> > spinlock_t *ptl;
> > + int type;
> > + union mc_target target;
> > + struct page *page;
> > + struct page_cgroup *pc;
> > +
> > + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> > + if (!mc.precharge)
> > + return 0;
>
> Agree with Hillf.
ditto.
> > + type = is_target_huge_pmd_for_mc(vma, addr, *pmd, &target);
> > + if (type == MC_TARGET_PAGE) {
> > + page = target.page;
> > + if (!isolate_lru_page(page)) {
> > + pc = lookup_page_cgroup(page);
> > + if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> > + pc, mc.from, mc.to,
> > + false)) {
> > + mc.precharge -= HPAGE_PMD_NR;
> > + mc.moved_charge += HPAGE_PMD_NR;
> > + }
>
> Like you mentioned, a race with split_huge_page_refcount (and hence
> mem_cgroup_split_huge_fixup) is not possible because of
> pmd_trans_huge_lock succeeding.
>
> However the mmap_sem checked by pmd_trans_huge_lock is there just
> because we deal with pmds and so pagetables (and we aren't doing a
> lockless lookup like gup_fast). But it's not true that a concurrent
> split_huge_page would _not_ prevented by the mmap_sem. The swapout
> path will still split hugepages under you even if you hold the
> mmap_sem (even in write mode).
You're right. I also confirmed that add_to_swap() is called without
holding mmap_sem. Moreover, migrate_page() and hwpoison_user_mappings()
are similar examples.
> The mmap_sem (either read or write) only prevents a concurrent
> collapsing/creation of hugepages (but that's irrelevant here). It
> won't stop split_huge_page.
>
> So - back to our issue - you're safe against split_huge_page not
> running here thanks to the pmd_trans_huge_lock.
OK. I will add the comment here about why this race does not happen.
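For example, something along these lines (the exact wording is just a
sketch):

	/*
	 * No race with split_huge_page() here: we hold mm->page_table_lock
	 * via pmd_trans_huge_lock(), and __split_huge_page_splitting() must
	 * take the same lock before the thp can actually be split.  Note
	 * that mmap_sem alone would not be enough, e.g. the swapout path
	 * splits hugepages without holding it.
	 */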
> There's one tricky locking bit here, that is isolate_lru_page, that
> takes the lru_lock.
>
> So the lock order is the page_table_lock first and the lru_lock
> second, and so there must not be another place that takes the lru_lock
> first and the page_table_lock second. In general it's good idea to
> exercise locking code with lockdep prove locking enabled just in case.
This lock ordering is also described at the head of mm/rmap.c,
and should be obeyed.
And I do (and will) enable lockdep for more dependable testing.
> > + putback_lru_page(page);
> > + }
> > + put_page(page);
>
> I wonder if you need a get_page at all in is_target_huge_pmd_for_mc if
> you drop the above put_page instead. How can this page go away from
> under us, if we've been holding the page_table_lock the whole time?
> You can probably drop both get_page above and put_page above.
I wrote this get/put_page() based on the existing code for regular size pages.
My guess is that, for regular size pages, someone like memory hotplug can call
put_page() without holding page_table_lock, so the original author may have
added this get/put_page() to protect against that. And if that is applicable
to thp (current memory hotplug code does not call split_huge_page() explicitly,
so I think it is,) this get/put_page() makes us safer.
> > + }
> > + spin_unlock(&walk->mm->page_table_lock);
> > + cond_resched();
> > + return 0;
> > + }
> >
> > - split_huge_page_pmd(walk->mm, pmd);
> > retry:
> > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > for (; addr != end; addr += PAGE_SIZE) {
> > pte_t ptent = *(pte++);
> > - union mc_target target;
> > - int type;
> > - struct page *page;
> > - struct page_cgroup *pc;
> > swp_entry_t ent;
> >
> > if (!mc.precharge)
>
> I read the other two great reviews done so far in parallel with the
> code, and I ended up replying here to the code as I was reading it,
> hope it wasn't too confusing.
Thank you very much for the insightful comments.
Naoya
* Re: [RFC][PATCH] memcg: avoid THP split in task migration
From: Naoya Horiguchi @ 2012-03-01 22:01 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: Andrea Arcangeli, linux-mm, Andrew Morton, KAMEZAWA Hiroyuki,
Daisuke Nishimura, Hillf Danton, linux-kernel
On Thu, Mar 01, 2012 at 03:22:16PM -0500, Naoya Horiguchi wrote:
> > > @@ -5219,7 +5255,13 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> > > pte_t *pte;
> > > spinlock_t *ptl;
> > >
> > > - split_huge_page_pmd(walk->mm, pmd);
> > > + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> > > + if (is_target_huge_pmd_for_mc(vma, addr, *pmd, NULL))
> > > + mc.precharge += HPAGE_PMD_NR;
> >
> > Your use of HPAGE_PMD_NR looks fine, that path will be eliminated at
> > build time if THP is off. This is the nice way to write code that is
> > already optimal for THP=off without making special cases or #ifdefs.
> >
> > Other review suggests changing HPAGE_PMD_NR as BUILD_BUG, that sounds
> > good idea too, but in this (correct) usage of HPAGE_PMD_NR it wouldn't
> > make a difference because of the whole branch is correctly eliminated
> > at build time. In short changing it to BUILD_BUG will simply make sure
> > the whole pmd_trans_huge_lock == 1 branch is eliminated at build
> > time. It looks good change too but it's orthogonal so I'd leave it for
> > a separate patch.
>
> In my trial, without changing HPAGE_PMD_NR as BUILD_BUG a build did not
> pass with !CONFIG_TRANSPARENT_HUGEPAGE as Hillf said.
> Evaluating HPAGE_PMD_NR seems to be prior to eliminating whole
> pmd_trans_huge_lock == 1 branch, so I think this change is necessary.
I said the wrong thing.
The error I experienced was just "HPAGE_PMD_NR is undefined."
It is not related to changing BUG() to BUILD_BUG() in the already-defined
HPAGE_PMD_(SHIFT|MASK|SIZE).
And using BUILD_BUG() to confirm the elimination sounds good to me.
Sorry for the confusion.
Naoya