Linux-mm Archive on lore.kernel.org


* [PATCH] fix build error when CONFIG_SWAP is not set
From: Yoichi Yuasa @ 2011-01-24 12:08 UTC (permalink / raw)
  To: Andrew Morton; +Cc: yuasa, linux-mips, linux-mm, linux-kernel

In file included from linux-2.6/arch/mips/include/asm/tlb.h:21,
                 from mm/pgtable-generic.c:9:
include/asm-generic/tlb.h: In function 'tlb_flush_mmu':
include/asm-generic/tlb.h:76: error: implicit declaration of function 'release_pages'
include/asm-generic/tlb.h: In function 'tlb_remove_page':
include/asm-generic/tlb.h:105: error: implicit declaration of function 'page_cache_release'
make[1]: *** [mm/pgtable-generic.o] Error 1

Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
---
 include/linux/swap.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..92c1be6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -8,6 +8,7 @@
 #include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/node.h>
+#include <linux/pagemap.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
-- 
1.7.3.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH 00/21] mm: Preemptibility -v6
From: Peter Zijlstra @ 2011-01-24 12:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Benjamin Herrenschmidt, David Miller, Nick Piggin,
	Martin Schwidefsky, linux-kernel, linux-arch, linux-mm,
	Andrea Arcangeli, Oleg Nesterov, Paul E. McKenney
In-Reply-To: <1295624034.28776.303.camel@laptop>

On Fri, 2011-01-21 at 16:33 +0100, Peter Zijlstra wrote:

> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
>  	 * Synchronize against page_lock_anon_vma() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
> +	 *
> +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> +	 * page_lock_anon_vma(). This orders:
> +	 *
> +	 * page_lock_anon_vma()		VS	put_anon_vma()
> +	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 *   smp_mb()				  smp_mb()
> +	 *   atomic_read()			  mutex_is_locked()

Bah!, I thought all mutex_trylock() implementations used an atomic op
with return value (which implies a mb), but it looks like (at least*)
PPC doesn't and only provides a LOCK barrier.


* possibly ARM and SH don't either, but I can't read either ASMs well
enough to tell.
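
The ordering concern can be modelled in userspace with C11 atomics. This is
only a sketch under stated assumptions: struct av_model, lock_side() and
put_side() are hypothetical names, not kernel code, and the acquire-only
compare-exchange stands in for a trylock that provides just a LOCK barrier.
The explicit seq_cst fence plays the role of smp_mb() on each side, so that
at least one side observes the other's update.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct anon_vma. */
struct av_model {
	atomic_int refcount;
	atomic_bool locked;
};

/*
 * Models the page_lock_anon_vma() side: take the lock, then check that
 * the object still has references. Returns true with the lock held.
 */
static bool lock_side(struct av_model *av)
{
	bool expected = false;

	/* Acquire-only trylock: models a LOCK barrier, not a full mb. */
	if (!atomic_compare_exchange_strong_explicit(&av->locked, &expected,
			true, memory_order_acquire, memory_order_relaxed))
		return false;

	/* Explicit full barrier between taking the lock and the read. */
	atomic_thread_fence(memory_order_seq_cst);

	if (atomic_load_explicit(&av->refcount, memory_order_relaxed) == 0) {
		/* Object is already dead: back out. */
		atomic_store_explicit(&av->locked, false, memory_order_release);
		return false;
	}
	return true;
}

/*
 * Models the put_anon_vma() side: drop a reference. Returns true if it
 * was the last one, and then reports whether a lock holder was seen.
 */
static bool put_side(struct av_model *av, bool *must_wait)
{
	/* fetch_sub returns the old value; 1 means we dropped the last ref. */
	if (atomic_fetch_sub_explicit(&av->refcount, 1,
				      memory_order_relaxed) != 1)
		return false;

	/* Explicit full barrier between the decrement and the lock check. */
	atomic_thread_fence(memory_order_seq_cst);

	*must_wait = atomic_load_explicit(&av->locked, memory_order_relaxed);
	return true;
}
```

With both fences in place, the two sides cannot both miss each other: either
lock_side() sees a zero refcount and backs out, or put_side() sees the lock
held. Dropping either fence removes that guarantee in this model.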


^ permalink raw reply

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
From: Johannes Weiner @ 2011-01-24 11:34 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, akpm@linux-foundation.org
In-Reply-To: <AANLkTi=sg5HpCTdXgEVYS5rCqtoVVho6dxn8giwZ4kmY@mail.gmail.com>

On Mon, Jan 24, 2011 at 08:14:22PM +0900, Hiroyuki Kamezawa wrote:
> 2011/1/24 Johannes Weiner <hannes@cmpxchg.org>:
> > On Mon, Jan 24, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
> >> On Mon, 24 Jan 2011 11:14:02 +0100
> >> Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>
> >> > On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> >> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > >
> >> > > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> >> > >
> >> > > A clean up for mem_cgroup_move_parent().
> >> > >  - remove unnecessary initialization of local variable.
> >> > >  - rename charge_size -> page_size
> >> > >  - remove unnecessary (wrong) comment.
> >> > >
> >> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > > ---
> >> > >  mm/memcontrol.c |   17 +++++++++--------
> >> > >  1 file changed, 9 insertions(+), 8 deletions(-)
> >> > >
> >> > > Index: mmotm-0107/mm/memcontrol.c
> >> > > ===================================================================
> >> > > --- mmotm-0107.orig/mm/memcontrol.c
> >> > > +++ mmotm-0107/mm/memcontrol.c
> >> > > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
> >> > >   struct cgroup *cg = child->css.cgroup;
> >> > >   struct cgroup *pcg = cg->parent;
> >> > >   struct mem_cgroup *parent;
> >> > > - int charge = PAGE_SIZE;
> >> > > + int page_size;
> >> > >   unsigned long flags;
> >> > >   int ret;
> >> > >
> >> > > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
> >> > >           goto out;
> >> > >   if (isolate_lru_page(page))
> >> > >           goto put;
> >> > > - /* The page is isolated from LRU and we have no race with splitting */
> >> > > - charge = PAGE_SIZE << compound_order(page);
> >> > > +
> >> > > + page_size = PAGE_SIZE << compound_order(page);
> >> >
> >> > Okay, so you remove the wrong comment, but that does not make the code
> >> > right.  What protects compound_order from reading garbage because the
> >> > page is currently splitting?
> >> >
> >>
> >> ==
> >> static int mem_cgroup_move_account(struct page_cgroup *pc,
> >>                 struct mem_cgroup *from, struct mem_cgroup *to,
> >>                 bool uncharge, int charge_size)
> >> {
> >>         int ret = -EINVAL;
> >>         unsigned long flags;
> >>
> >>         if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
> >>                 return -EBUSY;
> >> ==
> >>
> >> This is called under compound_lock(). Then, if someone breaks THP,
> >> -EBUSY and retry.
> >
> > This charge_size contains exactly the garbage you just read from an
> > unprotected compound_order().  It could be anything if the page is
> > split concurrently.
> 
> Then, my recent fix to LRU accounting, which uses compound_order(), is racy too?

In lru add/delete/move/rotate?  No, that should be safe because we
have the lru lock there and __split_huge_page_refcount() takes the
lock as well.

> I'll replace compound_order() with
>   if (PageTransHuge(page))
>       size = HPAGE_SIZE.
> 
> Does this work ?

Yes, I think this should work.  This gives a sane size for try_charge
and we still catch a split under the compound_lock later in
move_account as you described above.
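
A minimal userspace sketch of the idea agreed on above, with assumptions
labelled: struct page_model and the model_* functions are hypothetical
stand-ins for struct page, the size computation in mem_cgroup_move_parent()
and the check in mem_cgroup_move_account(); the 4 KiB and 2 MiB sizes are
assumed x86-64 values. The size is derived from a single flag test instead
of a compound_order() read, and the later check still catches a split.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE   4096UL               /* assumed base page size */
#define MODEL_HPAGE_SIZE  (2UL * 1024 * 1024)  /* assumed huge page size */

/* Hypothetical model of a page: only the "is a THP head" state. */
struct page_model {
	bool trans_huge;
};

/* Either value returned here is a sane size to pass to try_charge. */
static size_t model_charge_size(const struct page_model *page)
{
	return page->trans_huge ? MODEL_HPAGE_SIZE : MODEL_PAGE_SIZE;
}

/*
 * Models the check done under compound_lock(): if the size says "huge"
 * but the page has been split in the meantime, ask the caller to retry.
 */
static int model_move_account(const struct page_model *page,
			      size_t charge_size)
{
	if (charge_size > MODEL_PAGE_SIZE && !page->trans_huge)
		return -EBUSY;
	return 0;
}
```

In this model a split between the two calls yields -EBUSY, never a charge of
an arbitrary size.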


^ permalink raw reply

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
From: Hiroyuki Kamezawa @ 2011-01-24 11:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, akpm@linux-foundation.org
In-Reply-To: <20110124104510.GW2232@cmpxchg.org>

2011/1/24 Johannes Weiner <hannes@cmpxchg.org>:
> On Mon, Jan 24, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Mon, 24 Jan 2011 11:14:02 +0100
>> Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> > On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
>> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > >
>> > > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
>> > >
>> > > A clean up for mem_cgroup_move_parent().
>> > >  - remove unnecessary initialization of local variable.
>> > >  - rename charge_size -> page_size
>> > >  - remove unnecessary (wrong) comment.
>> > >
>> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > > ---
>> > >  mm/memcontrol.c |   17 +++++++++--------
>> > >  1 file changed, 9 insertions(+), 8 deletions(-)
>> > >
>> > > Index: mmotm-0107/mm/memcontrol.c
>> > > ===================================================================
>> > > --- mmotm-0107.orig/mm/memcontrol.c
>> > > +++ mmotm-0107/mm/memcontrol.c
>> > > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
>> > >   struct cgroup *cg = child->css.cgroup;
>> > >   struct cgroup *pcg = cg->parent;
>> > >   struct mem_cgroup *parent;
>> > > - int charge = PAGE_SIZE;
>> > > + int page_size;
>> > >   unsigned long flags;
>> > >   int ret;
>> > >
>> > > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
>> > >           goto out;
>> > >   if (isolate_lru_page(page))
>> > >           goto put;
>> > > - /* The page is isolated from LRU and we have no race with splitting */
>> > > - charge = PAGE_SIZE << compound_order(page);
>> > > +
>> > > + page_size = PAGE_SIZE << compound_order(page);
>> >
>> > Okay, so you remove the wrong comment, but that does not make the code
>> > right.  What protects compound_order from reading garbage because the
>> > page is currently splitting?
>> >
>>
>> ==
>> static int mem_cgroup_move_account(struct page_cgroup *pc,
>>                 struct mem_cgroup *from, struct mem_cgroup *to,
>>                 bool uncharge, int charge_size)
>> {
>>         int ret = -EINVAL;
>>         unsigned long flags;
>>
>>         if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
>>                 return -EBUSY;
>> ==
>>
>> This is called under compound_lock(). Then, if someone breaks THP,
>> -EBUSY and retry.
>
> This charge_size contains exactly the garbage you just read from an
> unprotected compound_order().  It could be anything if the page is
> split concurrently.

Then, my recent fix to LRU accounting, which uses compound_order(), is racy too?

I'll replace compound_order() with
  if (PageTransHuge(page))
      size = HPAGE_SIZE.

Does this work ?
If there is no way to acquire the size of a page without a lock, I need to add one.
Any idea?

Thanks,
-Kame


^ permalink raw reply

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
From: Johannes Weiner @ 2011-01-24 10:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110124191535.514ef2d9.kamezawa.hiroyu@jp.fujitsu.com>

On Mon, Jan 24, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 24 Jan 2011 11:14:02 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> > > 
> > > A clean up for mem_cgroup_move_parent(). 
> > >  - remove unnecessary initialization of local variable.
> > >  - rename charge_size -> page_size
> > >  - remove unnecessary (wrong) comment.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  mm/memcontrol.c |   17 +++++++++--------
> > >  1 file changed, 9 insertions(+), 8 deletions(-)
> > > 
> > > Index: mmotm-0107/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-0107.orig/mm/memcontrol.c
> > > +++ mmotm-0107/mm/memcontrol.c
> > > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
> > >  	struct cgroup *cg = child->css.cgroup;
> > >  	struct cgroup *pcg = cg->parent;
> > >  	struct mem_cgroup *parent;
> > > -	int charge = PAGE_SIZE;
> > > +	int page_size;
> > >  	unsigned long flags;
> > >  	int ret;
> > >  
> > > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
> > >  		goto out;
> > >  	if (isolate_lru_page(page))
> > >  		goto put;
> > > -	/* The page is isolated from LRU and we have no race with splitting */
> > > -	charge = PAGE_SIZE << compound_order(page);
> > > +
> > > +	page_size = PAGE_SIZE << compound_order(page);
> > 
> > Okay, so you remove the wrong comment, but that does not make the code
> > right.  What protects compound_order from reading garbage because the
> > page is currently splitting?
> > 
> 
> ==
> static int mem_cgroup_move_account(struct page_cgroup *pc,
>                 struct mem_cgroup *from, struct mem_cgroup *to,
>                 bool uncharge, int charge_size)
> {
>         int ret = -EINVAL;
>         unsigned long flags;
> 
>         if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
>                 return -EBUSY;
> ==
> 
> This is called under compound_lock(). Then, if someone breaks THP,
> -EBUSY and retry.

This charge_size contains exactly the garbage you just read from an
unprotected compound_order().  It could be anything if the page is
split concurrently.


^ permalink raw reply

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
From: KAMEZAWA Hiroyuki @ 2011-01-24 10:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110124101402.GT2232@cmpxchg.org>

On Mon, 24 Jan 2011 11:14:02 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> > 
> > A clean up for mem_cgroup_move_parent(). 
> >  - remove unnecessary initialization of local variable.
> >  - rename charge_size -> page_size
> >  - remove unnecessary (wrong) comment.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   17 +++++++++--------
> >  1 file changed, 9 insertions(+), 8 deletions(-)
> > 
> > Index: mmotm-0107/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0107.orig/mm/memcontrol.c
> > +++ mmotm-0107/mm/memcontrol.c
> > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
> >  	struct cgroup *cg = child->css.cgroup;
> >  	struct cgroup *pcg = cg->parent;
> >  	struct mem_cgroup *parent;
> > -	int charge = PAGE_SIZE;
> > +	int page_size;
> >  	unsigned long flags;
> >  	int ret;
> >  
> > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
> >  		goto out;
> >  	if (isolate_lru_page(page))
> >  		goto put;
> > -	/* The page is isolated from LRU and we have no race with splitting */
> > -	charge = PAGE_SIZE << compound_order(page);
> > +
> > +	page_size = PAGE_SIZE << compound_order(page);
> 
> Okay, so you remove the wrong comment, but that does not make the code
> right.  What protects compound_order from reading garbage because the
> page is currently splitting?
> 

==
static int mem_cgroup_move_account(struct page_cgroup *pc,
                struct mem_cgroup *from, struct mem_cgroup *to,
                bool uncharge, int charge_size)
{
        int ret = -EINVAL;
        unsigned long flags;

        if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
                return -EBUSY;
==

This is called under compound_lock(). Then, if someone breaks THP,
-EBUSY and retry.


Thanks,
-Kame





^ permalink raw reply

* Re: [PATCH 7/7] memcg : remove ugly vairable initialization by callers
From: Johannes Weiner @ 2011-01-24 10:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110121155051.0b309b1f.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, Jan 21, 2011 at 03:50:51PM +0900, KAMEZAWA Hiroyuki wrote:
> This is a promised one.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This patch is for removing initialization in caller of memory cgroup
> function. Some memory cgroup uses following style to bring the result
> of start function to the end function for avoiding races.
> 
>    mem_cgroup_start_A(&(*ptr))
>    /* Something very complicated can happen here. */
>    mem_cgroup_end_A(*ptr)
> 
> In some calls, *ptr should be initialized to NULL by the caller. But
> it's ugly. With this patch, *ptr is initialized by the _start
> function.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Nitpick: I would remove the comments above the *ptr = NULL lines,
there should be no assumptions about the consequences in the caller
(the next patch will change the caller, and then the comments are
nothing but confusing).  It's just a plain initialization of a return
value.
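
The start/end convention discussed here can be sketched in plain C. All names
(struct memcg_model, the_group, model_start(), model_end()) are hypothetical
and only illustrate the pattern: the _start function always writes *ptr, so
callers do not need their own "= NULL".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup. */
struct memcg_model {
	int charges;
};

static struct memcg_model the_group;

/* _start always writes *ptr, whatever path it takes. */
static int model_start(struct memcg_model **ptr, bool accounting_enabled)
{
	*ptr = NULL;	/* plain initialization of the return value */
	if (!accounting_enabled)
		return 0;
	the_group.charges++;
	*ptr = &the_group;
	return 0;
}

/* _end accepts whatever _start produced, including NULL. */
static void model_end(struct memcg_model *ptr)
{
	if (!ptr)
		return;	/* nothing was started */
	ptr->charges--;
}
```

A caller can then declare the pointer without an initializer and pass its
address to model_start() directly.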


^ permalink raw reply

* Re: [PATCH 2/7] memcg : more fixes and clean up for 2.6.28-rc
From: Johannes Weiner @ 2011-01-24 10:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110121153928.bb0c5e90.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, Jan 21, 2011 at 03:39:28PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This is a fix for
> ca3e021417eed30ec2b64ce88eb0acf64aa9bc29
> 
> mem_cgroup_disabled() should be checked at splitting.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
From: Johannes Weiner @ 2011-01-24 10:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110121153726.54f4a159.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> 
> A clean up for mem_cgroup_move_parent(). 
>  - remove unnecessary initialization of local variable.
>  - rename charge_size -> page_size
>  - remove unnecessary (wrong) comment.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
>  	struct cgroup *cg = child->css.cgroup;
>  	struct cgroup *pcg = cg->parent;
>  	struct mem_cgroup *parent;
> -	int charge = PAGE_SIZE;
> +	int page_size;
>  	unsigned long flags;
>  	int ret;
>  
> @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
>  		goto out;
>  	if (isolate_lru_page(page))
>  		goto put;
> -	/* The page is isolated from LRU and we have no race with splitting */
> -	charge = PAGE_SIZE << compound_order(page);
> +
> +	page_size = PAGE_SIZE << compound_order(page);

Okay, so you remove the wrong comment, but that does not make the code
right.  What protects compound_order from reading garbage because the
page is currently splitting?


^ permalink raw reply

* Re: [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit
From: KAMEZAWA Hiroyuki @ 2011-01-24 10:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110124100434.GS2232@cmpxchg.org>

On Mon, 24 Jan 2011 11:04:34 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:
  
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					gfp_mask, flags);
> > +					gfp_mask, flags, csize);
> >  	/*
> >  	 * try_to_free_mem_cgroup_pages() might not give us a full
> >  	 * picture of reclaim. Some pages are reclaimed and might be
> > @@ -1852,7 +1853,7 @@ static int __mem_cgroup_do_charge(struct
> >  	 * Check the limit again to see if the reclaim reduced the
> >  	 * current usage of the cgroup before giving up
> >  	 */
> > -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> > +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
> >  		return CHARGE_RETRY;
> 
> This is the only site that is really involved with THP. 

yes.

> But you need to touch every site because you change mem_cgroup_check_under_limit()
> instead of adding a new function.
> 
Yes.

> I would suggest just adding another function for checking available
> space explicitly and only changing this single call site to use it.
> 
> Just ignore the return value of mem_cgroup_hierarchical_reclaim() and
> check for enough space unconditionally.
> 
> Everybody else is happy with PAGE_SIZE pages.
> 
Hmm, OK. Let's keep the changes small and see how often hugepage allocation fails.

Thanks,
-Kame




^ permalink raw reply

* Re: [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit
From: Johannes Weiner @ 2011-01-24 10:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110121154141.680c96d9.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, Jan 21, 2011 at 03:41:41PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Current memory cgroup's code tends to assume page_size == PAGE_SIZE
> but THP does HPAGE_SIZE charge.
> 
> This is one of the fixes for supporting THP. This modifies
> mem_cgroup_check_under_limit to take page_size into account.
> 
> Total fixes for do_charge()/reclaim memory will follow this patch.
> 
> TODO: With this change, the reclaim function can get page_size as an argument.
> So there may be something that should be improved.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/res_counter.h |   11 +++++++++++
>  mm/memcontrol.c             |   27 ++++++++++++++-------------
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> Index: mmotm-0107/include/linux/res_counter.h
> ===================================================================
> --- mmotm-0107.orig/include/linux/res_counter.h
> +++ mmotm-0107/include/linux/res_counter.h
> @@ -182,6 +182,17 @@ static inline bool res_counter_check_und
>  	return ret;
>  }
>  
> +static inline s64 res_counter_check_margin(struct res_counter *cnt)
> +{
> +	s64 ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = cnt->limit - cnt->usage;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  {
>  	bool ret;
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -1099,14 +1099,14 @@ unsigned long mem_cgroup_isolate_pages(u
>  #define mem_cgroup_from_res_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
>  
> -static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
> +static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem, int page_size)
>  {
>  	if (do_swap_account) {
> -		if (res_counter_check_under_limit(&mem->res) &&
> -			res_counter_check_under_limit(&mem->memsw))
> +		if (res_counter_check_margin(&mem->res) >= page_size &&
> +			res_counter_check_margin(&mem->memsw) >= page_size)
>  			return true;
>  	} else
> -		if (res_counter_check_under_limit(&mem->res))
> +		if (res_counter_check_margin(&mem->res) >= page_size)
>  			return true;
>  	return false;
>  }
> @@ -1367,7 +1367,8 @@ mem_cgroup_select_victim(struct mem_cgro
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  						struct zone *zone,
>  						gfp_t gfp_mask,
> -						unsigned long reclaim_options)
> +						unsigned long reclaim_options,
> +						int page_size)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
> @@ -1434,7 +1435,7 @@ static int mem_cgroup_hierarchical_recla
>  		if (check_soft) {
>  			if (res_counter_check_under_soft_limit(&root_mem->res))
>  				return total;
> -		} else if (mem_cgroup_check_under_limit(root_mem))
> +		} else if (mem_cgroup_check_under_limit(root_mem, page_size))
>  			return 1 + total;
>  	}
>  	return total;
> @@ -1844,7 +1845,7 @@ static int __mem_cgroup_do_charge(struct
>  		return CHARGE_WOULDBLOCK;
>  
>  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					gfp_mask, flags);
> +					gfp_mask, flags, csize);
>  	/*
>  	 * try_to_free_mem_cgroup_pages() might not give us a full
>  	 * picture of reclaim. Some pages are reclaimed and might be
> @@ -1852,7 +1853,7 @@ static int __mem_cgroup_do_charge(struct
>  	 * Check the limit again to see if the reclaim reduced the
>  	 * current usage of the cgroup before giving up
>  	 */
> -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
>  		return CHARGE_RETRY;

This is the only site that is really involved with THP.  But you need
to touch every site because you change mem_cgroup_check_under_limit()
instead of adding a new function.

I would suggest just adding another function for checking available
space explicitly and only changing this single call site to use it.

Just ignore the return value of mem_cgroup_hierarchical_reclaim() and
check for enough space unconditionally.

Everybody else is happy with PAGE_SIZE pages.
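
A rough userspace sketch of such a separate helper, under stated assumptions:
struct counter_model, model_margin() and model_check_margin() are hypothetical
names, locking is omitted, and the margin is clamped at zero rather than
returned as a signed value as in the quoted res_counter_check_margin().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct res_counter (no lock in this model). */
struct counter_model {
	uint64_t limit;
	uint64_t usage;
};

/* Bytes still available below the limit, clamped at zero. */
static uint64_t model_margin(const struct counter_model *cnt)
{
	return cnt->limit > cnt->usage ? cnt->limit - cnt->usage : 0;
}

/*
 * True if 'bytes' more can be charged. Only the THP-aware call site
 * would use this; other callers keep the plain under-limit check.
 */
static bool model_check_margin(const struct counter_model *res,
			       const struct counter_model *memsw,
			       bool do_swap_account, uint64_t bytes)
{
	if (model_margin(res) < bytes)
		return false;
	if (do_swap_account && model_margin(memsw) < bytes)
		return false;
	return true;
}
```

The single call site after reclaim would then test for enough space for the
whole charge, regardless of what reclaim returned.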


^ permalink raw reply

* Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
From: Balbir Singh @ 2011-01-24  6:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, akpm, npiggin, kvm, linux-kernel, kosaki.motohiro,
	kamezawa.hiroyu
In-Reply-To: <alpine.DEB.2.00.1101210947270.13881@router.home>

* Christoph Lameter <cl@linux.com> [2011-01-21 09:55:17]:

> On Fri, 21 Jan 2011, Balbir Singh wrote:
> 
> > * Christoph Lameter <cl@linux.com> [2011-01-20 09:00:09]:
> >
> > > On Thu, 20 Jan 2011, Balbir Singh wrote:
> > >
> > > > +	unmapped_page_control
> > > > +			[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> > > > +			is enabled. It controls the amount of unmapped memory
> > > > +			that is present in the system. This boot option plus
> > > > +			vm.min_unmapped_ratio (sysctl) provide granular control
> > >
> > > min_unmapped_ratio is there to guarantee that zone reclaim does not
> > > reclaim all unmapped pages.
> > >
> > > What you want here is a max_unmapped_ratio.
> > >
> >
> > I thought about that, the logic for reusing min_unmapped_ratio was to
> > keep a limit beyond which unmapped page cache shrinking should stop.
> 
> Right. That is the role of it. It's a minimum to leave. You want a maximum
> size of the page cache.

In this case we want the maximum to be as small as the minimum, but
from a general design perspective maximum does make sense.

> 
> > I think you are suggesting max_unmapped_ratio as the point at which
> > shrinking should begin, right?
> 
> The role of min_unmapped_ratio is to never reclaim more pagecache if we
> reach that ratio even if we have to go off node for an allocation.
> 
> AFAICT What you propose is a maximum size of the page cache. If the number
> of page cache pages goes beyond that then you trim the page cache in
> background reclaim.
> 
> > > > +			reclaim_unmapped_pages(priority, zone, &sc);
> > > > +
> > > >  			if (!zone_watermark_ok_safe(zone, order,
> > >
> > > Hmmmm. Okay that means background reclaim does it. If so then we also want
> > > zone reclaim to be able to work in the background I think.
> >
> > Anything specific you had in mind, works for me in testing, but is
> > there anything specific that stands out in your mind that needs to be
> > done?
> 
> Hmmm. So this would also work in a NUMA configuration, right. Limiting the
> sizes of the page cache would avoid zone reclaim through these limits. Page
> cache size would be limited by the max_unmapped_ratio.
> 
> zone_reclaim only would come into play if other allocations make the
> memory on the node so tight that we would have to evict more page
> cache pages in direct reclaim.
> Then zone_reclaim could go down to shrink the page cache size to
> min_unmapped_ratio.
>

I'll repost with max_unmapped_ratio changes

Thanks for the review! 

-- 
	Three Cheers,
	Balbir


^ permalink raw reply

* too big min_free_kbytes
From: Shaohua Li @ 2011-01-24  3:56 UTC (permalink / raw)
  To: Andrew Morton, aarcange; +Cc: linux-mm, Chen, Tim C

Hi,
With transparent huge page, min_free_kbytes is set too big.
Before:
Node 0, zone    DMA32
  pages free     1812
        min      1424
        low      1780
        high     2136
        scanned  0
        spanned  519168
        present  511496

After:
Node 0, zone    DMA32
  pages free     482708
        min      11178
        low      13972
        high     16767
        scanned  0
        spanned  519168
        present  511496
This caused various performance problems in our tests. I wonder why we
set the value so high.

Thanks,
Shaohua
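
For reference, the low and high figures in both dumps above are consistent
with low = min + min/4 and high = min + min/2 (integer division). A minimal
C sketch of that arithmetic follows. It assumes a non-highmem zone, and the
struct and function names are made up for illustration; they are not kernel
API.

```c
#include <assert.h>

/* Watermarks of one zone, in pages, as printed in /proc/zoneinfo. */
struct zone_wmarks {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/*
 * Derive low and high from min the way the figures above suggest:
 * low = min + min/4, high = min + min/2 (integer division).
 * Assumption: the zone is not highmem, so its share of min_free_kbytes
 * equals its min watermark.
 */
struct zone_wmarks wmarks_from_min(unsigned long min)
{
	struct zone_wmarks w;

	w.min = min;
	w.low = min + (min >> 2);
	w.high = min + (min >> 1);
	assert(w.min <= w.low && w.low <= w.high);
	return w;
}
```

Feeding in the two min values from the dumps (1424 and 11178) reproduces
the reported low and high values, so the jump in low/high follows directly
from the larger min.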


^ permalink raw reply

* Re: [PATCH 0/7] memcg : more fixes and clean up for 2.6.28-rc
From: KAMEZAWA Hiroyuki @ 2011-01-24  0:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	nishimura@mxp.nes.nec.co.jp, hannes@cmpxchg.org,
	balbir@linux.vnet.ibm.com, akpm@linux-foundation.org
In-Reply-To: <20110121153431.191134dd.kamezawa.hiroyu@jp.fujitsu.com>

On Fri, 21 Jan 2011 15:34:31 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> This is a set of patches that I'm now testing. It seems to have passed
> a small test, so I'm posting it.
> 
> Some are bug fixes and others are cleanups, but I think all of them are
> for 2.6.38.
> 
> Brief description
> 
> [1/7] remove buggy comment and use better name for mem_cgroup_move_parent()
>       The fixes for mem_cgroup_move_parent() are already in mainline; this
>       is an add-on.
> 
> [2/7] a bug fix for a new function mem_cgroup_split_huge_fixup(),
>       which was recently merged.
> 
> [3/7] prepare for fixes in [4/7],[5/7]. This is an enhancement of a function
>       which is already in use.
> 
> [4/7] fix mem_cgroup_charge() for THP. With this, memory cgroup's charge
>       function will handle THP requests in a sane way.
> 
> [5/7] fix khugepaged scan condition for memcg.
>       This fixes a hang of processes under a small/busy memory cgroup.
> 
> [6/7] rename variables to page_size, nr_pages, bytes rather than
>       ambiguous names.
> 
> [7/7] some memcg functions require the caller to initialize a variable
>       before the call. That is ugly, so fix it.
> 
> 
> I think patches 1, 2, 3, 4, and 5 are urgent. Patch 5 needs a careful
> review, but without it a stress test on a small memory cgroup will not
> run successfully.
> 
> 

I'll rebase this set onto http://marc.info/?l=linux-mm&m=129559263207634&w=2
and post v2.

Thanks,
-Kame


^ permalink raw reply

* Re: [PATCH 4/7] memcg : fix charge function of THP allocation.
From: KAMEZAWA Hiroyuki @ 2011-01-24  0:14 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	hannes@cmpxchg.org, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org
In-Reply-To: <20110121174818.28e1cc83.nishimura@mxp.nes.nec.co.jp>

On Fri, 21 Jan 2011 17:48:18 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 21 Jan 2011 15:44:30 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > When THP is used, hugepage-sized charges can happen. They are not handled
> > correctly in mem_cgroup_do_charge(). For example, THP can fall back
> > to small page allocation when hugepage allocation seems difficult
> > or busy, but memory cgroup doesn't understand this and continues to
> > try hugepage charging. Worst of all, memory cgroup
> > believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> > 
> > Because of this, khugepaged etc. can go into an infinite reclaim loop
> > if tasks in the memcg are busy.
> > 
> > After this patch:
> >  - Hugepage allocation will fail if the first trial of page reclaim fails.
> >  - THP allocation is distinguished from batched allocation.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   51 +++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 35 insertions(+), 16 deletions(-)
> > 
> > Index: mmotm-0107/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0107.orig/mm/memcontrol.c
> > +++ mmotm-0107/mm/memcontrol.c
> > @@ -1812,24 +1812,25 @@ enum {
> >  	CHARGE_OK,		/* success */
> >  	CHARGE_RETRY,		/* need to retry but retry is not bad */
> >  	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
> > +	CHARGE_NEED_BREAK,	/* big size allocation failure */
> >  	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
> >  	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
> >  };
> >  
> >  static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> > -				int csize, bool oom_check)
> > +			int page_size, bool do_reclaim, bool oom_check)
> 
> I'm sorry, I can't understand why we need 'do_reclaim'. See below.
> 
> >  {
> >  	struct mem_cgroup *mem_over_limit;
> >  	struct res_counter *fail_res;
> >  	unsigned long flags = 0;
> >  	int ret;
> >  
> > -	ret = res_counter_charge(&mem->res, csize, &fail_res);
> > +	ret = res_counter_charge(&mem->res, page_size, &fail_res);
> >  
> >  	if (likely(!ret)) {
> >  		if (!do_swap_account)
> >  			return CHARGE_OK;
> > -		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> > +		ret = res_counter_charge(&mem->memsw, page_size, &fail_res);
> >  		if (likely(!ret))
> >  			return CHARGE_OK;
> >  
> > @@ -1838,14 +1839,14 @@ static int __mem_cgroup_do_charge(struct
> >  	} else
> >  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> >  
> > -	if (csize > PAGE_SIZE) /* change csize and retry */
> > +	if (!do_reclaim)
> >  		return CHARGE_RETRY;
> >  
> 
> From the very beginning, do we need this "CHARGE_RETRY" ?
> 

Reduce charge_size here automatically and go back to the start of this function?
I think returning here is better.


> >  	if (!(gfp_mask & __GFP_WAIT))
> >  		return CHARGE_WOULDBLOCK;
> >  
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					gfp_mask, flags, csize);
> > +					gfp_mask, flags, page_size);
> >  	/*
> >  	 * try_to_free_mem_cgroup_pages() might not give us a full
> >  	 * picture of reclaim. Some pages are reclaimed and might be
> > @@ -1853,19 +1854,28 @@ static int __mem_cgroup_do_charge(struct
> >  	 * Check the limit again to see if the reclaim reduced the
> >  	 * current usage of the cgroup before giving up
> >  	 */
> > -	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
> > +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, page_size))
> >  		return CHARGE_RETRY;
> >  
> >  	/*
> > +	 * When page_size > PAGE_SIZE, THP calls this function and it's
> > +	 * ok to tell 'there are not enough pages for hugepage'. THP will
> > +	 * fallback into PAGE_SIZE allocation. If we do reclaim eagerly,
> > +	 * page splitting will occur and it seems much worse.
> > +	 */
> > +	if (page_size > PAGE_SIZE)
> > +		return CHARGE_NEED_BREAK;
> > +
> > +	/*
> >  	 * At task move, charge accounts can be doubly counted. So, it's
> >  	 * better to wait until the end of task_move if something is going on.
> >  	 */
> >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> >  		return CHARGE_RETRY;
> > -
> >  	/* If we don't need to call oom-killer at el, return immediately */
> >  	if (!oom_check)
> >  		return CHARGE_NOMEM;
> > +
> >  	/* check OOM */
> >  	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
> >  		return CHARGE_OOM_DIE;
> > @@ -1885,7 +1895,7 @@ static int __mem_cgroup_try_charge(struc
> >  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem = NULL;
> >  	int ret;
> > -	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> > +	bool use_pcp_cache = (page_size == PAGE_SIZE);
> >  
> >  	/*
> >  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> > @@ -1910,7 +1920,7 @@ again:
> >  		VM_BUG_ON(css_is_removed(&mem->css));
> >  		if (mem_cgroup_is_root(mem))
> >  			goto done;
> > -		if (page_size == PAGE_SIZE && consume_stock(mem))
> > +		if (use_pcp_cache && consume_stock(mem))
> >  			goto done;
> >  		css_get(&mem->css);
> >  	} else {
> > @@ -1933,7 +1943,7 @@ again:
> >  			rcu_read_unlock();
> >  			goto done;
> >  		}
> > -		if (page_size == PAGE_SIZE && consume_stock(mem)) {
> > +		if (use_pcp_cache && consume_stock(mem)) {
> >  			/*
> >  			 * It seems dagerous to access memcg without css_get().
> >  			 * But considering how consume_stok works, it's not
> > @@ -1967,17 +1977,26 @@ again:
> >  			oom_check = true;
> >  			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  		}
> > -
> > -		ret = __mem_cgroup_do_charge(mem, gfp_mask, csize, oom_check);
> > +		if (use_pcp_cache)
> > +			ret = __mem_cgroup_do_charge(mem, gfp_mask,
> > +					CHARGE_SIZE, false, oom_check);
> > +		else
> > +			ret = __mem_cgroup_do_charge(mem, gfp_mask,
> > +					page_size, true, oom_check);
> >  
> 
> hmm, this confuses me. I think 'use_pcp_cache' will be used to decide
> whether we should do consume_stock() or not, but why we change charge size
> and reclaim behavior depending on it ? I think this code itself is right,
> but using 'use_pcp_cache' confused me.
> 

Is it a problem with the name?
Would 'do_batched_charge' or something similar be better?

I'd like to use one 'xxxx_size' variable rather than two.



> 
> >  		switch (ret) {
> >  		case CHARGE_OK:
> >  			break;
> >  		case CHARGE_RETRY: /* not in OOM situation but retry */
> > -			csize = page_size;
> > +			if (use_pcp_cache)/* need to reclaim pages */
> > +				use_pcp_cache = false;
> >  			css_put(&mem->css);
> >  			mem = NULL;
> >  			goto again;
> > +		case CHARGE_NEED_BREAK: /* page_size > PAGE_SIZE */
> > +			css_put(&mem->css);
> > +			/* returning failure doesn't mean OOM for hugepages */
> > +			goto nomem;
> 
> I like this change.
> 
> >  		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
> >  			css_put(&mem->css);
> >  			goto nomem;
> > @@ -1994,9 +2013,9 @@ again:
> >  			goto bypass;
> >  		}
> >  	} while (ret != CHARGE_OK);
> > -
> > -	if (csize > page_size)
> > -		refill_stock(mem, csize - page_size);
> > +	/* This flag is cleared when we fail CHARGE_SIZE charge. */
> > +	if (use_pcp_cache)
> > +		refill_stock(mem, CHARGE_SIZE - page_size);
> 
> Ditto. can't we keep 'csize' and old code here ?
> 

I removed csize. Having two 'size' variables is confusing.


Thanks.
-Kame
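
To make the control flow discussed above easier to follow, here is a small
userspace sketch of it: a batched charge is tried first for small pages, a
failed batch is retried with the real size, and a failed hugepage charge
breaks out without being treated as OOM. This is only a model under stated
assumptions: reclaim is treated as always failing, `struct counter` is a
hypothetical stand-in for the res_counter, and the 4 KiB page and 32-page
batch sizes are assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SZ  4096UL			/* assumed page size */
#define BATCH_SZ (32 * PAGE_SZ)		/* assumed batch (CHARGE_SIZE-like) value */

enum charge_result { CHARGE_OK, CHARGE_RETRY, CHARGE_NOMEM, CHARGE_NEED_BREAK };

/* Hypothetical stand-in for struct res_counter. */
struct counter {
	unsigned long usage;
	unsigned long limit;
};

bool counter_charge(struct counter *c, unsigned long size)
{
	if (c->usage + size > c->limit)
		return false;
	c->usage += size;
	return true;
}

/* One charge attempt. Reclaim is modelled as always failing. */
enum charge_result do_charge(struct counter *c, unsigned long size,
			     bool do_reclaim)
{
	if (counter_charge(c, size))
		return CHARGE_OK;
	if (!do_reclaim)	/* batched attempt: retry with the real size */
		return CHARGE_RETRY;
	if (size > PAGE_SZ)	/* hugepage: let the caller fall back */
		return CHARGE_NEED_BREAK;
	return CHARGE_NOMEM;
}

/*
 * Returns 0 on success, -1 on failure. *stock receives the surplus of a
 * successful batched charge.
 */
int try_charge(struct counter *c, unsigned long page_size,
	       unsigned long *stock)
{
	bool batched = (page_size == PAGE_SZ);
	enum charge_result ret;

	*stock = 0;
	for (;;) {
		if (batched)
			ret = do_charge(c, BATCH_SZ, false);
		else
			ret = do_charge(c, page_size, true);

		if (ret == CHARGE_OK)
			break;
		if (ret == CHARGE_RETRY) {
			batched = false;	/* drop batching, try again */
			continue;
		}
		return -1;	/* CHARGE_NOMEM or CHARGE_NEED_BREAK */
	}
	if (batched)
		*stock = BATCH_SZ - page_size;
	return 0;
}
```

In this model a single flag decides both the charge size and whether
reclaim is attempted, which is the coupling questioned in the review.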


^ permalink raw reply

* Re: [PATCH] Fix uninitialized variable use in mm/memcontrol.c::mem_cgroup_move_parent()
From: KAMEZAWA Hiroyuki @ 2011-01-24  0:08 UTC (permalink / raw)
  To: Jesper Juhl
  Cc: linux-mm, linux-kernel, Balbir Singh, Daisuke Nishimura,
	Pavel Emelianov, Kirill A. Shutemov
In-Reply-To: <alpine.LNX.2.00.1101222044580.7746@swampdragon.chaosbits.net>

On Sat, 22 Jan 2011 20:51:32 +0100 (CET)
Jesper Juhl <jj@chaosbits.net> wrote:

> In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps to 
> the 'put_back' label
>   	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
>   	if (ret || !parent)
>   		goto put_back;
>  where we'll 
>   	if (charge > PAGE_SIZE)
>   		compound_unlock_irqrestore(page, flags);
> but, we have not assigned anything to 'flags' at this point, nor have we 
> called 'compound_lock_irqsave()' (which is what sets 'flags').
> So, I believe the 'put_back' label should be moved below the call to 
> compound_unlock_irqrestore() as per this patch. 
> 
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>

Thank you.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Andrew, I'll move my new patches onto this, so please pick this one first.


^ permalink raw reply

* [patch] mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator
From: David Rientjes @ 2011-01-23 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

0e093d99763e (writeback: do not sleep on the congestion queue if there
are no congested BDIs or if significant congestion is not being
encountered in the current zone) uncovered a livelock in the page
allocator that resulted in tasks infinitely looping trying to find memory
and kswapd running at 100% cpu.

The issue occurs because drain_all_pages() is called immediately
following direct reclaim when no memory is freed and try_to_free_pages()
returns non-zero because not all zones in the zonelist have their
all_unreclaimable flag set.

When draining the per-cpu pagesets back to the buddy allocator for each
zone, the zone->pages_scanned counter is cleared to avoid erroneously
setting zone->all_unreclaimable later.  The problem is that no pages may
actually be drained while the counter is still cleared; thus, the
unreclaimable logic never fails direct reclaim, which is what would allow
the oom killer to be invoked.

This apparently only manifested after wait_iff_congested() was introduced
and the zone was full of anonymous memory that would not congest the
backing store.  The page allocator would infinitely loop if there were no
other tasks waiting to be scheduled and clear zone->pages_scanned because
of drain_all_pages() as the result of this change before kswapd could
scan enough pages to trigger the reclaim logic.  Additionally, with every
loop of the page allocator and in the reclaim path, kswapd would be
kicked and would end up running at 100% cpu.  In this scenario, current
and kswapd are all running continuously with kswapd incrementing
zone->pages_scanned and current clearing it.

The problem is even more pronounced when current swaps some of its memory
to swap cache and the reclaimable logic then considers all active
anonymous memory in the all_unreclaimable logic, which requires a much
higher zone->pages_scanned value for try_to_free_pages() to return zero
that is never attainable in this scenario.

Before wait_iff_congested(), the page allocator would incur an
unconditional timeout and allow kswapd to elevate zone->pages_scanned to
a level that the oom killer would be called the next time it loops.

The fix is to only attempt to drain pcp pages if there is actually a
quantity to be drained.  The unconditional clearing of
zone->pages_scanned in free_pcppages_bulk() need not be changed since
other callers already ensure that draining will occur.  This patch
ensures that free_pcppages_bulk() will actually free memory before
calling into it from drain_all_pages() so zone->pages_scanned is only
cleared if appropriate.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1088,8 +1088,10 @@ static void drain_pages(unsigned int cpu)
 		pset = per_cpu_ptr(zone->pageset, cpu);

 		pcp = &pset->pcp;
-		free_pcppages_bulk(zone, pcp->count, pcp);
-		pcp->count = 0;
+		if (pcp->count) {
+			free_pcppages_bulk(zone, pcp->count, pcp);
+			pcp->count = 0;
+		}
 		local_irq_restore(flags);
 	}
 }
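
The effect of the hunk above can be illustrated with a small userspace
model. It is only a sketch: `struct zone_sim`, `struct pcp_sim` and the
function names are hypothetical stand-ins, and the bulk-free helper does
nothing but mirror the unconditional pages_scanned reset described in the
changelog.

```c
#include <assert.h>

/* Hypothetical, stripped-down stand-ins for struct zone and the pcp list. */
struct zone_sim {
	unsigned long free_pages;
	unsigned long pages_scanned;
};

struct pcp_sim {
	int count;	/* pages sitting on the per-cpu list */
};

/* Mirrors the unconditional reset attributed to free_pcppages_bulk(). */
void free_bulk(struct zone_sim *zone, int count, struct pcp_sim *pcp)
{
	zone->pages_scanned = 0;
	zone->free_pages += (unsigned long)count;
	pcp->count -= count;
}

/* Old behaviour: always call into the bulk free, even with nothing to drain. */
void drain_old(struct zone_sim *zone, struct pcp_sim *pcp)
{
	free_bulk(zone, pcp->count, pcp);
	pcp->count = 0;
}

/* Patched behaviour: only drain, and hence only reset, when pages exist. */
void drain_new(struct zone_sim *zone, struct pcp_sim *pcp)
{
	if (pcp->count) {
		free_bulk(zone, pcp->count, pcp);
		pcp->count = 0;
	}
}
```

With an empty per-cpu list, drain_new() leaves pages_scanned untouched,
while drain_old() wipes whatever progress the scanner had accumulated.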


^ permalink raw reply

* [patch v2] mm: fix deferred congestion timeout if preferred zone is not allowed
From: David Rientjes @ 2011-01-23 22:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm
In-Reply-To: <alpine.DEB.2.00.1101172108380.29048@chino.kir.corp.google.com>

Before 0e093d99763e (writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone), preferred_zone was only used for
NUMA statistics, to determine the zoneidx from which to allocate given
the type requested, and to decide whether to utilize memory compaction.

wait_iff_congested(), though, uses preferred_zone to determine if the
congestion wait should be deferred because its dirty pages are backed by
a congested bdi.  This incorrectly defers the timeout and busy loops in
the page allocator with various cond_resched() calls if preferred_zone is
not allowed in the current context, usually consuming 100% of a cpu.

This patch ensures preferred_zone is an allowed zone in the fastpath
depending on whether current is constrained by its cpuset or nodes in
its mempolicy (when the nodemask passed is non-NULL).  This is correct
since the fastpath allocation always passes ALLOC_CPUSET when trying
to allocate memory.  In the slowpath, this patch resets preferred_zone
to the first zone of the allowed type when the allocation is not
constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET.

This patch also ensures preferred_zone is from the set of allowed nodes
when called from within direct reclaim since allocations are always
constrained by cpusets in this context (it is blockable).

Both of these uses of cpuset_current_mems_allowed are protected by
get_mems_allowed().

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   12 +++++++++++-
 mm/vmscan.c     |    3 ++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2034,6 +2034,14 @@ restart:
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);

+	/*
+	 * Find the true preferred zone if the allocation is unconstrained by
+	 * cpusets.
+	 */
+	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
+		first_zones_zonelist(zonelist, high_zoneidx, NULL,
+					&preferred_zone);
+
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -2192,7 +2200,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,

 	get_mems_allowed();
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
+	first_zones_zonelist(zonelist, high_zoneidx,
+				nodemask ? : &cpuset_current_mems_allowed,
+				&preferred_zone);
 	if (!preferred_zone) {
 		put_mems_allowed();
 		return NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2083,7 +2083,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			struct zone *preferred_zone;

 			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-							NULL, &preferred_zone);
+						&cpuset_current_mems_allowed,
+						&preferred_zone);
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
 		}
 	}
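
The selection rule the patch applies in the fast path can be sketched in
userspace as follows. This is an illustration only: nodes are plain
integers, the masks are single unsigned longs, and both function names are
hypothetical. A NULL mask means "every node is allowed", matching how a
NULL nodemask is treated in the calls above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the first node in the list allowed by *mask, or -1 if none.
 * A NULL mask allows every node.
 */
int first_allowed_node(const int *nodes, size_t n, const unsigned long *mask)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!mask || (*mask & (1UL << nodes[i])))
			return nodes[i];
	return -1;
}

/* Fast path after the patch: use the cpuset mask when no nodemask is given. */
int preferred_node(const int *nodes, size_t n,
		   const unsigned long *nodemask,
		   const unsigned long *cpuset_allowed)
{
	return first_allowed_node(nodes, n,
				  nodemask ? nodemask : cpuset_allowed);
}
```

Without the fallback, a task confined to node 2 would get node 0 as its
preferred node, and the congestion check would look at a zone the task can
never allocate from.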


^ permalink raw reply

* Re: [PATCH] ARM: mm: Regarding section when dealing with meminfo
From: Russell King - ARM Linux @ 2011-01-23 18:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: KyongHo Cho, Kukjin Kim, KeyYoung Park, linux-kernel, Ilho Lee,
	linux-mm, linux-samsung-soc, linux-arm-kernel
In-Reply-To: <1295547087.9039.694.camel@nimitz>

On Thu, Jan 20, 2011 at 10:11:27AM -0800, Dave Hansen wrote:
> On Thu, 2011-01-20 at 18:01 +0000, Russell King - ARM Linux wrote:
> > > The x86 version of show_mem() actually manages to do this without any
> > > #ifdefs, and works for a ton of configuration options.  It uses
> > > pfn_valid() to tell whether it can touch a given pfn.
> > 
> > x86 memory layout tends to be very simple as it expects memory to
> > start at the beginning of every region described by a pgdat and extend
> > in one contiguous block.  I wish ARM was that simple.
> 
> x86 memory layouts can be pretty funky and have been that way for a long
> time.  That's why we *have* to handle holes in x86's show_mem().  My
> laptop even has a ~1GB hole in its ZONE_DMA32:

If x86 is so funky, I suggest you try the x86 version of show_mem()
on an ARM platform with memory holes.  Make sure you try it with
sparsemem as well...


^ permalink raw reply

* Re: [PATCH 00/21] mm: Preemptibility -v6
From: Peter Zijlstra @ 2011-01-23 11:03 UTC (permalink / raw)
  To: paulmck
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt, David Miller,
	Nick Piggin, Martin Schwidefsky, linux-kernel, linux-arch,
	linux-mm, Andrea Arcangeli, Oleg Nesterov
In-Reply-To: <20110122210623.GR17752@linux.vnet.ibm.com>

On Sat, 2011-01-22 at 13:06 -0800, Paul E. McKenney wrote:

> OK, so the anon_vma slab cache is SLAB_DESTROY_BY_RCU.  Presumably
> all callers of page_lock_anon_vma() check the identity of the page
> that got locked, since it might be recycled at any time.  But when
> I look at 2.6.37, I only see checks for NULL.  So I am assuming
> that this code is supposed to prevent such recycling.
> 
> I am not sure that I am seeing a consistent snapshot of all of the
> relevant code, in particular, I am guessing that the ->lock and ->mutex
> are the result of changes rather than there really being both a spinlock
> and a mutex in anon_vma. 

Correct, my earlier spinlock -> mutex conversion left it being called
->lock, but Hugh (rightly) pointed out that I should rename it too, so
in the new (as yet unposted) version it's called ->mutex.

>  Mainline currently has a lock, FWIW.  But from
> what I do see, I am concerned about the following sequence of events:
> 
> o	CPU 0 starts executing page_lock_anon_vma() as shown at
> 	https://lkml.org/lkml/2010/11/26/213, fetches the pointer
> 	to anon_vma->root->lock, but does not yet invoke
> 	mutex_trylock().
> 
> o	CPU 1 executes __put_anon_vma() above on the same VMA
> 	that CPU 0 is attempting to use.  It sees that the
> 	anon_vma->root->mutex (presumably AKA ->lock) is not held,
> 	so it calls anon_vma_free().
> 
> o	CPU 2 reallocates the anon_vma freed by CPU 1, so that it
> 	now has a non-zero reference count.
> 
> o	CPU 0 continues execution, incorrectly acquiring a reference
> 	to the now-recycled anon_vma.
> 
> Or am I misunderstanding what this code is trying to do?

No, that is quite right and possible. It's one of the many subtle issues
surrounding the existing page_lock_anon_vma(): we can indeed return a
locked anon_vma that is not in fact related to the page we asked it for.
All calling code SHOULD, and afaict does, deal with that, mostly by
calling things like vma_address(vma, page) for all vmas obtained from
the anon_vma, to verify whether the page is indeed part of the vma.

The race we guard against with all the fancy stuff is the page itself
getting unmapped and us returning an anon_vma for an unmapped page.

And of course, returning a locked but free'd anon_vma, that too isn't
allowed ;-)
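
The "lock first, then check that you got what you asked for" pattern
described above can be sketched like this. It is a single-threaded
illustration under simplifying assumptions, not the rmap code: `struct obj`
is a hypothetical object from a type-stable cache, and an integer owner
field stands in for the vma_address()-style identity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type-stable object: memory stays valid, identity may change. */
struct obj {
	bool locked;
	int owner;	/* which user the slot currently belongs to */
};

/* Recycle the slot for a new owner, as a type-stable slab cache may do. */
void recycle(struct obj *o, int new_owner)
{
	o->owner = new_owner;
}

/*
 * Lock the object, then check that it still belongs to the expected owner.
 * Returns the locked object, or NULL (unlocked) if it was recycled meanwhile.
 */
struct obj *lock_and_validate(struct obj *o, int expected_owner)
{
	o->locked = true;
	if (o->owner != expected_owner) {
		o->locked = false;
		return NULL;
	}
	return o;
}
```

The point is that holding the lock only guarantees the memory is not freed
under you; whether the object is still the one you wanted has to be checked
separately.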



^ permalink raw reply

* Re: [PATCH 00/21] mm: Preemptibility -v6
From: Paul E. McKenney @ 2011-01-22 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Andrew Morton, Benjamin Herrenschmidt, David Miller,
	Nick Piggin, Martin Schwidefsky, linux-kernel, linux-arch,
	linux-mm, Andrea Arcangeli, Oleg Nesterov
In-Reply-To: <1295624034.28776.303.camel@laptop>

On Fri, Jan 21, 2011 at 04:33:54PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-01-20 at 11:57 -0800, Hugh Dickins wrote:
> > > > 21/21 mm-optimize_page_lock_anon_vma_fast-path.patch
> > > >       I certainly see the call for this patch, I want to eliminate those
> > > >       doubled atomics too.  This appears correct to me, and I've not dreamt
> > > >       up an alternative; but I do dislike it, and I suspect you don't like
> > > >       it much either.  I'm ambivalent about it, would love a better patch.
> > > 
> > > Like said, I fully agree with that sentiment, just haven't been able to
> > > come up with anything saner :/ Although I can optimize the
> > > __put_anon_vma() path a bit by doing something like:
> > > 
> > >   if (mutex_is_locked()) { anon_vma_lock(); anon_vma_unlock(); }
> > > 
> > > But I bet that wants a barrier someplace and my head hurts.. 
> > 
> > Without daring to hurt my head very much, yes, I'd say those kind
> > of "optimizations" have a habit of turning out to be racily wrong.
> > 
> > But you put your finger on it: if you hadn't had to add that lock-
> > unlock pair into __put_anon_vma(), I wouldn't have minded the
> > contortions added to page_lock_anon_vma(). 
> 
> I think there's just about enough implied barriers there that the
> 'simple' code just works ;-)
> 
> But given that I'm trying to think with snot for brains thanks to some
> cold, I don't trust myself at all to have gotten this right.
> 
> [ for Oleg and Paul: https://lkml.org/lkml/2010/11/26/213 contains the
> full patch this is against ]
> 
> ---
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -1559,9 +1559,20 @@ void __put_anon_vma(struct anon_vma *ano
>  	 * Synchronize against page_lock_anon_vma() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
> +	 *
> +	 * Relies on the full mb implied by the atomic_dec_and_test() from
> +	 * put_anon_vma() against the full mb implied by mutex_trylock() from
> +	 * page_lock_anon_vma(). This orders:
> +	 *
> +	 * page_lock_anon_vma()		VS	put_anon_vma()
> +	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 *   smp_mb()				  smp_mb()
> +	 *   atomic_read()			  mutex_is_locked()
>  	 */
> -	anon_vma_lock(anon_vma);
> -	anon_vma_unlock(anon_vma);
> +	if (mutex_is_locked(&anon_vma->root->mutex)) {
> +		anon_vma_lock(anon_vma);
> +		anon_vma_unlock(anon_vma);
> +	}
>  
>  	if (anon_vma->root != anon_vma)
>  		put_anon_vma(anon_vma->root);
> 

OK, so the anon_vma slab cache is SLAB_DESTROY_BY_RCU.  Presumably
all callers of page_lock_anon_vma() check the identity of the page
that got locked, since it might be recycled at any time.  But when
I look at 2.6.37, I only see checks for NULL.  So I am assuming
that this code is supposed to prevent such recycling.

I am not sure that I am seeing a consistent snapshot of all of the
relevant code, in particular, I am guessing that the ->lock and ->mutex
are the result of changes rather than there really being both a spinlock
and a mutex in anon_vma.  Mainline currently has a lock, FWIW.  But from
what I do see, I am concerned about the following sequence of events:

o	CPU 0 starts executing page_lock_anon_vma() as shown at
	https://lkml.org/lkml/2010/11/26/213, fetches the pointer
	to anon_vma->root->lock, but does not yet invoke
	mutex_trylock().

o	CPU 1 executes __put_anon_vma() above on the same VMA
	that CPU 0 is attempting to use.  It sees that the
	anon_vma->root->mutex (presumably AKA ->lock) is not held,
	so it calls anon_vma_free().

o	CPU 2 reallocates the anon_vma freed by CPU 1, so that it
	now has a non-zero reference count.

o	CPU 0 continues execution, incorrectly acquiring a reference
	to the now-recycled anon_vma.

Or am I misunderstanding what this code is trying to do?

							Thanx, Paul


^ permalink raw reply

* [PATCH] Fix uninitialized variable use in mm/memcontrol.c::mem_cgroup_move_parent()
From: Jesper Juhl @ 2011-01-22 19:51 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Balbir Singh, Daisuke Nishimura, KAMEZAWA Hiroyuki,
	Pavel Emelianov, Kirill A. Shutemov

In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps to 
the 'put_back' label
  	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
  	if (ret || !parent)
  		goto put_back;
 where we'll 
  	if (charge > PAGE_SIZE)
  		compound_unlock_irqrestore(page, flags);
but, we have not assigned anything to 'flags' at this point, nor have we 
called 'compound_lock_irqsave()' (which is what sets 'flags').
So, I believe the 'put_back' label should be moved below the call to 
compound_unlock_irqrestore() as per this patch. 

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
---
 memcontrol.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

  compile tested only.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index db76ef7..4fcf47a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2292,9 +2292,10 @@ static int mem_cgroup_move_parent(struct page_cgroup *pc,
 	ret = mem_cgroup_move_account(pc, child, parent, true, charge);
 	if (ret)
 		mem_cgroup_cancel_charge(parent, charge);
-put_back:
+
 	if (charge > PAGE_SIZE)
 		compound_unlock_irqrestore(page, flags);
+put_back:
 	putback_lru_page(page);
 put:
 	put_page(page);
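
The label ordering this patch establishes can be shown with a small
userspace sketch. Everything here is hypothetical and simplified: the
lock helpers only count depth and remember a fake flags value, and
`flags` is initialised to 0 so the sketch itself has no undefined
behaviour.

```c
#include <assert.h>
#include <stdbool.h>

int lock_depth;		/* stand-in for the compound lock state */

void lock_page_sim(unsigned long *flags)
{
	*flags = 0x200;	/* pretend these are the saved IRQ flags */
	lock_depth++;
}

void unlock_page_sim(unsigned long flags)
{
	assert(flags == 0x200);	/* catches flags that were never saved */
	lock_depth--;
}

/*
 * Returns 0 on success, -1 if the charge step fails. With put_back placed
 * after the unlock, the early exit never touches the unset 'flags'.
 */
int move_parent_sim(bool charge_fails, bool huge)
{
	unsigned long flags = 0;
	int ret = 0;

	if (charge_fails) {
		ret = -1;
		goto put_back;	/* lock not taken yet */
	}
	if (huge)
		lock_page_sim(&flags);

	/* ... move the accounting here ... */

	if (huge)
		unlock_page_sim(flags);
put_back:
	return ret;
}
```

Had put_back been placed before the unlock, the early-exit path for a huge
page would have called the unlock helper with flags that were never saved.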


-- 
Jesper Juhl <jj@chaosbits.net>            http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.


^ permalink raw reply related

* Re: [PATCH] mm: prevent concurrent unmap_mapping_range() on the same inode
From: Hugh Dickins @ 2011-01-22  4:46 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Christoph Hellwig, akpm, gurudas.pai, lkml20101129, linux-kernel,
	linux-mm
In-Reply-To: <E1PfvGx-00086O-IA@pomaz-ex.szeredi.hu>

On Thu, 20 Jan 2011, Miklos Szeredi wrote:
> On Thu, 20 Jan 2011, Christoph Hellwig wrote:
> > On Thu, Jan 20, 2011 at 01:30:58PM +0100, Miklos Szeredi wrote:
> > > 
> > > Truncate and hole punching already serialize with i_mutex.  Other
> > > callers of unmap_mapping_range() do not, and it's difficult to get
> > > i_mutex protection for all callers.  In particular ->d_revalidate(),
> > > which calls invalidate_inode_pages2_range() in fuse, may be called
> > > with or without i_mutex.
> > 
> > 
> > Which I think is mostly a fuse problem.  I really hate bloating the
> > generic inode (into which the address_space is embedded) with another
> > mutex for deficits in rather special case filesystems. 
> 
> As Hugh pointed out unmap_mapping_range() has grown a varied set of
> callers, which are difficult to fix up wrt i_mutex.  Fuse was just an
> example.
> 
> I don't like the bloat either, but this is the best I could come up
> with for fixing this problem generally.  If you have a better idea,
> please share it.

If we start from the point that this is mostly a fuse problem (I expect
that a thorough audit will show up a few other filesystems too, but
let's start from this point): you cite ->d_revalidate as a particular
problem, but can we fix up its call sites so that it is always called
either with, or much preferably without, i_mutex held?  Though actually
I couldn't find where ->d_revalidate() is called while holding i_mutex.

Failing that, can fuse down_write i_alloc_sem before calling
invalidate_inode_pages2(_range), to achieve the same exclusion?
The setattr truncation path takes i_alloc_sem as well as i_mutex,
though I'm not certain of its full coverage.

I did already consider holding and dropping i_alloc_sem inside
invalidate_inode_pages2_range(); but direct-io.c very much wants
to take mmap_sem (when get_user_pages_fast goes slow) after taking
i_alloc_sem, whereas fuse_direct_mmap() very much wants to call
invalidate_inode_pages2() while mmap_sem is held.
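
To illustrate the ordering conflict described above, here is a minimal
userspace sketch, not kernel code.  It assumes a toy order tracker (the
helpers acquire(), release(), direct_io_path() and fuse_mmap_path() are
hypothetical names invented for this example) and models what would
happen if invalidate_inode_pages2_range() took i_alloc_sem internally:
one path takes i_alloc_sem then mmap_sem, the other mmap_sem then
i_alloc_sem, which is the classic AB-BA inversion.

```c
#include <stdbool.h>

/* Hypothetical lock identifiers; they only mirror the names in the mail. */
enum lock_id { I_ALLOC_SEM, MMAP_SEM, NR_LOCKS };

/* seen_before[a][b] is true once lock b has been taken while a was held. */
static bool seen_before[NR_LOCKS][NR_LOCKS];
static bool held[NR_LOCKS];

/* Record an acquisition; return false if it inverts an order seen earlier. */
static bool acquire(enum lock_id l)
{
	bool ok = true;

	for (int h = 0; h < NR_LOCKS; h++) {
		if (!held[h] || h == (int)l)
			continue;
		if (seen_before[l][h])
			ok = false;	/* earlier: l then h; now: h then l */
		seen_before[h][l] = true;
	}
	held[l] = true;
	return ok;
}

static void release(enum lock_id l)
{
	held[l] = false;
}

/* direct-io style ordering: i_alloc_sem first, then mmap_sem. */
static bool direct_io_path(void)
{
	bool ok = acquire(I_ALLOC_SEM);

	ok = acquire(MMAP_SEM) && ok;
	release(MMAP_SEM);
	release(I_ALLOC_SEM);
	return ok;
}

/*
 * fuse mmap style ordering, ASSUMING the invalidation helper took
 * i_alloc_sem itself: mmap_sem is already held by the caller.
 */
static bool fuse_mmap_path(void)
{
	bool ok = acquire(MMAP_SEM);

	ok = acquire(I_ALLOC_SEM) && ok;
	release(I_ALLOC_SEM);
	release(MMAP_SEM);
	return ok;
}
```

Calling direct_io_path() first and fuse_mmap_path() second makes the
second call report the inversion; with real sleeping locks and two
tasks, that ordering can deadlock.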

Hugh


^ permalink raw reply

* Re: [BUG]thp: BUG at mm/huge_memory.c:1350
From: Minchan Kim @ 2011-01-22  2:20 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-mm
In-Reply-To: <20110122021647.GR9506@random.random>

On Sat, Jan 22, 2011 at 11:16 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Sat, Jan 22, 2011 at 02:08:20AM +0100, Andrea Arcangeli wrote:
>> Yeah x86 is not entirely broken, just some .config, and it's not
>
> As in this case, sometimes when I say x86 I mean x86_32; for clarity,
> x86_64 has never been affected by this, regardless of the .config.
>
>> common code bug (which is the most important thing!). I think it's a
>> bug in set_pmd_at when paravirt is set and PSA is off. If I'm right 4m
>> pages with PSA off should also work when disabling paravirt.
>
> You said PSA and I kept saying it, but I think we both meant
> PAE. There's PSE and PAE; PSA is a mix ;).

Yes. It was a typo. :)

>
>> I'm just trying to reproduce...
>
> Reproduced and fix posted in the other mail with lkml on CC. Hope it
> works!

Will test and report the result.

>
> Thanks,
> Andrea
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply

* Re: [BUG]thp: BUG at mm/huge_memory.c:1350
From: Andrea Arcangeli @ 2011-01-22  2:16 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm
In-Reply-To: <20110122010820.GP9506@random.random>

On Sat, Jan 22, 2011 at 02:08:20AM +0100, Andrea Arcangeli wrote:
> Yeah x86 is not entirely broken, just some .config, and it's not

As in this case, sometimes when I say x86 I mean x86_32; for clarity,
x86_64 has never been affected by this, regardless of the .config.

> common code bug (which is the most important thing!). I think it's a
> bug in set_pmd_at when paravirt is set and PSA is off. If I'm right 4m
> pages with PSA off should also work when disabling paravirt.

You said PSA and I kept saying it, but I think we both meant
PAE. There's PSE and PAE; PSA is a mix ;).

> I'm just trying to reproduce...

Reproduced and fix posted in the other mail with lkml on CC. Hope it
works!

Thanks,
Andrea


^ permalink raw reply
