FYI: mmap_sem OOM patch

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* FYI: mmap_sem OOM patch
@ 2010-07-07 23:11 Michel Lespinasse
  2010-07-08  0:16 ` KOSAKI Motohiro
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Michel Lespinasse @ 2010-07-07 23:11 UTC (permalink / raw)
  To: linux-mm, KOSAKI Motohiro; +Cc: LKML, Divyesh Shah

As I mentioned this in my MM discussion proposal, I am sending the following
patch for discussion.

This helps some workloads we use, that are normally run with little memory
headroom and get occasionally pushed over the edge to OOM. In some cases,
the OOMing thread can't exit because it needs to acquire the mmap_sem,
while the thread holding mmap_sem can't progress because it needs to
allocate memory.

IIRC we first encountered this against our 2.6.18 kernel; this has been
reproduced with 2.6.33 (sorry, we'll have to clean up that test case a bit
before I can send it); and following patch is against 2.6.34 plus
down_read_unfair changes (which have been discussed here before, not
resending them at this point unless someone asks).

Note that while this patch modifies some performance critical MM functions,
it does make sure to stay out of the fast paths.

Use down_read_unfair() in OOM-killed tasks

When tasks are being targeted by OOM killer, we want to use
down_read_unfair to ensure a fast exit.

- The down_read_unfair() call in fast_get_user_pages() is to cover the
  futex calls in mm_release() (including exit_robust_list(),
  compat_exit_robust_list(), exit_pi_state_list() and sys_futex())

- The down_read_unfair() call in do_page_fault() is to cover the
  put_user() call in mm_release() and exiting the robust futex lists.

This is a rework of a previous change in 2.6.18:

Change the down_read()s in the exit path to be unfair so that we do
not result in a potentially 3-to-4-way deadlock during the ooms.

What happens is we end up with a single thread in the oom loop (T1)
that ends up killing a sibling thread (T2).  That sibling thread will
need to acquire the read side of the mmap_sem in the exit path.  It's
possible however that yet a different thread (T3) is in the middle of
a virtual address space operation (mmap, munmap) and is enqueue to
grab the write side of the mmap_sem behind yet another thread (T4)
that is stuck in the OOM loop (behind T1) with mmap_sem held for read
(like allocating a page for pagecache as part of a fault.

      T1              T2              T3              T4
      .               .               .               .
   oom:               .               .               .
   oomkill            .               .               .
      ^    \          .               .               .
     /|\    ---->  do_exit:           .               .
      |            sleep in           .               .
      |            read(mmap_sem)     .               .
      |                     \         .               .
      |                      ----> mmap               .
      |                            sleep in           .
      |                            write(mmap_sem)    .
      |                                     \         .
      |                                      ----> fault
      |                                            holding read(mmap_sem)
      |                                            oom
      |                                               |
      |                                               /
      \----------------------------------------------/

Signed-off-by: Michel Lespinasse <walken@google.com>

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f627779..4b3a1c7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 			bad_area_nosemaphore(regs, error_code, address);
 			return;
 		}
-		down_read(&mm->mmap_sem);
+		if (test_thread_flag(TIF_MEMDIE))
+			down_read_unfair(&mm->mmap_sem);
+		else
+			down_read(&mm->mmap_sem);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 738e659..578d1a7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -357,7 +357,10 @@ slow_irqon:
 		start += nr << PAGE_SHIFT;
 		pages += nr;

-		down_read(&mm->mmap_sem);
+		if (unlikely(test_thread_flag(TIF_MEMDIE)))
+			down_read_unfair(&mm->mmap_sem);
+		else
+			down_read(&mm->mmap_sem);
 		ret = get_user_pages(current, mm, start,
 			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
 		up_read(&mm->mmap_sem);
diff --git a/kernel/exit.c b/kernel/exit.c
index 7f2683a..2318d3a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -664,7 +664,7 @@ static void exit_mm(struct task_struct * tsk)
 	 * will increment ->nr_threads for each thread in the
 	 * group with ->mm != NULL.
 	 */
-	down_read(&mm->mmap_sem);
+	down_read_unfair(&mm->mmap_sem);
 	core_state = mm->core_state;
 	if (core_state) {
 		struct core_thread self;

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-07 23:11 FYI: mmap_sem OOM patch Michel Lespinasse
@ 2010-07-08  0:16 ` KOSAKI Motohiro
  2010-07-08  1:11   ` KOSAKI Motohiro
  2010-07-08  9:02 ` Peter Zijlstra
  2010-07-08 10:30 ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08  0:16 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: kosaki.motohiro, linux-mm, LKML, Divyesh Shah

Hi

> As I mentioned this in my MM discussion proposal, I am sending the following
> patch for discussion.
> 
> This helps some workloads we use, that are normally run with little memory
> headroom and get occasionally pushed over the edge to OOM. In some cases,
> the OOMing thread can't exit because it needs to acquire the mmap_sem,
> while the thread holding mmap_sem can't progress because it needs to
> allocate memory.
> 
> IIRC we first encountered this against our 2.6.18 kernel; this has been
> reproduced with 2.6.33 (sorry, we'll have to clean up that test case a bit
> before I can send it); and following patch is against 2.6.34 plus
> down_read_unfair changes (which have been discussed here before, not
> resending them at this point unless someone asks).
> 
> Note that while this patch modifies some performance critical MM functions,
> it does make sure to stay out of the fast paths.
> 
> 
> Use down_read_unfair() in OOM-killed tasks
> 
> When tasks are being targeted by OOM killer, we want to use
> down_read_unfair to ensure a fast exit.

Yup.
If admins want to kill memory hogging process manually when the system
is under heavy swap thrashing, we will face the same problem, need 
unfairness and fast exit. So, unfair exiting design looks very good.

If you will updated the description, I'm glad :)


> 
> - The down_read_unfair() call in fast_get_user_pages() is to cover the
>   futex calls in mm_release() (including exit_robust_list(),
>   compat_exit_robust_list(), exit_pi_state_list() and sys_futex())
> 
> - The down_read_unfair() call in do_page_fault() is to cover the
>   put_user() call in mm_release() and exiting the robust futex lists.

I have opposite question. Which case do we avoid to use down_read_unfair()?
You already change the fast path, I guess the reason is not for performance.
but I'm not sure exactly reason.


In other word, can we makes following or something like macro and
convert all caller?


static inline void down_read_mmap_sem(void) {
	if (unlikely(test_thread_flag(TIF_MEMDIE)))
		down_read_unfair(&mm->mmap_sem);
	else
		down_read(&mm->mmap_sem);
}




> 
> This is a rework of a previous change in 2.6.18:
> 
> Change the down_read()s in the exit path to be unfair so that we do
> not result in a potentially 3-to-4-way deadlock during the ooms.
> 
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2).  That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path.  It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
> 
>       T1              T2              T3              T4
>       .               .               .               .
>    oom:               .               .               .
>    oomkill            .               .               .
>       ^    \          .               .               .
>      /|\    ---->  do_exit:           .               .
>       |            sleep in           .               .
>       |            read(mmap_sem)     .               .
>       |                     \         .               .
>       |                      ----> mmap               .
>       |                            sleep in           .
>       |                            write(mmap_sem)    .
>       |                                     \         .
>       |                                      ----> fault
>       |                                            holding read(mmap_sem)
>       |                                            oom
>       |                                               |
>       |                                               /
>       \----------------------------------------------/
> 
> Signed-off-by: Michel Lespinasse <walken@google.com>
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..4b3a1c7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  			bad_area_nosemaphore(regs, error_code, address);
>  			return;
>  		}
> -		down_read(&mm->mmap_sem);
> +		if (test_thread_flag(TIF_MEMDIE))
> +			down_read_unfair(&mm->mmap_sem);
> +		else
> +			down_read(&mm->mmap_sem);
>  	} else {
>  		/*
>  		 * The above down_read_trylock() might have succeeded in
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 738e659..578d1a7 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -357,7 +357,10 @@ slow_irqon:
>  		start += nr << PAGE_SHIFT;
>  		pages += nr;
>  
> -		down_read(&mm->mmap_sem);
> +		if (unlikely(test_thread_flag(TIF_MEMDIE)))
> +			down_read_unfair(&mm->mmap_sem);
> +		else
> +			down_read(&mm->mmap_sem);
>  		ret = get_user_pages(current, mm, start,
>  			(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
>  		up_read(&mm->mmap_sem);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 7f2683a..2318d3a 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -664,7 +664,7 @@ static void exit_mm(struct task_struct * tsk)
>  	 * will increment ->nr_threads for each thread in the
>  	 * group with ->mm != NULL.
>  	 */
> -	down_read(&mm->mmap_sem);
> +	down_read_unfair(&mm->mmap_sem);
>  	core_state = mm->core_state;
>  	if (core_state) {
>  		struct core_thread self;

Agreed. exit_mm() should use down_read_unfair() always.
But, I think it need the explicit comment. please.



btw, MM developers only want to review mm part. can you please get ack
about down_read_unfair() itself from another developers (perhaps, 
David Howells or shomeone else).


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08  0:16 ` KOSAKI Motohiro
@ 2010-07-08  1:11   ` KOSAKI Motohiro
  0 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08  1:11 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Michel Lespinasse, linux-mm, LKML, Divyesh Shah

> Yup.
> If admins want to kill memory hogging process manually when the system
> is under heavy swap thrashing, we will face the same problem, need 
> unfairness and fast exit. So, unfair exiting design looks very good.
> 
> If you will updated the description, I'm glad :)

I have one more topic. can we check fatal_signal_pending() instead TIF_MEMDIE?
As I said, if the process received SIGKILL, do the fork/exec/brk/mmap 
starvations have any problem?

Thanks.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-07 23:11 FYI: mmap_sem OOM patch Michel Lespinasse
  2010-07-08  0:16 ` KOSAKI Motohiro
@ 2010-07-08  9:02 ` Peter Zijlstra
  2010-07-08  9:24   ` KOSAKI Motohiro
  2010-07-08 10:30 ` Peter Zijlstra
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08  9:02 UTC (permalink / raw)
  To: Michel Lespinasse; +Cc: linux-mm, KOSAKI Motohiro, LKML, Divyesh Shah

On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..4b3a1c7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  			bad_area_nosemaphore(regs, error_code, address);
>  			return;
>  		}
> -		down_read(&mm->mmap_sem);
> +		if (test_thread_flag(TIF_MEMDIE))
> +			down_read_unfair(&mm->mmap_sem);
> +		else
> +			down_read(&mm->mmap_sem);
>  	} else {
>  		/*
>  		 * The above down_read_trylock() might have succeeded in

I still think adding that _unfair interface is asking for trouble.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08  9:02 ` Peter Zijlstra
@ 2010-07-08  9:24   ` KOSAKI Motohiro
  2010-07-08  9:35     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08  9:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah

> On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index f627779..4b3a1c7 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> >  			bad_area_nosemaphore(regs, error_code, address);
> >  			return;
> >  		}
> > -		down_read(&mm->mmap_sem);
> > +		if (test_thread_flag(TIF_MEMDIE))
> > +			down_read_unfair(&mm->mmap_sem);
> > +		else
> > +			down_read(&mm->mmap_sem);
> >  	} else {
> >  		/*
> >  		 * The above down_read_trylock() might have succeeded in
> 
> I still think adding that _unfair interface is asking for trouble.

Can you please explain trouble that you worry? Why do we need to keep
thread fairness when OOM case?


btw, I also dislike unfair + /proc combination.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08  9:24   ` KOSAKI Motohiro
@ 2010-07-08  9:35     ` Peter Zijlstra
  2010-07-08  9:51       ` KOSAKI Motohiro
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08  9:35 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Michel Lespinasse, linux-mm, LKML, Divyesh Shah

On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> > 
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index f627779..4b3a1c7 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > >  			bad_area_nosemaphore(regs, error_code, address);
> > >  			return;
> > >  		}
> > > -		down_read(&mm->mmap_sem);
> > > +		if (test_thread_flag(TIF_MEMDIE))
> > > +			down_read_unfair(&mm->mmap_sem);
> > > +		else
> > > +			down_read(&mm->mmap_sem);
> > >  	} else {
> > >  		/*
> > >  		 * The above down_read_trylock() might have succeeded in
> > 
> > I still think adding that _unfair interface is asking for trouble.
> 
> Can you please explain trouble that you worry? Why do we need to keep
> thread fairness when OOM case?

Just the whole concept of the unfair thing offends me ;-) I didn't
really look at the particular application in this case.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08  9:35     ` Peter Zijlstra
@ 2010-07-08  9:51       ` KOSAKI Motohiro
  0 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah

> On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> > > 
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index f627779..4b3a1c7 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > >  			bad_area_nosemaphore(regs, error_code, address);
> > > >  			return;
> > > >  		}
> > > > -		down_read(&mm->mmap_sem);
> > > > +		if (test_thread_flag(TIF_MEMDIE))
> > > > +			down_read_unfair(&mm->mmap_sem);
> > > > +		else
> > > > +			down_read(&mm->mmap_sem);
> > > >  	} else {
> > > >  		/*
> > > >  		 * The above down_read_trylock() might have succeeded in
> > > 
> > > I still think adding that _unfair interface is asking for trouble.
> > 
> > Can you please explain trouble that you worry? Why do we need to keep
> > thread fairness when OOM case?
> 
> Just the whole concept of the unfair thing offends me ;-) I didn't
> really look at the particular application in this case.

I see. 
Yup, I agree unfair thing concept is a bit ugly. If anyone have 
alternative idea, I agree to choose that thing.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-07 23:11 FYI: mmap_sem OOM patch Michel Lespinasse
  2010-07-08  0:16 ` KOSAKI Motohiro
  2010-07-08  9:02 ` Peter Zijlstra
@ 2010-07-08 10:30 ` Peter Zijlstra
       [not found]   ` <AANLkTimLSnNot2byTWYuIHE8rhGLXbl1zKsQQhmci1Do@mail.gmail.com>
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08 10:30 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, KOSAKI Motohiro, LKML, Divyesh Shah, Ingo Molnar

On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2).  That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path.  It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
> 
>       T1              T2              T3              T4
>       .               .               .               .
>    oom:               .               .               .
>    oomkill            .               .               .
>       ^    \          .               .               .
>      /|\    ---->  do_exit:           .               .
>       |            sleep in           .               .
>       |            read(mmap_sem)     .               .
>       |                     \         .               .
>       |                      ----> mmap               .
>       |                            sleep in           .
>       |                            write(mmap_sem)    .
>       |                                     \         .
>       |                                      ----> fault
>       |                                            holding read(mmap_sem)
>       |                                            oom
>       |                                               |
>       |                                               /
>       \----------------------------------------------/ 

So what you do is use recursive locking to side-step a deadlock.
Recursive locking is poor taste and leads to very ill defined locking
rules.

One way to fix this is to have T4 wake from the oom queue and return an
allocation failure instead of insisting on going oom itself when T1
decides to take down the task.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
       [not found]   ` <AANLkTimLSnNot2byTWYuIHE8rhGLXbl1zKsQQhmci1Do@mail.gmail.com>
@ 2010-07-08 10:49     ` Peter Zijlstra
  2010-07-08 10:57       ` KOSAKI Motohiro
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08 10:49 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: linux-mm, KOSAKI Motohiro, LKML, Divyesh Shah, Ingo Molnar

On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> 
> 
>         One way to fix this is to have T4 wake from the oom queue and return an
>         allocation failure instead of insisting on going oom itself when T1
>         decides to take down the task.
> 
> How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4... 

If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
down the whole process by sending around signals of sorts (SIGKILL?), so
if T4 gets a fatal signal while it is waiting to enter the oom thingy,
have it abort and return an allocation failure.

That alloc failure (along with a pending fatal signal) will very likely
lead to the release of its mmap_sem (if not, there's more things to
cure).

At which point the cycle is broken an stuff continues as it was
intended.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08 10:49     ` Peter Zijlstra
@ 2010-07-08 10:57       ` KOSAKI Motohiro
  2010-07-08 11:02         ` Peter Zijlstra
       [not found]         ` <AANLkTimArLPHrxHNEejiXKNYk9To6qsjglbgzyypXP-c@mail.gmail.com>
  0 siblings, 2 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08 10:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah,
	Ingo Molnar

> On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> > 
> > 
> >         One way to fix this is to have T4 wake from the oom queue and return an
> >         allocation failure instead of insisting on going oom itself when T1
> >         decides to take down the task.
> > 
> > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4... 
> 
> If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> down the whole process by sending around signals of sorts (SIGKILL?), so
> if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> have it abort and return an allocation failure.
> 
> That alloc failure (along with a pending fatal signal) will very likely
> lead to the release of its mmap_sem (if not, there's more things to
> cure).
> 
> At which point the cycle is broken an stuff continues as it was
> intended.

Now, I've reread current code. I think mmotm already have this.


T4 call out_of_memory and get TIF_MEMDIE
=========================================================
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
                int order, nodemask_t *nodemask)
{
(snip)
        /*
         * If current has a pending SIGKILL, then automatically select it.  The
         * goal is to allow it to allocate so that it may quickly exit and free
         * its memory.
         */
        if (fatal_signal_pending(current)) {
                set_thread_flag(TIF_MEMDIE);
                boost_dying_task_prio(current, NULL);
                return;
        }
==================================================================


alloc_pages immediately return if the task have TIF_MEMDIE
==================================================================
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
        struct zonelist *zonelist, enum zone_type high_zoneidx,
        nodemask_t *nodemask, struct zone *preferred_zone,
        int migratetype)
{
(snip)
        /* Avoid allocations with no watermarks from looping endlessly */
        if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
                goto nopage;
==========================================================================


Thought?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08 10:57       ` KOSAKI Motohiro
@ 2010-07-08 11:02         ` Peter Zijlstra
  2010-07-08 11:06           ` KOSAKI Motohiro
       [not found]         ` <AANLkTimArLPHrxHNEejiXKNYk9To6qsjglbgzyypXP-c@mail.gmail.com>
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08 11:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Michel Lespinasse, linux-mm, LKML, Divyesh Shah, Ingo Molnar

On Thu, 2010-07-08 at 19:57 +0900, KOSAKI Motohiro wrote:
> > On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> > > 
> > > 
> > >         One way to fix this is to have T4 wake from the oom queue and return an
> > >         allocation failure instead of insisting on going oom itself when T1
> > >         decides to take down the task.
> > > 
> > > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4... 
> > 
> > If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> > down the whole process by sending around signals of sorts (SIGKILL?), so
> > if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> > have it abort and return an allocation failure.
> > 
> > That alloc failure (along with a pending fatal signal) will very likely
> > lead to the release of its mmap_sem (if not, there's more things to
> > cure).
> > 
> > At which point the cycle is broken an stuff continues as it was
> > intended.
> 
> Now, I've reread current code. I think mmotm already have this.

<snip code>

[ small note on that we really should kill __GFP_NOFAIL, its utter
deadlock potential ]

> Thought?

So either its not working or google never tried that code?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08 11:02         ` Peter Zijlstra
@ 2010-07-08 11:06           ` KOSAKI Motohiro
  2010-07-08 11:23             ` Peter Zijlstra
  2010-07-12 21:47             ` David Rientjes
  0 siblings, 2 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-08 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah,
	Ingo Molnar

> On Thu, 2010-07-08 at 19:57 +0900, KOSAKI Motohiro wrote:
> > > On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> > > > 
> > > > 
> > > >         One way to fix this is to have T4 wake from the oom queue and return an
> > > >         allocation failure instead of insisting on going oom itself when T1
> > > >         decides to take down the task.
> > > > 
> > > > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4... 
> > > 
> > > If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> > > down the whole process by sending around signals of sorts (SIGKILL?), so
> > > if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> > > have it abort and return an allocation failure.
> > > 
> > > That alloc failure (along with a pending fatal signal) will very likely
> > > lead to the release of its mmap_sem (if not, there's more things to
> > > cure).
> > > 
> > > At which point the cycle is broken an stuff continues as it was
> > > intended.
> > 
> > Now, I've reread current code. I think mmotm already have this.
> 
> <snip code>
> 
> [ small note on that we really should kill __GFP_NOFAIL, its utter
> deadlock potential ]

I disagree. __GFP_NOFAIL mean this allocation failure can makes really
dangerous result. Instead, OOM-Killer should try to kill next process.
I think.

> > Thought?
> 
> So either its not working or google never tried that code?

Michel?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08 11:06           ` KOSAKI Motohiro
@ 2010-07-08 11:23             ` Peter Zijlstra
  2010-07-09  1:31               ` KOSAKI Motohiro
  2010-07-12 21:47             ` David Rientjes
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08 11:23 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Michel Lespinasse, linux-mm, LKML, Divyesh Shah, Ingo Molnar

On Thu, 2010-07-08 at 20:06 +0900, KOSAKI Motohiro wrote:
> > [ small note on that we really should kill __GFP_NOFAIL, its utter
> > deadlock potential ]
> 
> I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> dangerous result. Instead, OOM-Killer should try to kill next process.
> I think. 

Say _what_?! you think NOFAIL is a sane thing? Pretty much everybody has
been agreeing for years that the thing should die.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
       [not found]         ` <AANLkTimArLPHrxHNEejiXKNYk9To6qsjglbgzyypXP-c@mail.gmail.com>
@ 2010-07-08 12:43           ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2010-07-08 12:43 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: KOSAKI Motohiro, linux-mm, LKML, Divyesh Shah, Ingo Molnar

Could you please educate your mailer to not send HTML garbage?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08 11:23             ` Peter Zijlstra
@ 2010-07-09  1:31               ` KOSAKI Motohiro
  0 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-09  1:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah,
	Ingo Molnar

> On Thu, 2010-07-08 at 20:06 +0900, KOSAKI Motohiro wrote:
> > > [ small note on that we really should kill __GFP_NOFAIL, its utter
> > > deadlock potential ]
> > 
> > I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> > dangerous result. Instead, OOM-Killer should try to kill next process.
> > I think. 
> 
> Say _what_?! you think NOFAIL is a sane thing? 

insane obviously ;)
but as far as my experience, some embedded system prefer to use NOFAIL.
So, I don't like to make big hammer crash. NOFAIL killing need long year
rather than you expected, I guess.


> Pretty much everybody has
> been agreeing for years that the thing should die.

I'm not against this at all. but until it die, it should works correctly.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-08 11:06           ` KOSAKI Motohiro
  2010-07-08 11:23             ` Peter Zijlstra
@ 2010-07-12 21:47             ` David Rientjes
  2010-07-13  0:14               ` KOSAKI Motohiro
  1 sibling, 1 reply; 18+ messages in thread
From: David Rientjes @ 2010-07-12 21:47 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Michel Lespinasse, linux-mm, LKML, Divyesh Shah,
	Ingo Molnar

On Thu, 8 Jul 2010, KOSAKI Motohiro wrote:

> I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> dangerous result. Instead, OOM-Killer should try to kill next process.
> I think.
> 

That's not what happens, __alloc_pages_high_priority() will loop forever 
for __GFP_NOFAIL, the oom killer is never recalled.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-12 21:47             ` David Rientjes
@ 2010-07-13  0:14               ` KOSAKI Motohiro
  2010-07-13 21:08                 ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: KOSAKI Motohiro @ 2010-07-13  0:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Peter Zijlstra, Michel Lespinasse, linux-mm,
	LKML, Divyesh Shah, Ingo Molnar

> On Thu, 8 Jul 2010, KOSAKI Motohiro wrote:
> 
> > I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> > dangerous result. Instead, OOM-Killer should try to kill next process.
> > I think.
> > 
> 
> That's not what happens, __alloc_pages_high_priority() will loop forever 
> for __GFP_NOFAIL, the oom killer is never recalled.

Yup, please reread the discusstion.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: FYI: mmap_sem OOM patch
  2010-07-13  0:14               ` KOSAKI Motohiro
@ 2010-07-13 21:08                 ` David Rientjes
  0 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2010-07-13 21:08 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Michel Lespinasse, linux-mm, LKML, Divyesh Shah,
	Ingo Molnar

On Tue, 13 Jul 2010, KOSAKI Motohiro wrote:

> > > I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> > > dangerous result. Instead, OOM-Killer should try to kill next process.
> > > I think.
> > > 
> > 
> > That's not what happens, __alloc_pages_high_priority() will loop forever 
> > for __GFP_NOFAIL, the oom killer is never recalled.
> 
> Yup, please reread the discusstion.
> 

There's nothing in the discussion that addresses the fact that 
__alloc_pages_high_priority() loops infinitely for __GFP_NOFAIL without 
again calling the oom killer.  You may be proposing a change to that when 
you said "OOM-Killer should try to kill next process. I think," but I 
can't speculate on that.  If you are, please propose a patch.

The success of whether we can allocate without watermarks is dependent on 
where we set them and what types of exclusions we allow.  This isn't local 
only to the page allocator but rather to the entire kernel since 
GFP_ATOMIC allocations, for example, can deplete the min watermark by 
~60%.  The remaining memory is what we allow access to for 
ALLOC_NO_WATERMARKS: those allocations in the direct reclaim patch and 
those that have been oom killed.

Thus, it's important that oom killed tasks do not completely deplete 
memory reserves, otherwise __GFP_NOFAIL allocations will loop forever 
without killing additional tasks, as you say.  That's sane since oom 
killing additional tasks wouldn't allow them access to any additional 
memory, anyway, so the allocation would still fail (we only call the oom 
killer for those allocations that are retried) and the victim could not 
exit.

With that said, I'm not exactly sure what change you're proposing when you 
say the oom killer should try to kill another process because it won't 
lead to any future memory freeing unless the victim can exit without 
allocating memory.  If that's not possible, you've proliferated a ton of 
innocent kills that were triggered because of one __GFP_NOFAIL attempt.

I agree with Peter that it would be ideal to remove __GFP_NOFAIL, but I 
think we require changes in the retry logic first before that is possible.  
Right now, we insist on retrying all blockable !__GFP_NORETRY allocations 
under PAGE_ALLOC_COSTLY_ORDER indefinitely.  That, in a sense, is already 
__GFP_NOFAIL behavior that is implicit: that's why we typically see 
__GFP_NOFAIL with GFP_NOFS instead.  With GFP_NOFS, we never kill the oom 
killer in the first place, so memory allocation is only more successful on 
a subsequent attempt by either direct reclaim or memory compaction.

There's nothing preventing users from doing

	do {
		page = alloc_page(GFP_KERNEL);
	} while (!page);

if the retry logic were reworked to start failing allocations or we 
removed __GFP_NOFAIL.  Thus, I think __GFP_NOFAIL should be substituted 
with a different gfp flag that acts similar but failable: use compaction, 
reclaim, and the oom killer where possible and in that order if there is 
no success and then retry to a high watermark one final time.  If the 
allocation is still unsuccessful, return NULL.  This allows users to do 
what getblk() does, for instance, by implementing their own memory freeing 
functions.  It also allows them to use all of the page allocator's powers 
(compaction, reclaim, oom killer) without infinitely looping by either 
using an order under PAGE_ALLOC_COSTLY_ORDER or insisting on __GFP_NOFAIL.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-07-13 21:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-07 23:11 FYI: mmap_sem OOM patch Michel Lespinasse
2010-07-08  0:16 ` KOSAKI Motohiro
2010-07-08  1:11   ` KOSAKI Motohiro
2010-07-08  9:02 ` Peter Zijlstra
2010-07-08  9:24   ` KOSAKI Motohiro
2010-07-08  9:35     ` Peter Zijlstra
2010-07-08  9:51       ` KOSAKI Motohiro
2010-07-08 10:30 ` Peter Zijlstra
     [not found]   ` <AANLkTimLSnNot2byTWYuIHE8rhGLXbl1zKsQQhmci1Do@mail.gmail.com>
2010-07-08 10:49     ` Peter Zijlstra
2010-07-08 10:57       ` KOSAKI Motohiro
2010-07-08 11:02         ` Peter Zijlstra
2010-07-08 11:06           ` KOSAKI Motohiro
2010-07-08 11:23             ` Peter Zijlstra
2010-07-09  1:31               ` KOSAKI Motohiro
2010-07-12 21:47             ` David Rientjes
2010-07-13  0:14               ` KOSAKI Motohiro
2010-07-13 21:08                 ` David Rientjes
     [not found]         ` <AANLkTimArLPHrxHNEejiXKNYk9To6qsjglbgzyypXP-c@mail.gmail.com>
2010-07-08 12:43           ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).