* FYI: mmap_sem OOM patch
From: Michel Lespinasse @ 2010-07-07 23:11 UTC (permalink / raw)
To: linux-mm, KOSAKI Motohiro; +Cc: LKML, Divyesh Shah
As I mentioned in my MM discussion proposal, I am sending the
following patch for discussion.
This helps some workloads we use, which normally run with little
memory headroom and occasionally get pushed over the edge into OOM. In
some cases, the OOMing thread can't exit because it needs to acquire
the mmap_sem, while the thread holding mmap_sem can't make progress
because it needs to allocate memory.
IIRC we first encountered this against our 2.6.18 kernel; it has been
reproduced with 2.6.33 (sorry, we'll have to clean up that test case a
bit before I can send it); and the following patch is against 2.6.34
plus the down_read_unfair changes (which have been discussed here
before; I'm not resending them at this point unless someone asks).
Note that while this patch modifies some performance critical MM functions,
it does make sure to stay out of the fast paths.
Use down_read_unfair() in OOM-killed tasks
When tasks are targeted by the OOM killer, we want to use
down_read_unfair() to ensure a fast exit.
- The down_read_unfair() call in fast_get_user_pages() is to cover the
futex calls in mm_release() (including exit_robust_list(),
compat_exit_robust_list(), exit_pi_state_list() and sys_futex())
- The down_read_unfair() call in do_page_fault() is to cover the
put_user() call in mm_release() and exiting the robust futex lists.
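To make the fairness distinction concrete, here is a user-space sketch
(the struct and helper names below are hypothetical, invented for
illustration; down_read_unfair() itself is the out-of-tree primitive
under discussion): a fair rwsem makes a new reader queue behind any
waiting writer, while the unfair acquisition only blocks on an active
writer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of rwsem state. */
struct rwsem_model {
	int  readers;        /* tasks currently holding the lock for read */
	bool writer_active;  /* a writer currently holds the lock         */
	bool writer_queued;  /* a writer is waiting in the FIFO queue     */
};

/* Fair down_read(): a new reader must queue behind any waiting
 * writer, so queued writers are never starved by readers. */
static bool fair_read_acquires(const struct rwsem_model *s)
{
	return !s->writer_active && !s->writer_queued;
}

/* down_read_unfair(): only an active writer blocks the reader; it
 * jumps ahead of queued writers, which is what the exiting task
 * needs in order to make progress. */
static bool unfair_read_acquires(const struct rwsem_model *s)
{
	return !s->writer_active;
}
```

With one reader holding the lock and a writer queued, the fair reader
blocks while the unfair reader gets in immediately; that is the whole
point of using it on the exit path.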
This is a rework of a previous change in 2.6.18:
Change the down_read()s in the exit path to be unfair so that we do
not result in a potentially 3-to-4-way deadlock during the ooms.
What happens is we end up with a single thread in the oom loop (T1)
that ends up killing a sibling thread (T2). That sibling thread will
need to acquire the read side of the mmap_sem in the exit path. It's
possible however that yet a different thread (T3) is in the middle of
a virtual address space operation (mmap, munmap) and is enqueued to
grab the write side of the mmap_sem behind yet another thread (T4)
that is stuck in the OOM loop (behind T1) with mmap_sem held for read
(like allocating a page for pagecache as part of a fault).
 T1          T2              T3              T4
 .           .               .               .
 oom:        .               .               .
  oomkill    .               .               .
  ^ \        .               .               .
 /|\ ----> do_exit:          .               .
  |        sleep in          .               .
  |        read(mmap_sem)    .               .
  |          \               .               .
  |           ----> mmap                     .
  |                 sleep in                 .
  |                 write(mmap_sem)          .
  |                   \                      .
  |                    ----> fault
  |                          holding read(mmap_sem)
  |                          oom
  |                            |
  |                            /
  \---------------------------/
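The cycle in the diagram can be checked mechanically. The sketch below
(a hypothetical user-space model, thread names as in the diagram)
encodes each thread's wait-for edge and shows that removing T2's edge,
which is exactly what down_read_unfair() does by letting the exiting
reader jump the queued writer, breaks the deadlock.

```c
#include <assert.h>
#include <stdbool.h>

enum { T1, T2, T3, T4, NTHREADS };

/* waits_for[i] = the thread that i is blocked on, per the diagram. */
static const int waits_for[NTHREADS] = {
	[T1] = T2, /* OOM killer waits for its victim to exit      */
	[T2] = T3, /* exit path's down_read() queued behind writer */
	[T3] = T4, /* down_write() waits for the active reader     */
	[T4] = T1, /* reader stuck in the OOM loop behind T1       */
};

/* True if following wait-for edges from `start` loops back to it,
 * i.e. the threads are deadlocked. */
static bool in_deadlock_cycle(int start)
{
	int cur = waits_for[start];
	for (int hops = 0; hops < NTHREADS; hops++) {
		if (cur == start)
			return true;
		cur = waits_for[cur];
	}
	return false;
}

/* Same walk, but with thread `broken`'s outgoing edge removed
 * (e.g. T2 no longer waits once its down_read is unfair). */
static bool cycle_with_edge_removed(int start, int broken)
{
	int cur = start;
	for (int hops = 0; hops < NTHREADS; hops++) {
		if (cur == broken)
			return false; /* chain stops: `broken` can run */
		cur = waits_for[cur];
		if (cur == start)
			return true;
	}
	return false;
}
```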
Signed-off-by: Michel Lespinasse <walken@google.com>
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f627779..4b3a1c7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
bad_area_nosemaphore(regs, error_code, address);
return;
}
- down_read(&mm->mmap_sem);
+ if (test_thread_flag(TIF_MEMDIE))
+ down_read_unfair(&mm->mmap_sem);
+ else
+ down_read(&mm->mmap_sem);
} else {
/*
* The above down_read_trylock() might have succeeded in
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 738e659..578d1a7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -357,7 +357,10 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;
- down_read(&mm->mmap_sem);
+ if (unlikely(test_thread_flag(TIF_MEMDIE)))
+ down_read_unfair(&mm->mmap_sem);
+ else
+ down_read(&mm->mmap_sem);
ret = get_user_pages(current, mm, start,
(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
up_read(&mm->mmap_sem);
diff --git a/kernel/exit.c b/kernel/exit.c
index 7f2683a..2318d3a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -664,7 +664,7 @@ static void exit_mm(struct task_struct * tsk)
* will increment ->nr_threads for each thread in the
* group with ->mm != NULL.
*/
- down_read(&mm->mmap_sem);
+ down_read_unfair(&mm->mmap_sem);
core_state = mm->core_state;
if (core_state) {
struct core_thread self;
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: FYI: mmap_sem OOM patch
From: KOSAKI Motohiro @ 2010-07-08 0:16 UTC (permalink / raw)
To: Michel Lespinasse; +Cc: kosaki.motohiro, linux-mm, LKML, Divyesh Shah
Hi
> As I mentioned this in my MM discussion proposal, I am sending the following
> patch for discussion.
>
> This helps some workloads we use, that are normally run with little memory
> headroom and get occasionally pushed over the edge to OOM. In some cases,
> the OOMing thread can't exit because it needs to acquire the mmap_sem,
> while the thread holding mmap_sem can't progress because it needs to
> allocate memory.
>
> IIRC we first encountered this against our 2.6.18 kernel; this has been
> reproduced with 2.6.33 (sorry, we'll have to clean up that test case a bit
> before I can send it); and following patch is against 2.6.34 plus
> down_read_unfair changes (which have been discussed here before, not
> resending them at this point unless someone asks).
>
> Note that while this patch modifies some performance critical MM functions,
> it does make sure to stay out of the fast paths.
>
>
> Use down_read_unfair() in OOM-killed tasks
>
> When tasks are being targeted by OOM killer, we want to use
> down_read_unfair to ensure a fast exit.
Yup.
If admins want to kill a memory-hogging process manually when the
system is under heavy swap thrashing, we will face the same problem
and need the same unfairness and fast exit. So the unfair exiting
design looks very good.
If you would update the description to mention this, I'd be glad :)
>
> - The down_read_unfair() call in fast_get_user_pages() is to cover the
> futex calls in mm_release() (including exit_robust_list(),
> compat_exit_robust_list(), exit_pi_state_list() and sys_futex())
>
> - The down_read_unfair() call in do_page_fault() is to cover the
> put_user() call in mm_release() and exiting the robust futex lists.
I have the opposite question: in which cases should we avoid using
down_read_unfair()? You already changed the fast path, so I guess the
reason is not performance, but I'm not sure of the exact reason.
In other words, can we make the following helper (or something like
it) and convert all callers?

static inline void down_read_mmap_sem(struct mm_struct *mm)
{
	if (unlikely(test_thread_flag(TIF_MEMDIE)))
		down_read_unfair(&mm->mmap_sem);
	else
		down_read(&mm->mmap_sem);
}
>
> This is a rework of a previous change in 2.6.18:
>
> Change the down_read()s in the exit path to be unfair so that we do
> not result in a potentially 3-to-4-way deadlock during the ooms.
>
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2). That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path. It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
>
> T1 T2 T3 T4
> . . . .
> oom: . . .
> oomkill . . .
> ^ \ . . .
> /|\ ----> do_exit: . .
> | sleep in . .
> | read(mmap_sem) . .
> | \ . .
> | ----> mmap .
> | sleep in .
> | write(mmap_sem) .
> | \ .
> | ----> fault
> | holding read(mmap_sem)
> | oom
> | |
> | /
> \----------------------------------------------/
>
> Signed-off-by: Michel Lespinasse <walken@google.com>
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..4b3a1c7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> bad_area_nosemaphore(regs, error_code, address);
> return;
> }
> - down_read(&mm->mmap_sem);
> + if (test_thread_flag(TIF_MEMDIE))
> + down_read_unfair(&mm->mmap_sem);
> + else
> + down_read(&mm->mmap_sem);
> } else {
> /*
> * The above down_read_trylock() might have succeeded in
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 738e659..578d1a7 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -357,7 +357,10 @@ slow_irqon:
> start += nr << PAGE_SHIFT;
> pages += nr;
>
> - down_read(&mm->mmap_sem);
> + if (unlikely(test_thread_flag(TIF_MEMDIE)))
> + down_read_unfair(&mm->mmap_sem);
> + else
> + down_read(&mm->mmap_sem);
> ret = get_user_pages(current, mm, start,
> (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> up_read(&mm->mmap_sem);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 7f2683a..2318d3a 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -664,7 +664,7 @@ static void exit_mm(struct task_struct * tsk)
> * will increment ->nr_threads for each thread in the
> * group with ->mm != NULL.
> */
> - down_read(&mm->mmap_sem);
> + down_read_unfair(&mm->mmap_sem);
> core_state = mm->core_state;
> if (core_state) {
> struct core_thread self;
Agreed, exit_mm() should always use down_read_unfair(). But I think it
needs an explicit comment, please.
btw, MM developers only want to review the mm part. Can you please get
an ack for down_read_unfair() itself from other developers (perhaps
David Howells or someone else)?
* Re: FYI: mmap_sem OOM patch
From: KOSAKI Motohiro @ 2010-07-08 1:11 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: Michel Lespinasse, linux-mm, LKML, Divyesh Shah
> Yup.
> If admins want to kill memory hogging process manually when the system
> is under heavy swap thrashing, we will face the same problem, need
> unfairness and fast exit. So, unfair exiting design looks very good.
>
> If you will updated the description, I'm glad :)
I have one more topic: can we check fatal_signal_pending() instead of
TIF_MEMDIE? As I said, if the process has received SIGKILL, is
fork/exec/brk/mmap starvation really a problem?
Thanks.
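The difference between the two checks can be sketched as a user-space
model (hypothetical struct and helper names): TIF_MEMDIE is set only
on the OOM killer's victim, while a fatal_signal_pending()-style check
would also cover a task an admin killed by hand, since the OOM killer
delivers SIGKILL too.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task state for the two candidate checks. */
struct task_model {
	bool tif_memdie;       /* set by the OOM killer on its victim  */
	bool sigkill_pending;  /* SIGKILL queued (OOM kill OR manual)  */
};

/* Current patch: only OOM-killer victims take the unfair path. */
static bool unfair_if_memdie(const struct task_model *t)
{
	return t->tif_memdie;
}

/* Proposed alternative: any task with SIGKILL pending takes the
 * unfair path, covering the manual-kill-under-thrash case too. */
static bool unfair_if_fatal_signal(const struct task_model *t)
{
	return t->sigkill_pending;
}
```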
* Re: FYI: mmap_sem OOM patch
From: Peter Zijlstra @ 2010-07-08 9:02 UTC (permalink / raw)
To: Michel Lespinasse; +Cc: linux-mm, KOSAKI Motohiro, LKML, Divyesh Shah
On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..4b3a1c7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> bad_area_nosemaphore(regs, error_code, address);
> return;
> }
> - down_read(&mm->mmap_sem);
> + if (test_thread_flag(TIF_MEMDIE))
> + down_read_unfair(&mm->mmap_sem);
> + else
> + down_read(&mm->mmap_sem);
> } else {
> /*
> * The above down_read_trylock() might have succeeded in
I still think adding that _unfair interface is asking for trouble.
* Re: FYI: mmap_sem OOM patch
From: KOSAKI Motohiro @ 2010-07-08 9:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah
> On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
>
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index f627779..4b3a1c7 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > bad_area_nosemaphore(regs, error_code, address);
> > return;
> > }
> > - down_read(&mm->mmap_sem);
> > + if (test_thread_flag(TIF_MEMDIE))
> > + down_read_unfair(&mm->mmap_sem);
> > + else
> > + down_read(&mm->mmap_sem);
> > } else {
> > /*
> > * The above down_read_trylock() might have succeeded in
>
> I still think adding that _unfair interface is asking for trouble.
Can you please explain the trouble you're worried about? Why do we
need to keep reader/writer fairness in the OOM case?
btw, I also dislike the unfair + /proc combination.
* Re: FYI: mmap_sem OOM patch
From: Peter Zijlstra @ 2010-07-08 9:35 UTC (permalink / raw)
To: KOSAKI Motohiro; +Cc: Michel Lespinasse, linux-mm, LKML, Divyesh Shah
On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index f627779..4b3a1c7 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > bad_area_nosemaphore(regs, error_code, address);
> > > return;
> > > }
> > > - down_read(&mm->mmap_sem);
> > > + if (test_thread_flag(TIF_MEMDIE))
> > > + down_read_unfair(&mm->mmap_sem);
> > > + else
> > > + down_read(&mm->mmap_sem);
> > > } else {
> > > /*
> > > * The above down_read_trylock() might have succeeded in
> >
> > I still think adding that _unfair interface is asking for trouble.
>
> Can you please explain trouble that you worry? Why do we need to keep
> thread fairness when OOM case?
Just the whole concept of the unfair thing offends me ;-) I didn't
really look at the particular application in this case.
* Re: FYI: mmap_sem OOM patch
From: KOSAKI Motohiro @ 2010-07-08 9:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kosaki.motohiro, Michel Lespinasse, linux-mm, LKML, Divyesh Shah
> On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> > >
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index f627779..4b3a1c7 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > bad_area_nosemaphore(regs, error_code, address);
> > > > return;
> > > > }
> > > > - down_read(&mm->mmap_sem);
> > > > + if (test_thread_flag(TIF_MEMDIE))
> > > > + down_read_unfair(&mm->mmap_sem);
> > > > + else
> > > > + down_read(&mm->mmap_sem);
> > > > } else {
> > > > /*
> > > > * The above down_read_trylock() might have succeeded in
> > >
> > > I still think adding that _unfair interface is asking for trouble.
> >
> > Can you please explain trouble that you worry? Why do we need to keep
> > thread fairness when OOM case?
>
> Just the whole concept of the unfair thing offends me ;-) I didn't
> really look at the particular application in this case.
I see.
Yup, I agree the unfair concept is a bit ugly. If anyone has an
alternative idea, I'm happy to go with that instead.
* Re: FYI: mmap_sem OOM patch
From: Peter Zijlstra @ 2010-07-08 10:30 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-mm, KOSAKI Motohiro, LKML, Divyesh Shah, Ingo Molnar
On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2). That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path. It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
>
> T1 T2 T3 T4
> . . . .
> oom: . . .
> oomkill . . .
> ^ \ . . .
> /|\ ----> do_exit: . .
> | sleep in . .
> | read(mmap_sem) . .
> | \ . .
> | ----> mmap .
> | sleep in .
> | write(mmap_sem) .
> | \ .
> | ----> fault
> | holding read(mmap_sem)
> | oom
> | |
> | /
> \----------------------------------------------/
So what you do is use recursive locking to side-step a deadlock.
Recursive locking is in poor taste and leads to very ill-defined
locking rules.
One way to fix this is to have T4 wake from the oom queue and return an
allocation failure instead of insisting on going oom itself when T1
decides to take down the task.
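Peter's alternative can be sketched as a user-space model (hypothetical
names, not the actual page allocator logic): the thread in the OOM
queue bails out with an allocation failure as soon as a victim sharing
its mm has been chosen, instead of looping with mmap_sem held for read;
failing drops its read hold, which unblocks T3's write and then T2's
exit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the fail-fast OOM loop for T4. */
struct oom_queue_model {
	bool victim_in_our_mm;  /* OOM killer targeted a sibling thread */
	int  passes;            /* loop iterations actually taken       */
};

/* Returns false for the fail-fast bailout (caller sees an allocation
 * failure); in this model, surviving `budget` passes without a
 * sibling victim counts as a successful allocation. */
static bool oom_loop(struct oom_queue_model *q, int budget)
{
	for (q->passes = 0; q->passes < budget; q->passes++) {
		if (q->victim_in_our_mm)
			return false; /* bail out, release mmap_sem */
		/* ...otherwise retry reclaim and the allocation... */
	}
	return true;
}
```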