[patch] oom: prevent unnecessary oom kills or kernel panics

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch] oom: prevent unnecessary oom kills or kernel panics
@ 2011-03-01 19:09 David Rientjes
  2011-03-03  1:20 ` KOSAKI Motohiro
  0 siblings, 1 reply; 69+ messages in thread
From: David Rientjes @ 2011-03-01 19:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Oleg Nesterov, Hugh Dickins,
	linux-mm

This patch revents unnecessary oom kills or kernel panics by reverting
two commits:

	495789a5 (oom: make oom_score to per-process value)
	cef1d352 (oom: multi threaded process coredump don't make deadlock)

First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.

It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks.  If a process is oom killed and the thread
group leader exits, select_bad_process() cannot detect the other threads
that are PF_EXITING by iterating over only processes.  Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine
when nothing else is eligible.

By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.

Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current isfound to be exiting.  We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no loner oom.  If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.

Without this change, it is easy for the oom killer to unnecessarily
target tasks when all threads of a victim don't exit before the thread
group leader or, in the worst case, panic the machine.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -292,11 +292,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		unsigned long totalpages, struct mem_cgroup *mem,
 		const nodemask_t *nodemask)
 {
-	struct task_struct *p;
+	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	*ppoints = 0;

-	for_each_process(p) {
+	do_each_thread(g, p) {
 		unsigned int points;

 		if (oom_unkillable_task(p, mem, nodemask))
@@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+		if ((p->flags & PF_EXITING) && p->mm) {
 			if (p != current)
 				return ERR_PTR(-1UL);

@@ -337,7 +337,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 			chosen = p;
 			*ppoints = points;
 		}
-	}
+	} while_each_thread(g, p);

 	return chosen;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-01 19:09 [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
@ 2011-03-03  1:20 ` KOSAKI Motohiro
  2011-03-03 19:53   ` David Rientjes
  2011-03-08 13:42   ` Oleg Nesterov
  0 siblings, 2 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-03  1:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov,
	Hugh Dickins, linux-mm

Hi

> This patch revents unnecessary oom kills or kernel panics by reverting
> two commits:
> 
> 	495789a5 (oom: make oom_score to per-process value)
> 	cef1d352 (oom: multi threaded process coredump don't make deadlock)
> 
> First, 495789a5 (oom: make oom_score to per-process value) ignores the
> fact that all threads in a thread group do not necessarily exit at the
> same time.
> 
> It is imperative that select_bad_process() detect threads that are in the
> exit path, specifically those with PF_EXITING set, to prevent needlessly
> killing additional tasks.  

to prevent? No, it is not a reason of PF_EXITING exist.


> If a process is oom killed and the thread
> group leader exits, select_bad_process() cannot detect the other threads
> that are PF_EXITING by iterating over only processes.  Thus, it currently
> chooses another task unnecessarily for oom kill or panics the machine
> when nothing else is eligible.
> 
> By iterating over threads instead, it is possible to detect threads that
> are exiting and nominate them for oom kill so they get access to memory
> reserves.

In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
PF_EXITING is not a sign of memory freeing in nearly future. If other
CPUs don't try to free memory, prevent oom and waiting makes deadlock.

Thus, I suggest to remove PF_EXITING check completely.

> 
> Second, cef1d352 (oom: multi threaded process coredump don't make
> deadlock) erroneously avoids making the oom killer a no-op when an
> eligible thread other than current isfound to be exiting.  We want to
> detect this situation so that we may allow that exiting thread time to
> exit and free its memory; if it is able to exit on its own, that should
> free memory so current is no loner oom.  If it is not able to exit on its
> own, the oom killer will nominate it for oom kill which, in this case,
> only means it will get access to memory reserves.
> 
> Without this change, it is easy for the oom killer to unnecessarily
> target tasks when all threads of a victim don't exit before the thread
> group leader or, in the worst case, panic the machine.
> 

You missed deadlock is more worse than panic. And again, task overkill
is a part of OOM killer design. it is necessary to avoid deadlock. If
you want to change this spec, you need to remove deadlock change at first.


> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/oom_kill.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -292,11 +292,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  		unsigned long totalpages, struct mem_cgroup *mem,
>  		const nodemask_t *nodemask)
>  {
> -	struct task_struct *p;
> +	struct task_struct *g, *p;
>  	struct task_struct *chosen = NULL;
>  	*ppoints = 0;
>  
> -	for_each_process(p) {
> +	do_each_thread(g, p) {
>  		unsigned int points;
>  
>  		if (oom_unkillable_task(p, mem, nodemask))
> @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  		 * the process of exiting and releasing its resources.
>  		 * Otherwise we could get an easy OOM deadlock.
>  		 */
> -		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
> +		if ((p->flags & PF_EXITING) && p->mm) {
>
>  			if (p != current)
>  				return ERR_PTR(-1UL);
>  
> @@ -337,7 +337,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  			chosen = p;
>  			*ppoints = points;
>  		}
> -	}
> +	} while_each_thread(g, p);
>  
>  	return chosen;
>  }



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-03  1:20 ` KOSAKI Motohiro
@ 2011-03-03 19:53   ` David Rientjes
  2011-03-06 11:14     ` KOSAKI Motohiro
  2011-03-08 13:42   ` Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: David Rientjes @ 2011-03-03 19:53 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov, Hugh Dickins,
	linux-mm

On Thu, 3 Mar 2011, KOSAKI Motohiro wrote:

> > This patch revents unnecessary oom kills or kernel panics by reverting
> > two commits:
> > 
> > 	495789a5 (oom: make oom_score to per-process value)
> > 	cef1d352 (oom: multi threaded process coredump don't make deadlock)
> > 
> > First, 495789a5 (oom: make oom_score to per-process value) ignores the
> > fact that all threads in a thread group do not necessarily exit at the
> > same time.
> > 
> > It is imperative that select_bad_process() detect threads that are in the
> > exit path, specifically those with PF_EXITING set, to prevent needlessly
> > killing additional tasks.  
> 
> to prevent? No, it is not a reason of PF_EXITING exist.
> 

It is not the sole reason PF_EXITING exists in the kernel, no.  It was 
used in select_bad_process() to ensure we don't needlessly kill another 
task if an eligible one is already in the exit path.  We want to ensure 
that the oom killer only kills a process getting work done when nothing 
has the potential to free memory in the short term.  It's not a guarantee 
that the PF_EXITING task will free memory, but it has the potential to be 
the last thread pinning the ->mm.

> > By iterating over threads instead, it is possible to detect threads that
> > are exiting and nominate them for oom kill so they get access to memory
> > reserves.
> 
> In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
> PF_EXITING is not a sign of memory freeing in nearly future. If other
> CPUs don't try to free memory, prevent oom and waiting makes deadlock.
> 

It's not a deadlock if a thread is PF_EXITING and isn't stalled by, for 
instance, failed memory allocations.  That's why this patch restores the 
behavior back to what it was previous to cef1d352: if an eligible thread 
is PF_EXITING and is not current, then wait for it; otherwise, if it is 
current, give it access to memory reserves so it can allow the allocation 
to succeed.

> > Second, cef1d352 (oom: multi threaded process coredump don't make
> > deadlock) erroneously avoids making the oom killer a no-op when an
> > eligible thread other than current isfound to be exiting.  We want to
> > detect this situation so that we may allow that exiting thread time to
> > exit and free its memory; if it is able to exit on its own, that should
> > free memory so current is no loner oom.  If it is not able to exit on its
> > own, the oom killer will nominate it for oom kill which, in this case,
> > only means it will get access to memory reserves.
> > 
> > Without this change, it is easy for the oom killer to unnecessarily
> > target tasks when all threads of a victim don't exit before the thread
> > group leader or, in the worst case, panic the machine.
> > 
> 
> You missed deadlock is more worse than panic. And again, task overkill
> is a part of OOM killer design. it is necessary to avoid deadlock. If
> you want to change this spec, you need to remove deadlock change at first.
> 

There is no deadlock being introduced by this patch; if you have an 
example of one, then please show it.  The problem is not just overkill but 
rather panicking the machine when no other eligible processes exist.  We 
have seen this in production quite a few times and we'd like to see this 
patch merged to avoid our machines panicking because the oom killer, by 
your patch, isn't considering threads that are eligible in the exit path 
once their parent has been killed and has exited itself yet memory freeing 
isn't possible yet because the threads still pin the ->mm.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-03 19:53   ` David Rientjes
@ 2011-03-06 11:14     ` KOSAKI Motohiro
  2011-03-06 22:06       ` David Rientjes
  0 siblings, 1 reply; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-06 11:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov,
	Hugh Dickins, linux-mm

> On Thu, 3 Mar 2011, KOSAKI Motohiro wrote:
> 
> > > This patch revents unnecessary oom kills or kernel panics by reverting
> > > two commits:
> > > 
> > > 	495789a5 (oom: make oom_score to per-process value)
> > > 	cef1d352 (oom: multi threaded process coredump don't make deadlock)
> > > 
> > > First, 495789a5 (oom: make oom_score to per-process value) ignores the
> > > fact that all threads in a thread group do not necessarily exit at the
> > > same time.
> > > 
> > > It is imperative that select_bad_process() detect threads that are in the
> > > exit path, specifically those with PF_EXITING set, to prevent needlessly
> > > killing additional tasks.  
> > 
> > to prevent? No, it is not a reason of PF_EXITING exist.
> > 
> 
> It is not the sole reason PF_EXITING exists in the kernel, no.  It was 
> used in select_bad_process() to ensure we don't needlessly kill another 
> task if an eligible one is already in the exit path.  We want to ensure 
> that the oom killer only kills a process getting work done when nothing 
> has the potential to free memory in the short term.  It's not a guarantee 
> that the PF_EXITING task will free memory, but it has the potential to be 
> the last thread pinning the ->mm.
> 
> > > By iterating over threads instead, it is possible to detect threads that
> > > are exiting and nominate them for oom kill so they get access to memory
> > > reserves.
> > 
> > In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
> > PF_EXITING is not a sign of memory freeing in nearly future. If other
> > CPUs don't try to free memory, prevent oom and waiting makes deadlock.
> > 
> 
> It's not a deadlock if a thread is PF_EXITING and isn't stalled by, for 
> instance, failed memory allocations.  That's why this patch restores the 
> behavior back to what it was previous to cef1d352: if an eligible thread 
> is PF_EXITING and is not current, then wait for it; otherwise, if it is 
> current, give it access to memory reserves so it can allow the allocation 
> to succeed.
> 
> > > Second, cef1d352 (oom: multi threaded process coredump don't make
> > > deadlock) erroneously avoids making the oom killer a no-op when an
> > > eligible thread other than current isfound to be exiting.  We want to
> > > detect this situation so that we may allow that exiting thread time to
> > > exit and free its memory; if it is able to exit on its own, that should
> > > free memory so current is no loner oom.  If it is not able to exit on its
> > > own, the oom killer will nominate it for oom kill which, in this case,
> > > only means it will get access to memory reserves.
> > > 
> > > Without this change, it is easy for the oom killer to unnecessarily
> > > target tasks when all threads of a victim don't exit before the thread
> > > group leader or, in the worst case, panic the machine.
> > > 
> > 
> > You missed deadlock is more worse than panic. And again, task overkill
> > is a part of OOM killer design. it is necessary to avoid deadlock. If
> > you want to change this spec, you need to remove deadlock change at first.
> > 
> 
> There is no deadlock being introduced by this patch; if you have an 
> example of one, then please show it.  The problem is not just overkill but 
> rather panicking the machine when no other eligible processes exist.  We 
> have seen this in production quite a few times and we'd like to see this 
> patch merged to avoid our machines panicking because the oom killer, by 
> your patch, isn't considering threads that are eligible in the exit path 
> once their parent has been killed and has exited itself yet memory freeing 
> isn't possible yet because the threads still pin the ->mm.

No. While you don't understand current code, I'll not taking yours.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-06 11:14     ` KOSAKI Motohiro
@ 2011-03-06 22:06       ` David Rientjes
  2011-03-08  0:24         ` KOSAKI Motohiro
  2011-03-08  2:01         ` KOSAKI Motohiro
  0 siblings, 2 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-06 22:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov, Hugh Dickins,
	linux-mm

On Sun, 6 Mar 2011, KOSAKI Motohiro wrote:

> > There is no deadlock being introduced by this patch; if you have an 
> > example of one, then please show it.  The problem is not just overkill but 
> > rather panicking the machine when no other eligible processes exist.  We 
> > have seen this in production quite a few times and we'd like to see this 
> > patch merged to avoid our machines panicking because the oom killer, by 
> > your patch, isn't considering threads that are eligible in the exit path 
> > once their parent has been killed and has exited itself yet memory freeing 
> > isn't possible yet because the threads still pin the ->mm.
> 
> No. While you don't understand current code, I'll not taking yours.
> 

I take this as you declining to show your example of a deadlock introduced 
by this patch as requested.  There is no such deadlock.  The patch is 
reintroducing the behavior of the oom killer that existed for years before 
you broke it and caused many of ours machines to panic as a result.

Thanks for your review.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-06 22:06       ` David Rientjes
@ 2011-03-08  0:24         ` KOSAKI Motohiro
  2011-03-08  2:01         ` KOSAKI Motohiro
  1 sibling, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-08  0:24 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov,
	Hugh Dickins, linux-mm

> On Sun, 6 Mar 2011, KOSAKI Motohiro wrote:
> 
> > > There is no deadlock being introduced by this patch; if you have an 
> > > example of one, then please show it.  The problem is not just overkill but 
> > > rather panicking the machine when no other eligible processes exist.  We 
> > > have seen this in production quite a few times and we'd like to see this 
> > > patch merged to avoid our machines panicking because the oom killer, by 
> > > your patch, isn't considering threads that are eligible in the exit path 
> > > once their parent has been killed and has exited itself yet memory freeing 
> > > isn't possible yet because the threads still pin the ->mm.
> > 
> > No. While you don't understand current code, I'll not taking yours.
> > 
> 
> I take this as you declining to show your example of a deadlock introduced 
> by this patch as requested.  There is no such deadlock.  The patch is 
> reintroducing the behavior of the oom killer that existed for years before 
> you broke it and caused many of ours machines to panic as a result.
> 
> Thanks for your review.

Do I need to talk the same again. I don't take your buggy patch. That's all.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-06 22:06       ` David Rientjes
  2011-03-08  0:24         ` KOSAKI Motohiro
@ 2011-03-08  2:01         ` KOSAKI Motohiro
  1 sibling, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-08  2:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov,
	Hugh Dickins, linux-mm

> On Sun, 6 Mar 2011, KOSAKI Motohiro wrote:
> 
> > > There is no deadlock being introduced by this patch; if you have an 
> > > example of one, then please show it.  The problem is not just overkill but 
> > > rather panicking the machine when no other eligible processes exist.  We 
> > > have seen this in production quite a few times and we'd like to see this 
> > > patch merged to avoid our machines panicking because the oom killer, by 
> > > your patch, isn't considering threads that are eligible in the exit path 
> > > once their parent has been killed and has exited itself yet memory freeing 
> > > isn't possible yet because the threads still pin the ->mm.
> > 
> > No. While you don't understand current code, I'll not taking yours.
> > 
> 
> I take this as you declining to show your example of a deadlock introduced 
> by this patch as requested.  There is no such deadlock.  The patch is 
> reintroducing the behavior of the oom killer that existed for years before 
> you broke it and caused many of ours machines to panic as a result.
> 
> Thanks for your review.

How do you proof no deadlock? No, you can't. Don't pray to work as you hope.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-03  1:20 ` KOSAKI Motohiro
  2011-03-03 19:53   ` David Rientjes
@ 2011-03-08 13:42   ` Oleg Nesterov
  2011-03-08 23:57     ` David Rientjes
  1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-08 13:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On 03/03, KOSAKI Motohiro wrote:
>
> > By iterating over threads instead, it is possible to detect threads that
> > are exiting and nominate them for oom kill so they get access to memory
> > reserves.
>
> In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
> PF_EXITING is not a sign of memory freeing in nearly future. If other
> CPUs don't try to free memory, prevent oom and waiting makes deadlock.

I agree. I don't understand this patch.

And. Instead of moving to for_each_mm() this patch moves the logic back,
to for_each_thread().

> Thus, I suggest to remove PF_EXITING check completely.

Again, this seems better to me but I do not really understand oom
killer's heuristic. Perhaps this check helps with some workloads.

I tried to avoid this discussion because I have nothing new to add,
and the previous discussion was painful. But since this patch was
merged into -mm,

> > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> >  		 * the process of exiting and releasing its resources.
> >  		 * Otherwise we could get an easy OOM deadlock.
> >  		 */
> > -		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
> > +		if ((p->flags & PF_EXITING) && p->mm) {

The previous check was not perfect, we know this.

But with this patch applied, the simple program below disables oom-killer
completely. select_bad_process() can never succeed.

I think this patch should dropped. And another one,

	oom-skip-zombies-when-iterating-tasklist.patch

should be dropped as well. Add Andrey.

Oleg.

#include <unistd.h>
#include <signal.h>
#include <pthread.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <assert.h>
#include <stdio.h>

void *tfunc(void* arg)
{
	pause();
}

int main(void)
{
	int pid = fork();

	if (!pid) {
		pthread_t thread;
		pthread_create(&thread, NULL, tfunc, NULL);
		pthread_create(&thread, NULL, tfunc, NULL);
		ptrace(PTRACE_TRACEME, 0,0,0);
		kill(getpid(), SIGSTOP);
		pthread_kill(thread, SIGQUIT);
		pause();
		return 0;
	}

	assert(wait(NULL) == pid);
	assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT) == 0);
	assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
	wait(NULL);

	pause();

	return 0;
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-08 13:42   ` Oleg Nesterov
@ 2011-03-08 23:57     ` David Rientjes
  2011-03-09 10:36       ` KOSAKI Motohiro
                         ` (2 more replies)
  0 siblings, 3 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-08 23:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Tue, 8 Mar 2011, Oleg Nesterov wrote:

> > > By iterating over threads instead, it is possible to detect threads that
> > > are exiting and nominate them for oom kill so they get access to memory
> > > reserves.
> >
> > In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
> > PF_EXITING is not a sign of memory freeing in nearly future. If other
> > CPUs don't try to free memory, prevent oom and waiting makes deadlock.
> 
> I agree. I don't understand this patch.
> 

Using for_each_process() does not consider threads that have failed to 
exit after the oom killed parent and, thus, we select another innocent 
task to kill when we're really just waiting for those threads to exit (and 
perhaps they need memory reserves in the exit path) or, in the worst case, 
panic if there is nothing else eligible.

The end result is that without this patch, we sometimes unnecessarily 
panic (and "sometimes" is defined as "many machines" for us) when nothing 
else is eligible for kill within an oom cpuset yet doing a 
do_each_thread() over that cpuset shows threads of previously oom killed 
parent that have yet to exit.

> > > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> > >  		 * the process of exiting and releasing its resources.
> > >  		 * Otherwise we could get an easy OOM deadlock.
> > >  		 */
> > > -		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
> > > +		if ((p->flags & PF_EXITING) && p->mm) {
> 
> The previous check was not perfect, we know this.
> 
> But with this patch applied, the simple program below disables oom-killer
> completely. select_bad_process() can never succeed.
> 

The program illustrates a problem that shouldn't be fixed in 
select_bad_process() but rather in oom_kill_process() when choosing an 
eligible child of the selected task to kill in place of its parent.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-08 23:57     ` David Rientjes
@ 2011-03-09 10:36       ` KOSAKI Motohiro
  2011-03-09 11:06       ` Oleg Nesterov
  2011-03-09 23:19       ` Andrew Morton
  2 siblings, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-09 10:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	Hugh Dickins, linux-mm, Andrey Vagin

> On Tue, 8 Mar 2011, Oleg Nesterov wrote:
> 
> > > > By iterating over threads instead, it is possible to detect threads that
> > > > are exiting and nominate them for oom kill so they get access to memory
> > > > reserves.
> > >
> > > In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
> > > PF_EXITING is not a sign of memory freeing in nearly future. If other
> > > CPUs don't try to free memory, prevent oom and waiting makes deadlock.
> > 
> > I agree. I don't understand this patch.
> > 
> 
> Using for_each_process() does not consider threads that have failed to 
> exit after the oom killed parent and, thus, we select another innocent 
> task to kill when we're really just waiting for those threads to exit (and 
> perhaps they need memory reserves in the exit path) or, in the worst case, 
> panic if there is nothing else eligible.
> 
> The end result is that without this patch, we sometimes unnecessarily 
> panic (and "sometimes" is defined as "many machines" for us) when nothing 
> else is eligible for kill within an oom cpuset yet doing a 
> do_each_thread() over that cpuset shows threads of previously oom killed 
> parent that have yet to exit.

Only your workload careness don't make any excuse to break other worlds.
Google workload specific patch should be maintained in their. We only want 
to discuss to make world happy.

Guys, Why don't run the test program. Oleg did spnet some times for you.



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-08 23:57     ` David Rientjes
  2011-03-09 10:36       ` KOSAKI Motohiro
@ 2011-03-09 11:06       ` Oleg Nesterov
  2011-03-09 20:32         ` David Rientjes
  2011-03-09 23:19       ` Andrew Morton
  2 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-09 11:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On 03/08, David Rientjes wrote:
>
> On Tue, 8 Mar 2011, Oleg Nesterov wrote:
>
> > > > By iterating over threads instead, it is possible to detect threads that
> > > > are exiting and nominate them for oom kill so they get access to memory
> > > > reserves.
> > >
> > > In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore
> > > PF_EXITING is not a sign of memory freeing in nearly future. If other
> > > CPUs don't try to free memory, prevent oom and waiting makes deadlock.
> >
> > I agree. I don't understand this patch.
> >
>
> Using for_each_process() does not consider threads that have failed to
> exit after the oom killed parent and, thus, we select another innocent
> task to kill when we're really just waiting for those threads to exit

How so? select_bad_process() checks TIF_MEMDIE and returns ERR_PTR()
if it is set.

And, exactly because we use for_each_process() we do not need to check
other threads. The main thread can't disappear until they all exit.

Imho TIF_MEMDIE is not perfect and should be replaced by MMF_, but this
is another story. Hmm... and in any case, currently TIF_MEMDIE is not
always used correctly, afaics.

> The end result is that without this patch, we sometimes unnecessarily
> panic (and "sometimes" is defined as "many machines" for us) when nothing
> else is eligible for kill within an oom cpuset yet doing a
> do_each_thread() over that cpuset shows threads of previously oom killed
> parent that have yet to exit.
>
> > > > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> > > >  		 * the process of exiting and releasing its resources.
> > > >  		 * Otherwise we could get an easy OOM deadlock.
> > > >  		 */
> > > > -		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
> > > > +		if ((p->flags & PF_EXITING) && p->mm) {
> >
> > The previous check was not perfect, we know this.
> >
> > But with this patch applied, the simple program below disables oom-killer
> > completely. select_bad_process() can never succeed.
> >
>
> The program illustrates a problem that shouldn't be fixed in
> select_bad_process() but rather in oom_kill_process() when choosing an
> eligible child of the selected task to kill in place of its parent.

Can't understand. oom_kill_process() is never called exactly because
select_bad_process() is fooled.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-09 11:06       ` Oleg Nesterov
@ 2011-03-09 20:32         ` David Rientjes
  2011-03-10 12:05           ` Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: David Rientjes @ 2011-03-09 20:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Wed, 9 Mar 2011, Oleg Nesterov wrote:

> > Using for_each_process() does not consider threads that have failed to
> > exit after the oom killed parent and, thus, we select another innocent
> > task to kill when we're really just waiting for those threads to exit
> 
> How so? select_bad_process() checks TIF_MEMDIE and returns ERR_PTR()
> if it is set.
> 

TIF_MEMDIE is quite obviously a per-thread flag that only gets set for the 
oom killed task.  That leader may exit and leave behind several other 
threads that cannot be detected in a subsequent oom killer call using 
for_each_process().  We instead want to identify those threads and target 
them as well so that they may die and free memory.

> And, exactly because we use for_each_process() we do not need to check
> other threads. The main thread can't disappear until they all exit.
> 

That's obviously false, otherwise we wouldn't have lots of panics because 
there are no other eligible processes found using for_each_process() yet 
there are eligible threads using do_each_thread().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-09 20:32         ` David Rientjes
@ 2011-03-10 12:05           ` Oleg Nesterov
  2011-03-10 15:40             ` [PATCH 0/1] Was: " Oleg Nesterov
  2011-03-13  1:06             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
  0 siblings, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 12:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On 03/09, David Rientjes wrote:
>
> On Wed, 9 Mar 2011, Oleg Nesterov wrote:
>
> > > Using for_each_process() does not consider threads that have failed to
> > > exit after the oom killed parent and, thus, we select another innocent
> > > task to kill when we're really just waiting for those threads to exit
> >
> > How so? select_bad_process() checks TIF_MEMDIE and returns ERR_PTR()
> > if it is set.
> >
>
> TIF_MEMDIE is quite obviously a per-thread flag

Yes, and this is why I think it should be replaced.

> That leader may exit and leave behind several other
> threads

No, it can't.

More precisely, it can, and it can even exit _before_ this process starts
to use a lot of memory, then later this process can be oom-killed.

But, until all threads disappear, the leader can't go away and
for_each_process() must see it.

IOW. If for_each_process() doesn't see the leader, there are no threads
from its thread group. If it does see, the process is not dead yet. It
may be exiting, and the current check tries to detect this case but it
is not perfect. And btw "if (p != current)" code is wrong.

I told this many times. This was one of the reasons why initial ->mm != NULL
in select_bad_process() were completely wrong. Now we are going to re-introduce
them, although Andrey's patch is not that wrong (but should be dropped anyway
imho).

However,

> > And, exactly because we use for_each_process() we do not need to check
> > other threads. The main thread can't disappear until they all exit.
> >
>
> That's obviously false,

I still can't convince you ;)

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/1] Was: oom: prevent unnecessary oom kills or kernel panics
  2011-03-10 12:05           ` Oleg Nesterov
@ 2011-03-10 15:40             ` Oleg Nesterov
  2011-03-10 15:41               ` [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
  2011-03-10 16:36               ` [PATCH 0/1] select_bad_process: improve the PF_EXITING check Oleg Nesterov
  2011-03-13  1:06             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
  1 sibling, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 15:40 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

On 03/10, Oleg Nesterov wrote:
>
> On 03/09, David Rientjes wrote:
> >
> > On Wed, 9 Mar 2011, Oleg Nesterov wrote:
> >
> > > > Using for_each_process() does not consider threads that have failed to
> > > > exit after the oom killed parent and, thus, we select another innocent
> > > > task to kill when we're really just waiting for those threads to exit
> > >
> > > How so? select_bad_process() checks TIF_MEMDIE and returns ERR_PTR()
> > > if it is set.
> > >
> >
> > TIF_MEMDIE is quite obviously a per-thread flag

> Yes, and this is why I think it should be replaced.

But this is not simple. I'd suggest this patch as the first step.

It is not tested. I do not pretend I really understand oom-killer in
all details (although oom_kill.c itself is quite trivial).

The patch assumes that

	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
	oom-skip-zombies-when-iterating-tasklist.patch

are dropped.

I think, this patch __might__ address the problems described in the
changelog, but of course I am not sure.

I am relying on your and Kosaki's review, if you disagree with this
change I won't argue.

But the current usage of TIF_MEMDIE can't be right.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
  2011-03-10 15:40             ` [PATCH 0/1] Was: " Oleg Nesterov
@ 2011-03-10 15:41               ` Oleg Nesterov
  2011-03-13  1:08                 ` David Rientjes
  2011-03-10 16:36               ` [PATCH 0/1] select_bad_process: improve the PF_EXITING check Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 15:41 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

oom_kill_task() kills the whole thread group, but only one task gets
TIF_MEMDIE. This can't be right, every test_thread_flag(TIF_MEMDIE)
check is per-thread.

I think it should be replaced by MMF_, but as a first step let's change
oom_kill_task() to mark every sub-thread as TIF_MEMDIE. And change the
"Kill all processes sharing p->mm" code the same way.

This also fixes another problem. sysctl_oom_kill_allocating_task case
does oom_kill_process(current). If current is not the main thread, then
select_bad_process() won't see the TIF_MEMDIE task.

Note:

	- oom_kill_task()->for_each_process() is wrong. It can't detect
	  all processes sharing p->mm. The fix is simple

	- This patch doesn't change other callers of set_*(TIF_MEMDIE).
	  This needs a separate discussion, but oom_kill_process() is
	  simply wrong.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |   17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

--- 38/mm/oom_kill.c~oom_kill_spread_memdie	2011-03-08 14:45:49.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-10 16:08:51.000000000 +0100
@@ -401,6 +401,17 @@ static void dump_header(struct task_stru
 		dump_tasks(mem, nodemask);
 }
 
+static void do_oom_kill(struct task_struct *p)
+{
+	struct task_struct *t;
+
+	do {
+		set_tsk_thread_flag(t, TIF_MEMDIE);
+	} while_each_thread(p, t);
+
+	force_sig(SIGKILL, p);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 {
@@ -436,12 +447,10 @@ static int oom_kill_task(struct task_str
 			pr_err("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(q), q->comm);
 			task_unlock(q);
-			force_sig(SIGKILL, q);
+			do_oom_kill(q);
 		}
 
-	set_tsk_thread_flag(p, TIF_MEMDIE);
-	force_sig(SIGKILL, p);
-
+	do_oom_kill(p);
 	/*
 	 * We give our sacrificial lamb high priority and access to
 	 * all the memory it needs. That way it should be able to

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
  2011-03-10 15:41               ` [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
@ 2011-03-13  1:08                 ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-13  1:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Thu, 10 Mar 2011, Oleg Nesterov wrote:

> --- 38/mm/oom_kill.c~oom_kill_spread_memdie	2011-03-08 14:45:49.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-10 16:08:51.000000000 +0100
> @@ -401,6 +401,17 @@ static void dump_header(struct task_stru
>  		dump_tasks(mem, nodemask);
>  }
>  
> +static void do_oom_kill(struct task_struct *p)
> +{
> +	struct task_struct *t;
> +
> +	do {
> +		set_tsk_thread_flag(t, TIF_MEMDIE);
> +	} while_each_thread(p, t);
> +
> +	force_sig(SIGKILL, p);
> +}
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
>  {
> @@ -436,12 +447,10 @@ static int oom_kill_task(struct task_str
>  			pr_err("Kill process %d (%s) sharing same memory\n",
>  				task_pid_nr(q), q->comm);
>  			task_unlock(q);
> -			force_sig(SIGKILL, q);
> +			do_oom_kill(q);
>  		}
>  
> -	set_tsk_thread_flag(p, TIF_MEMDIE);
> -	force_sig(SIGKILL, p);
> -
> +	do_oom_kill(p);
>  	/*
>  	 * We give our sacrificial lamb high priority and access to
>  	 * all the memory it needs. That way it should be able to
> 
> 

This isn't appropriate: we specifically try to limit TIF_MEMDIE to only 
threads that need it (those that need it in the exit path), otherwise we 
risk completely depleting all memory, which would result in an oom 
deadlock.  TIF_MEMDIE isn't supposed to be used as a flag to detect oom 
killed task despite its use in select_bad_process() -- if a thread needs 
access to memory reserves after receiving a SIGKILL then out_of_memory() 
provides that appropriately.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/1] select_bad_process: improve the PF_EXITING check
  2011-03-10 15:40             ` [PATCH 0/1] Was: " Oleg Nesterov
  2011-03-10 15:41               ` [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
@ 2011-03-10 16:36               ` Oleg Nesterov
  2011-03-10 16:37                 ` [PATCH 1/1] " Oleg Nesterov
  2011-03-10 16:40                 ` [PATCH 0/1] " Oleg Nesterov
  1 sibling, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 16:36 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

On 03/10, Oleg Nesterov wrote:
>
> It is not tested. I do not pretend I really understand oom-killer in
> all details (although oom_kill.c itself is quite trivial).
>
> The patch assumes that
>
> 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> 	oom-skip-zombies-when-iterating-tasklist.patch
>
> are dropped.
>
> I think, this patch __might__ address the problems described in the
> changelog, but of course I am not sure.
>
> I am relying on your and Kosaki's review, if you disagree with this
> change I won't argue.

And another uncompiled/untested/needs_review patch which might help.

Nobody ever argued, the current PF_EXITING check is not very good.
I do not know how much it is useful, but we can at least improve it.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/1] select_bad_process: improve the PF_EXITING check
  2011-03-10 16:36               ` [PATCH 0/1] select_bad_process: improve the PF_EXITING check Oleg Nesterov
@ 2011-03-10 16:37                 ` Oleg Nesterov
  2011-03-10 16:40                 ` [PATCH 0/1] " Oleg Nesterov
  1 sibling, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 16:37 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

The current PF_EXITING check in select_bad_process() is very limited,
it only works if the task is single-threaded.

Add the new helper which tries to handle the mt case. It is not exactly
clear what should we actually check in this case. This patch checks
PF_EXITING, but perhaps we can take signal_group_exit() into account.
In this case select_bad_process() could detect the exiting process even
before every thread calls exit().

Note: "if (p != current)" check is obviously wrong in mt case too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~detect_exiting	2011-03-10 16:08:51.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-10 17:21:40.000000000 +0100
@@ -282,6 +282,21 @@ static enum oom_constraint constrained_a
 }
 #endif
 
+static bool mm_is_exiting(struct task_struct *p)
+{
+	struct task_struct *t;
+	bool has_mm = false;
+
+	do {
+		if (!(p->flags & PF_EXITING))
+			return false;
+		if (t->mm)
+			has_mm = true;
+	} while_each_thread(p, t);
+
+	return has_mm;
+}
+
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
@@ -324,7 +339,7 @@ static struct task_struct *select_bad_pr
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+		if (mm_is_exiting(p)) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/1] select_bad_process: improve the PF_EXITING check
  2011-03-10 16:36               ` [PATCH 0/1] select_bad_process: improve the PF_EXITING check Oleg Nesterov
  2011-03-10 16:37                 ` [PATCH 1/1] " Oleg Nesterov
@ 2011-03-10 16:40                 ` Oleg Nesterov
  2011-03-10 17:18                   ` [PATCH v2 " Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 16:40 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

On 03/10, Oleg Nesterov wrote:
>
> On 03/10, Oleg Nesterov wrote:
> >
> > It is not tested. I do not pretend I really understand oom-killer in
> > all details (although oom_kill.c itself is quite trivial).
> >
> > The patch assumes that
> >
> > 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> > 	oom-skip-zombies-when-iterating-tasklist.patch
> >
> > are dropped.
> >
> > I think, this patch __might__ address the problems described in the
> > changelog, but of course I am not sure.
> >
> > I am relying on your and Kosaki's review, if you disagree with this
> > change I won't argue.
>
> And another uncompiled/untested/needs_review patch which might help.
              ^^^^^^^^^^

I meant, compile-tested only ;)

> Nobody ever argued, the current PF_EXITING check is not very good.
> I do not know how much it is useful, but we can at least improve it.
>
> Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v2 0/1] select_bad_process: improve the PF_EXITING check
  2011-03-10 16:40                 ` [PATCH 0/1] " Oleg Nesterov
@ 2011-03-10 17:18                   ` Oleg Nesterov
  2011-03-10 17:19                     ` [PATCH v2 1/1] " Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 17:18 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

On 03/10, Oleg Nesterov wrote:
>
> On 03/10, Oleg Nesterov wrote:
> >
> > On 03/10, Oleg Nesterov wrote:
> > >
> > > It is not tested. I do not pretend I really understand oom-killer in
> > > all details (although oom_kill.c itself is quite trivial).
> > >
> > > The patch assumes that
> > >
> > > 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> > > 	oom-skip-zombies-when-iterating-tasklist.patch
> > >
> > > are dropped.
> > >
> > > I think, this patch __might__ address the problems described in the
> > > changelog, but of course I am not sure.
> > >
> > > I am relying on your and Kosaki's review, if you disagree with this
> > > change I won't argue.
> >
> > And another uncompiled/untested/needs_review patch which might help.
>               ^^^^^^^^^^
>
> I meant, compile-tested only ;)
>
> > Nobody ever argued, the current PF_EXITING check is not very good.
> > I do not know how much it is useful, but we can at least improve it.

Argh. Sorry for noise, I am sending v2. Changes:

	- fix the typo in mm_is_exiting(), s/p/t/

	- update the changelog

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check
  2011-03-10 17:18                   ` [PATCH v2 " Oleg Nesterov
@ 2011-03-10 17:19                     ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-10 17:19 UTC (permalink / raw)
  To: David Rientjes, KOSAKI Motohiro
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

The current PF_EXITING check in select_bad_process() is very limited,
it only works if the task is single-threaded.

Add the new helper which tries to handle the mt case. It is not exactly
clear what should we actually check in this case. This patch checks
PF_EXITING, but perhaps we can take signal_group_exit() into account.
In this case select_bad_process() could detect the exiting process even
before every thread calls exit().

Note:

	- "if (p != current)" check is obviously wrong in mt case too.

	- with or without this change, we should probably check
	  mm->core_state == NULL. We shouldn't assume that is is going
	  to exit "soon" otherwise. But this needs other changes anyway,
	  and in the common case when we do not share ->mm with another
	  process the false positive is not possible.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~detect_exiting	2011-03-10 16:08:51.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-10 18:12:35.000000000 +0100
@@ -282,6 +282,21 @@ static enum oom_constraint constrained_a
 }
 #endif
 
+static bool mm_is_exiting(struct task_struct *p)
+{
+	struct task_struct *t;
+	bool has_mm = false;
+
+	do {
+		if (!(t->flags & PF_EXITING))
+			return false;
+		if (t->mm)
+			has_mm = true;
+	} while_each_thread(p, t);
+
+	return has_mm;
+}
+
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
@@ -324,7 +339,7 @@ static struct task_struct *select_bad_pr
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+		if (mm_is_exiting(p)) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-10 12:05           ` Oleg Nesterov
  2011-03-10 15:40             ` [PATCH 0/1] Was: " Oleg Nesterov
@ 2011-03-13  1:06             ` David Rientjes
  1 sibling, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-13  1:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Thu, 10 Mar 2011, Oleg Nesterov wrote:

> > That leader may exit and leave behind several other
> > threads
> 
> No, it can't.
> 
> More precisely, it can, and it can even exit _before_ this process starts
> to use a lot of memory, then later this process can be oom-killed.
> 
> But, until all threads disappear, the leader can't go away and
> for_each_process() must see it.
> 

for_each_process() sees the parent, but it is filtered because we no 
longer consider threads without an ->mm.  We only want to pass threads 
with valid ->mm pointers to oom_badness(), otherwise it ignores the thread 
anyway.  Please note that Andrey's patch to filter !p->mm is nothing new, 
it's more of a cleanup.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-08 23:57     ` David Rientjes
  2011-03-09 10:36       ` KOSAKI Motohiro
  2011-03-09 11:06       ` Oleg Nesterov
@ 2011-03-09 23:19       ` Andrew Morton
  2011-03-11 19:45         ` David Rientjes
  2 siblings, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2011-03-09 23:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Oleg Nesterov, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Tue, 8 Mar 2011 15:57:36 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> > > > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> > > >  		 * the process of exiting and releasing its resources.
> > > >  		 * Otherwise we could get an easy OOM deadlock.
> > > >  		 */
> > > > -		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
> > > > +		if ((p->flags & PF_EXITING) && p->mm) {
> > 
> > The previous check was not perfect, we know this.
> > 
> > But with this patch applied, the simple program below disables oom-killer
> > completely. select_bad_process() can never succeed.
> > 
> 
> The program illustrates a problem that shouldn't be fixed in 
> select_bad_process() but rather in oom_kill_process() when choosing an 
> eligible child of the selected task to kill in place of its parent.

If Oleg's test program cause a hang with
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
cause a hang without
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
big problem for
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-09 23:19       ` Andrew Morton
@ 2011-03-11 19:45         ` David Rientjes
  2011-03-12 12:34           ` Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: David Rientjes @ 2011-03-11 19:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Wed, 9 Mar 2011, Andrew Morton wrote:

> If Oleg's test program cause a hang with
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
> cause a hang without
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
> big problem for
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?
> 

It's a problem, but not because of 
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.  If we don't 
have this patch, then we have a trivial panic when an oom kill occurs in a 
cpuset with no other eligible processes, the oom killed thread group 
leader exits but its other threads do not and they trigger oom kills 
themselves.  for_each_process() does not iterate over these threads and so 
it finds no eligible threads to kill and then panics (and we have many 
examples of that happening in production).  I'll look at Oleg's test case 
and see what can be done to fix that condition, but the answer isn't to 
ignore eligible threads that can be killed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-11 19:45         ` David Rientjes
@ 2011-03-12 12:34           ` Oleg Nesterov
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
  2011-03-13  1:11             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
  0 siblings, 2 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 12:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On 03/11, David Rientjes wrote:
>
> On Wed, 9 Mar 2011, Andrew Morton wrote:
>
> > If Oleg's test program cause a hang with
> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
> > cause a hang without
> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
> > big problem for
> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?
> >
>
> It's a problem, but not because of
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.

It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm
thread is going to free the memory.

> If we don't
> have this patch, then we have a trivial panic when an oom kill occurs in a
> cpuset with no other eligible processes, the oom killed thread group
> leader exits

It is not clear what "leader exits" actually mean. OK, perhaps you mean
its ->mm == NULL.

> but its other threads do not and they trigger oom kills
> themselves.  for_each_process() does not iterate over these threads and so
> it finds no eligible threads to kill and then panics

Could you explain what do you mean? No need to kill these threads, they
are already killed, we should wait until they all exit.

> I'll look at Oleg's test case
> and see what can be done to fix that condition, but the answer isn't to
> ignore eligible threads that can be killed.

Once again, they are already killed. Or I do not understand what you meant.

Could you please explain the problem in more details?


Also. Could you please look at the patches I sent?

	[PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
	[PATCH v2 1/1] select_bad_process: improve the PF_EXITING check

Note also the note about "p == current" check. it should be fixed too.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
  2011-03-12 12:34           ` Oleg Nesterov
@ 2011-03-12 13:43             ` Oleg Nesterov
  2011-03-12 13:44               ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
                                 ` (4 more replies)
  2011-03-13  1:11             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
  1 sibling, 5 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin, David Rientjes

On 03/12, Oleg Nesterov wrote:
>
> On 03/11, David Rientjes wrote:
> >
> > On Wed, 9 Mar 2011, Andrew Morton wrote:
> >
> > > If Oleg's test program cause a hang with
> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
> > > cause a hang without
> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
> > > big problem for
> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?
> > >
> >
> > It's a problem, but not because of
> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.
>
> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm
> thread is going to free the memory.
>
> > If we don't
> > have this patch, then we have a trivial panic when an oom kill occurs in a
> > cpuset with no other eligible processes, the oom killed thread group
> > leader exits
>
> It is not clear what "leader exits" actually mean. OK, perhaps you mean
> its ->mm == NULL.
>
> > but its other threads do not and they trigger oom kills
> > themselves.  for_each_process() does not iterate over these threads and so
> > it finds no eligible threads to kill and then panics
>
> Could you explain what do you mean? No need to kill these threads, they
> are already killed, we should wait until they all exit.
>
> > I'll look at Oleg's test case
> > and see what can be done to fix that condition, but the answer isn't to
> > ignore eligible threads that can be killed.
>
> Once again, they are already killed. Or I do not understand what you meant.
>
> Could you please explain the problem in more details?
>
>
> Also. Could you please look at the patches I sent?
>
> 	[PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
> 	[PATCH v2 1/1] select_bad_process: improve the PF_EXITING check

Cough. And both were not right, while_each_thread(p, t) needs the properly
initialized "t". At least I warned they were not tested ;)

> Note also the note about "p == current" check. it should be fixed too.

I am resending the fixes above plus the new one.

David, Kosaki, what do you think?

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
@ 2011-03-12 13:44               ` Oleg Nesterov
  2011-03-13  1:14                 ` David Rientjes
  2011-03-12 13:44               ` [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check Oleg Nesterov
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin, David Rientjes

oom_kill_task() kills the whole thread group, but only one task gets
TIF_MEMDIE. This can't be right, every test_thread_flag(TIF_MEMDIE)
check is per-thread.

I think it should be replaced by MMF_, but as a first step let's change
oom_kill_task() to mark every sub-thread as TIF_MEMDIE. And change the
"Kill all processes sharing p->mm" code the same way.

This also fixes another problem. sysctl_oom_kill_allocating_task case
does oom_kill_process(current). If current is not the main thread, then
select_bad_process() won't see the TIF_MEMDIE task.

Note:

	- oom_kill_task()->for_each_process() is wrong. It can't detect
	  all processes sharing p->mm. The fix is simple

	- This patch doesn't change other callers of set_*(TIF_MEMDIE).
	  This needs a separate discussion, but oom_kill_process() is
	  simply wrong.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

--- 38/mm/oom_kill.c~oom_kill_spread_memdie	2011-03-12 14:19:36.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-12 14:20:42.000000000 +0100
@@ -401,6 +401,18 @@ static void dump_header(struct task_stru
 		dump_tasks(mem, nodemask);
 }
 
+static void do_oom_kill(struct task_struct *p)
+{
+	struct task_struct *t;
+
+	t = p;
+	do {
+		set_tsk_thread_flag(t, TIF_MEMDIE);
+	} while_each_thread(p, t);
+
+	force_sig(SIGKILL, p);
+}
+
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
 {
@@ -436,12 +448,10 @@ static int oom_kill_task(struct task_str
 			pr_err("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(q), q->comm);
 			task_unlock(q);
-			force_sig(SIGKILL, q);
+			do_oom_kill(q);
 		}
 
-	set_tsk_thread_flag(p, TIF_MEMDIE);
-	force_sig(SIGKILL, p);
-
+	do_oom_kill(p);
 	/*
 	 * We give our sacrificial lamb high priority and access to
 	 * all the memory it needs. That way it should be able to

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE
  2011-03-12 13:44               ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
@ 2011-03-13  1:14                 ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-13  1:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Sat, 12 Mar 2011, Oleg Nesterov wrote:

> --- 38/mm/oom_kill.c~oom_kill_spread_memdie	2011-03-12 14:19:36.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-12 14:20:42.000000000 +0100
> @@ -401,6 +401,18 @@ static void dump_header(struct task_stru
>  		dump_tasks(mem, nodemask);
>  }
>  
> +static void do_oom_kill(struct task_struct *p)
> +{
> +	struct task_struct *t;
> +
> +	t = p;
> +	do {
> +		set_tsk_thread_flag(t, TIF_MEMDIE);
> +	} while_each_thread(p, t);
> +
> +	force_sig(SIGKILL, p);
> +}
> +
>  #define K(x) ((x) << (PAGE_SHIFT-10))
>  static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
>  {
> @@ -436,12 +448,10 @@ static int oom_kill_task(struct task_str
>  			pr_err("Kill process %d (%s) sharing same memory\n",
>  				task_pid_nr(q), q->comm);
>  			task_unlock(q);
> -			force_sig(SIGKILL, q);
> +			do_oom_kill(q);
>  		}
>  
> -	set_tsk_thread_flag(p, TIF_MEMDIE);
> -	force_sig(SIGKILL, p);
> -
> +	do_oom_kill(p);
>  	/*
>  	 * We give our sacrificial lamb high priority and access to
>  	 * all the memory it needs. That way it should be able to

As mentioned in the first posting of this patch, this isn't appropraite: 
we don't want to set TIF_MEMDIE to all threads unless they can't reclaim 
memory in the page allocator and are still in an oom condition.  The point 
is to try to limit TIF_MEMDIE to only those threads that are guaranteed to 
already be in the exit path or handling SIGKILLs, otherwise we risk 
depleting memory reserves entirely and there's no sanity checking here 
that would suggest that can't happen.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
  2011-03-12 13:44               ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
@ 2011-03-12 13:44               ` Oleg Nesterov
  2011-03-12 13:44               ` [PATCH 3/3] oom: select_bad_process: use same_thread_group() Oleg Nesterov
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin, David Rientjes

The current PF_EXITING check in select_bad_process() is very limited,
it only works if the task is single-threaded.

Add the new helper which tries to handle the mt case. It is not exactly
clear what should we actually check in this case. This patch checks
PF_EXITING, but perhaps we can take signal_group_exit() into account.
In this case select_bad_process() could detect the exiting process even
before every thread calls exit().

Note:

	- "if (p != current)" check is obviously wrong in mt case too.

	- with or without this change, we should probably check
	  mm->core_state == NULL. We shouldn't assume that is is going
	  to exit "soon" otherwise. But this needs other changes anyway,
	  and in the common case when we do not share ->mm with another
	  process the false positive is not possible.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~detect_exiting	2011-03-12 14:20:42.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-12 14:22:19.000000000 +0100
@@ -282,6 +282,22 @@ static enum oom_constraint constrained_a
 }
 #endif
 
+static bool mm_is_exiting(struct task_struct *p)
+{
+	struct task_struct *t;
+	bool has_mm = false;
+
+	t = p;
+	do {
+		if (!(t->flags & PF_EXITING))
+			return false;
+		if (t->mm)
+			has_mm = true;
+	} while_each_thread(p, t);
+
+	return has_mm;
+}
+
 /*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
@@ -324,7 +340,7 @@ static struct task_struct *select_bad_pr
 		 * the process of exiting and releasing its resources.
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
-		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+		if (mm_is_exiting(p)) {
 			if (p != current)
 				return ERR_PTR(-1UL);
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/3] oom: select_bad_process: use same_thread_group()
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
  2011-03-12 13:44               ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
  2011-03-12 13:44               ` [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check Oleg Nesterov
@ 2011-03-12 13:44               ` Oleg Nesterov
  2011-03-12 19:40               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
  2011-03-13 11:36               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes KOSAKI Motohiro
  4 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin, David Rientjes

select_bad_process() checks if the exiting process is current. This
condition can never be true if the caller is not the main thread.
Use same_thread_group() instead.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 38/mm/oom_kill.c~fix_ck_current	2011-03-12 14:22:19.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-12 14:27:14.000000000 +0100
@@ -341,7 +341,7 @@ static struct task_struct *select_bad_pr
 		 * Otherwise we could get an easy OOM deadlock.
 		 */
 		if (mm_is_exiting(p)) {
-			if (p != current)
+			if (!same_thread_group(p, current))
 				return ERR_PTR(-1UL);
 
 			chosen = p;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
                                 ` (2 preceding siblings ...)
  2011-03-12 13:44               ` [PATCH 3/3] oom: select_bad_process: use same_thread_group() Oleg Nesterov
@ 2011-03-12 19:40               ` Hugh Dickins
  2011-03-13  8:53                 ` KOSAKI Motohiro
  2011-03-13 21:27                 ` Oleg Nesterov
  2011-03-13 11:36               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes KOSAKI Motohiro
  4 siblings, 2 replies; 69+ messages in thread
From: Hugh Dickins @ 2011-03-12 19:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm,
	Andrey Vagin, David Rientjes

Hi Oleg,

On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/12, Oleg Nesterov wrote:
>>
>> On 03/11, David Rientjes wrote:
>> >
>> > On Wed, 9 Mar 2011, Andrew Morton wrote:
>> >
>> > > If Oleg's test program cause a hang with
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
>> > > cause a hang without
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
>> > > big problem for
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?
>> > >
>> >
>> > It's a problem, but not because of
>> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.
>>
>> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm
>> thread is going to free the memory.
>>
>> > If we don't
>> > have this patch, then we have a trivial panic when an oom kill occurs in a
>> > cpuset with no other eligible processes, the oom killed thread group
>> > leader exits
>>
>> It is not clear what "leader exits" actually mean. OK, perhaps you mean
>> its ->mm == NULL.
>>
>> > but its other threads do not and they trigger oom kills
>> > themselves.  for_each_process() does not iterate over these threads and so
>> > it finds no eligible threads to kill and then panics
>>
>> Could you explain what do you mean? No need to kill these threads, they
>> are already killed, we should wait until they all exit.
>>
>> > I'll look at Oleg's test case
>> > and see what can be done to fix that condition, but the answer isn't to
>> > ignore eligible threads that can be killed.
>>
>> Once again, they are already killed. Or I do not understand what you meant.
>>
>> Could you please explain the problem in more details?
>>
>>
>> Also. Could you please look at the patches I sent?
>>
>>       [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
>>       [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check
>
> Cough. And both were not right, while_each_thread(p, t) needs the properly
> initialized "t". At least I warned they were not tested ;)
>
>> Note also the note about "p == current" check. it should be fixed too.
>
> I am resending the fixes above plus the new one.

I've spent much of the week building up to join in, but the more I
look around, the more I find to say or investigate, and therefore
never quite get to write the mail.  Let this be a placeholder, that I
probably disagree (in an amicable way!) with all of you, and maybe
I'll finally manage to collect my thoughts into mail later today.

I guess my main point will be that TIF_MEMDIE serves a number of
slightly different, even conflicting, purposes; and one of those
purposes, which present company seems to ignore repeatedly, is to
serialize access to final reserves of memory - as a comment by Nick in
select_bad_process() makes clear. (This is distinct from the
serialization to avoid OOM-killing rampage.)

We _might_ choose to abandon that, but if so, it should be a decision,
not an oversight.  So I cannot blindly agree with just setting
TIF_MEMDIE on more and more tasks, even if they share the same mm.  I
wonder if use of your find_lock_task_mm() in   select_bad_process()
might bring together my wish to continue serialization, David's wish
to avoid stupid panics, and your wish to avoid deadlocks.

Though any serialization has a risk of deadlock: we probably need to
weigh up how realistic different cases are.   Which brings me neatly
to your little pthread_create, ptrace proggy... I dare say you and
Kosaki and David know exactly what it's doing and why it's a problem,
but even after repeated skims of the ptrace manpage,  I'll admit to
not having a clue, nor the inclination to run and then debug it to
find out the answer.  Please, Oleg, would you mind very much
explaining it to me? I don't even know if the double pthread_create is
a vital part of the scheme, or just a typo.  I see it doesn't even
allocate any special memory, so I assume it leaves a PF_EXITING around
forever, but I couldn't quite see how (with PF_EXITING being set after
the tracehook_report_exit).  And I wonder if a similar case can be
constructed to deadlock the for_each_process version of
select_bad_process().

I worry more about someone holding a reference to the mm via /proc (I
see memory allocations after getting the mm).

Thanks; until later,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
  2011-03-12 19:40               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
@ 2011-03-13  8:53                 ` KOSAKI Motohiro
  2011-03-13 21:27                 ` Oleg Nesterov
  1 sibling, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-13  8:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: kosaki.motohiro, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
	linux-mm, Andrey Vagin, David Rientjes

> I've spent much of the week building up to join in, but the more I
> look around, the more I find to say or investigate, and therefore
> never quite get to write the mail.  Let this be a placeholder, that I
> probably disagree (in an amicable way!) with all of you, and maybe
> I'll finally manage to collect my thoughts into mail later today.
> 
> I guess my main point will be that TIF_MEMDIE serves a number of
> slightly different, even conflicting, purposes; and one of those
> purposes, which present company seems to ignore repeatedly, is to
> serialize access to final reserves of memory - as a comment by Nick in
> select_bad_process() makes clear. (This is distinct from the
> serialization to avoid OOM-killing rampage.)
>
> We _might_ choose to abandon that, but if so, it should be a decision,
> not an oversight.  So I cannot blindly agree with just setting
> TIF_MEMDIE on more and more tasks, even if they share the same mm.  I
> wonder if use of your find_lock_task_mm() in   select_bad_process()
> might bring together my wish to continue serialization, David's wish
> to avoid stupid panics, and your wish to avoid deadlocks.
> 
> Though any serialization has a risk of deadlock: we probably need to
> weigh up how realistic different cases are.   Which brings me neatly
> to your little pthread_create, ptrace proggy... I dare say you and
> Kosaki and David know exactly what it's doing and why it's a problem,
> but even after repeated skims of the ptrace manpage,  I'll admit to
> not having a clue, nor the inclination to run and then debug it to
> find out the answer.  Please, Oleg, would you mind very much
> explaining it to me? I don't even know if the double pthread_create is
> a vital part of the scheme, or just a typo.  I see it doesn't even
> allocate any special memory, so I assume it leaves a PF_EXITING around
> forever, but I couldn't quite see how (with PF_EXITING being set after
> the tracehook_report_exit).  And I wonder if a similar case can be
> constructed to deadlock the for_each_process version of
> select_bad_process().
> 
> I worry more about someone holding a reference to the mm via /proc (I
> see memory allocations after getting the mm).
> 
> Thanks; until later,
> Hugh

Hi Hugh,

Andrew Vagin showed us actual deadlock case in "[PATCH] mm: check 
zone->all_unreclaimable in all_unreclaimable()" thread. and I'm
now digging it. Unfortunatelly the cause is not single. then, I
couln't explained how exact event was occur and what should we do.

However, I can say, at least, TIF_MEMDIE is one of root cause of the
andrey's deadlock. I could observed TIF_MEMDIE process never die and
all other process continue to wait. then, the system become hang up.

I'm still continue to investigate and I wish I find minimum negative 
impact solution, but I'm not wonder I finally decide to abandon both 
TIF_MEMDIE and boost_dying_task_prio. The hang-up can be reproduced 
too easily.

Unfortunatelly, my country is under slightly rare natural disaster
and I don't think I'm going to concentrate a debug activity awhile.
I'm sorry for the lazy activity.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
  2011-03-12 19:40               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
  2011-03-13  8:53                 ` KOSAKI Motohiro
@ 2011-03-13 21:27                 ` Oleg Nesterov
  2011-03-14 19:04                     ` Oleg Nesterov
  1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-13 21:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm,
	Andrey Vagin, David Rientjes

Hi Hugh,

On 03/12, Hugh Dickins wrote:
>
> On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> Also. Could you please look at the patches I sent?
> >>
> >>       [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
> >>       [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check
> >
> > Cough. And both were not right, while_each_thread(p, t) needs the properly
> > initialized "t". At least I warned they were not tested ;)
> >
> >> Note also the note about "p == current" check. it should be fixed too.
> >
> > I am resending the fixes above plus the new one.
>
> I've spent much of the week building up to join in, but the more I
> look around, the more I find to say or investigate, and therefore
> never quite get to write the mail.  Let this be a placeholder, that I
> probably disagree (in an amicable way!) with all of you, and maybe
> I'll finally manage to collect my thoughts into mail later today.

Thanks for looking into this!

Before I say anything else, I'd like to repeat that I am not going
to argue with the nack. The only reason I sent these patches is that
I can not understand David's patch at all. I mean, I can not even
understand the problems it should address. Yet I think the patch is
not right.

So I just tried to guess why this change helps, and suggest the
alternatives for review/testing.

> I guess my main point will be that TIF_MEMDIE serves a number of
> slightly different, even conflicting, purposes;

Yes. And let me repeat, I do not pretend understand it. However,
I bet the usage of TIF_MEMDIE is wrong.

> and one of those
> purposes, which present company seems to ignore repeatedly, is to
> serialize access to final reserves of memory

I don't quite understand "serialize" above. I hope you didn't mean
that only one thread can have TIF_MEMDIE to avoid the races...

But yes, I understand that "a lot" of TIF_MEMDIE thread can abuse
__alloc_pages_high_priority/etc. At least, I hope I understand ;)

- as a comment by Nick in
> select_bad_process() makes clear.

Oh. This comment is not clear at all.

		 * This task already has access to memory reserves and is
		 * being killed.

which task? it is quite possible that this task is already dead/released.
Or it is just exiting without memory allocations. This is group leader.
All we know is that it has task_struct, no more.

		 * Don't allow any other task access to the
		 * memory reserve.

Which other tasks? What about the sub-threads of the killed process?
It is quite possible that another thread triggered OOM and needs
TIF_MEMDIE to access the memory reserves.

And even this is not consistent. sysctl_oom_kill_allocating_task
does oom_kill_process(current) which may be non-leader.

oom_kill_process()->set_tsk_thread_flag(p, TIF_MEMDIE) is simply wrong.
The _trivial_ exploit (distinct from this one) can kill the system. And
worse! I showed this exploit many times (most probably off-list but all
were cc'ed). This was already fixed (iirc), and know I see we have it
again. This because almost every PF_EXITING check in oom_kill.c is wrong.
I'll return to this tomorrow.

> We _might_ choose to abandon that, but if so, it should be a decision,
> not an oversight.  So I cannot blindly agree with just setting
> TIF_MEMDIE on more and more tasks,

OK, lets not do this.

> wonder if use of your find_lock_task_mm() in   select_bad_process()
> might bring together my wish to continue serialization, David's wish
> to avoid stupid panics, and your wish to avoid deadlocks.

Hmm. Could you explain?

> but even after repeated skims of the ptrace manpage,  I'll admit to
> not having a clue, nor the inclination to run and then debug it to
> find out the answer.

Ahh, sorry. I didn't explain what this test-case does... Because we
discussed this many times before, iirc.

> I don't even know if the double pthread_create is
> a vital part of the scheme

it is,

> so I assume it leaves a PF_EXITING around
> forever,

More precisely, it creates the PF_EXITING thread with ->mm != NULL.
It never goes away (unless you kill the tracer, of course).

> but I couldn't quite see how (with PF_EXITING being set after
> the tracehook_report_exit).

Yes. But note that it creates 2 threads, and the second one hangs
in exit_mm() before clearing ->mm while the 3rd thread can waits
for the 1st one which should enter exit_mm() and participate.

But this is not important, you can ignore the actual details.

> And I wonder if a similar case can be
> constructed to deadlock the for_each_process version of
> select_bad_process().

Unfortunately yes. This is documented in the changelog of 2/2.
We should fix the problems with the coredumps. Hmm, we already
had some patches, but they were forgotten/ignored. But at least
we shouldn't move back and assume that "PF_EXITING && mm" means
"we will have more memory soon".

I'll return tomorrow. At first glance, the new patch from David
has the same problem, but I am not sure.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/3 for 2.6.38] oom: fixes
  2011-03-13 21:27                 ` Oleg Nesterov
@ 2011-03-14 19:04                     ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.

I am going to give up, I do not understand the changes in -mm ;)

But I think we have other problems which should be fixed first.
Note that each fix has the "this is _not_ enough" comment. I was
going to _try_ to do something more complete later, but since we
are going to add more hacks^W subtle changes... lets fix the obvious
bugs first.

Add Linus. Hopefully the changes are simple enough. But once again,
we need more.

Oleg.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/3 for 2.6.38] oom: fixes
@ 2011-03-14 19:04                     ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.

I am going to give up, I do not understand the changes in -mm ;)

But I think we have other problems which should be fixed first.
Note that each fix has the "this is _not_ enough" comment. I was
going to _try_ to do something more complete later, but since we
are going to add more hacks^W subtle changes... lets fix the obvious
bugs first.

Add Linus. Hopefully the changes are simple enough. But once again,
we need more.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:04                     ` Oleg Nesterov
@ 2011-03-14 19:04                       ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
This is very wrong by many reasons. In particular, this thread can
be the dead group leader. Check p->mm != NULL.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 38/mm/oom_kill.c~1_kill_fix_pf_exiting	2011-03-14 17:53:05.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:51:49.000000000 +0100
@@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING && p->mm) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		boost_dying_task_prio(p, mem);
 		return 0;


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 19:04                       ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
This is very wrong by many reasons. In particular, this thread can
be the dead group leader. Check p->mm != NULL.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 38/mm/oom_kill.c~1_kill_fix_pf_exiting	2011-03-14 17:53:05.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:51:49.000000000 +0100
@@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING && p->mm) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		boost_dying_task_prio(p, mem);
 		return 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:04                       ` Oleg Nesterov
@ 2011-03-14 19:35                         ` Linus Torvalds
  -1 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.

Explain more, please. Maybe I'm missing some context because I wasn't
cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
and exit_mm() is called almost immediately afterwards which will set
p->mm to NULL.

So afaik, this will basically just remove the whole point of the code
entirely - so why not remove it then?

The combination of testing PF_EXITING and p->mm just doesn't seem to
make any sense.

                      Linus

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 19:35                         ` Linus Torvalds
  0 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.

Explain more, please. Maybe I'm missing some context because I wasn't
cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
and exit_mm() is called almost immediately afterwards which will set
p->mm to NULL.

So afaik, this will basically just remove the whole point of the code
entirely - so why not remove it then?

The combination of testing PF_EXITING and p->mm just doesn't seem to
make any sense.

                      Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:35                         ` Linus Torvalds
@ 2011-03-14 20:31                           ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, Linus Torvalds wrote:
>
> On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
>
> Explain more, please. Maybe I'm missing some context because I wasn't
> cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
> and exit_mm() is called almost immediately afterwards which will set
> p->mm to NULL.
>
> So afaik, this will basically just remove the whole point of the code
> entirely - so why not remove it then?

I am afraid I am going to lie... But iirc I tried to remove this code
before. Can't find the previous discussion, probably I am wrong.

Anyway. I never understood why do we have this special case.

> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.

To me, it doesn't make too much sense even if we do not check ->mm.

But. I _think_ the intent was to wait until this "exiting" process
does exit_mm() and frees the memory. This is like the
"the process of releasing memory " code in select_bad_process(). Once
again, this is only my speculation.

In any case, this patch doesn't pretend to be the right fix.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 20:31                           ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, Linus Torvalds wrote:
>
> On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
>
> Explain more, please. Maybe I'm missing some context because I wasn't
> cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
> and exit_mm() is called almost immediately afterwards which will set
> p->mm to NULL.
>
> So afaik, this will basically just remove the whole point of the code
> entirely - so why not remove it then?

I am afraid I am going to lie... But iirc I tried to remove this code
before. Can't find the previous discussion, probably I am wrong.

Anyway. I never understood why do we have this special case.

> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.

To me, it doesn't make too much sense even if we do not check ->mm.

But. I _think_ the intent was to wait until this "exiting" process
does exit_mm() and frees the memory. This is like the
"the process of releasing memory " code in select_bad_process(). Once
again, this is only my speculation.

In any case, this patch doesn't pretend to be the right fix.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:35                         ` Linus Torvalds
@ 2011-03-14 20:32                           ` David Rientjes
  -1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Linus Torvalds wrote:

> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
> 

Right, it doesn't (and I recently removed testing the combination from 
select_bad_process() in -mm).  The check for PF_EXITING in the oom killer 
is purely to avoid needlessly killing tasks when something is already 
exiting and will (hopefully) be freeing its memory soon.  If an eligible 
thread is found to be PF_EXITING, the oom killer will defer indefinitely 
unless it happens to be current.  If it happens to be current, then it is 
automatically selected so it gets access to the needed memory reserves.

We do need to ensure that behavior doesn't preempt any task from being 
killed if there's an eligible thread other than current that never 
actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch 
in -mm filters all threads without an ->mm).  That can happen if 
mm->mmap_sem never gets released by a thread and that's why an earlier 
change that is already in your tree automatically gives current access to 
memory reserves immediately upon calling the oom killer if it has a 
pending SIGKILL.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 20:32                           ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Linus Torvalds wrote:

> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
> 

Right, it doesn't (and I recently removed testing the combination from 
select_bad_process() in -mm).  The check for PF_EXITING in the oom killer 
is purely to avoid needlessly killing tasks when something is already 
exiting and will (hopefully) be freeing its memory soon.  If an eligible 
thread is found to be PF_EXITING, the oom killer will defer indefinitely 
unless it happens to be current.  If it happens to be current, then it is 
automatically selected so it gets access to the needed memory reserves.

We do need to ensure that behavior doesn't preempt any task from being 
killed if there's an eligible thread other than current that never 
actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch 
in -mm filters all threads without an ->mm).  That can happen if 
mm->mmap_sem never gets released by a thread and that's why an earlier 
change that is already in your tree automatically gives current access to 
memory reserves immediately upon calling the oom killer if it has a 
pending SIGKILL.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 20:32                           ` David Rientjes
@ 2011-03-15 19:12                             ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Linus Torvalds wrote:
>
> > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > make any sense.
> >
>
> Right, it doesn't (and I recently removed testing the combination from
> select_bad_process() in -mm).  The check for PF_EXITING in the oom killer
> is purely to avoid needlessly killing tasks when something is already
> exiting

Maybe 0/3 wasn't clear enough. This patches does not try to fix things,
it only tries to close the hole in 2.6.38. But it was already released
today.

> and will (hopefully) be freeing its memory soon.

This is not clear to me.

When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
and the changelog says:

	__oom_kill_task() is called to elevate the task's timeslice and give it
	access to memory reserves so that it may quickly exit.

	This privilege is unnecessary, however, if the task has already detached
	its mm.

Now you are saing this is pointless.

OK. I already said I do not understand this special case. Perhaps I'll ask
the questions later.

> If an eligible
> thread is found to be PF_EXITING,

The problem is, we can't trust per-thread PF_EXITING checks. But I guess
we will discuss this more anyway.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 19:12                             ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Linus Torvalds wrote:
>
> > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > make any sense.
> >
>
> Right, it doesn't (and I recently removed testing the combination from
> select_bad_process() in -mm).  The check for PF_EXITING in the oom killer
> is purely to avoid needlessly killing tasks when something is already
> exiting

Maybe 0/3 wasn't clear enough. This patches does not try to fix things,
it only tries to close the hole in 2.6.38. But it was already released
today.

> and will (hopefully) be freeing its memory soon.

This is not clear to me.

When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
and the changelog says:

	__oom_kill_task() is called to elevate the task's timeslice and give it
	access to memory reserves so that it may quickly exit.

	This privilege is unnecessary, however, if the task has already detached
	its mm.

Now you are saing this is pointless.

OK. I already said I do not understand this special case. Perhaps I'll ask
the questions later.

> If an eligible
> thread is found to be PF_EXITING,

The problem is, we can't trust per-thread PF_EXITING checks. But I guess
we will discuss this more anyway.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-15 19:12                             ` Oleg Nesterov
@ 2011-03-15 19:51                               ` David Rientjes
  -1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Tue, 15 Mar 2011, Oleg Nesterov wrote:

> When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
> and the changelog says:
> 
> 	__oom_kill_task() is called to elevate the task's timeslice and give it
> 	access to memory reserves so that it may quickly exit.
> 
> 	This privilege is unnecessary, however, if the task has already detached
> 	its mm.
> 
> Now you are saing this is pointless.
> 

If you have the commit id, do a "git blame 8123681022", because I see a:

5081dde3 (Nick Piggin        2006-09-25 23:31:32 -0700 222)             if (!p->mm)
5081dde3 (Nick Piggin        2006-09-25 23:31:32 -0700 223)                     continue;

in select_bad_process() and it's also iterating over every thread:

a49335cc (Paul Jackson       2005-09-06 15:18:09 -0700 215)     do_each_thread(g, p) {

It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in 
-mm reintroduced the filter for !p->mm in select_bad_process() which was 
still there when 81236810 was merged; it's a small optimization, though, 
to avoid races where the mm becomes detached between the process' 
selection in select_bad_process() and its kill in oom_kill_process().

> The problem is, we can't trust per-thread PF_EXITING checks. But I guess
> we will discuss this more anyway.
> 

My approach, as you saw with 
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in 
-mm is to add exceptions to the oom killer when we can't trust that 
PF_EXITING will soon be exiting.  I think that's a much more long-term 
maintainable solution instead of inferring the status of a thread based on 
external circumstances (such as number of threads in the thread group) 
that could easily change out from under us and once again break the oom 
killer.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 19:51                               ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Tue, 15 Mar 2011, Oleg Nesterov wrote:

> When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
> and the changelog says:
> 
> 	__oom_kill_task() is called to elevate the task's timeslice and give it
> 	access to memory reserves so that it may quickly exit.
> 
> 	This privilege is unnecessary, however, if the task has already detached
> 	its mm.
> 
> Now you are saing this is pointless.
> 

If you have the commit id, do a "git blame 8123681022", because I see a:

5081dde3 (Nick Piggin        2006-09-25 23:31:32 -0700 222)             if (!p->mm)
5081dde3 (Nick Piggin        2006-09-25 23:31:32 -0700 223)                     continue;

in select_bad_process() and it's also iterating over every thread:

a49335cc (Paul Jackson       2005-09-06 15:18:09 -0700 215)     do_each_thread(g, p) {

It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in 
-mm reintroduced the filter for !p->mm in select_bad_process() which was 
still there when 81236810 was merged; it's a small optimization, though, 
to avoid races where the mm becomes detached between the process' 
selection in select_bad_process() and its kill in oom_kill_process().

> The problem is, we can't trust per-thread PF_EXITING checks. But I guess
> we will discuss this more anyway.
> 

My approach, as you saw with 
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in 
-mm is to add exceptions to the oom killer when we can't trust that 
PF_EXITING will soon be exiting.  I think that's a much more long-term 
maintainable solution instead of inferring the status of a thread based on 
external circumstances (such as number of threads in the thread group) 
that could easily change out from under us and once again break the oom 
killer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:04                       ` Oleg Nesterov
@ 2011-03-14 20:22                         ` David Rientjes
  -1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
> 

This is true only for the oom_kill_allocating_task sysctl where it is 
required in all cases to kill current; current won't be triggering the oom 
killer if it's dead.

oom_kill_process() is called with the thread selected by 
select_bad_process() and that function will not return any thread if any 
eligible task is found to be PF_EXITING and is not current, or any 
eligible task is found to have TIF_MEMDIE.

In other words, for this conditional to be true in oom_kill_process(), 
then p must be current and so it cannot be the dead group leader as 
specified in your changelog unless PF_EXITING gets set between 
select_bad_process() and the oom_kill_process() call: we don't care about 
that since it's in the exit path and we therefore want to give it access 
to memory reserves to quickly exit anyway and the check for PF_EXITING in 
select_bad_process() prevents any infinite loop of that task getting 
constantly reselected if it's dead.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting	2011-03-14 17:53:05.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:51:49.000000000 +0100
> @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING && p->mm) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		boost_dying_task_prio(p, mem);
>  		return 0;
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 20:22                         ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
> 

This is true only for the oom_kill_allocating_task sysctl where it is 
required in all cases to kill current; current won't be triggering the oom 
killer if it's dead.

oom_kill_process() is called with the thread selected by 
select_bad_process() and that function will not return any thread if any 
eligible task is found to be PF_EXITING and is not current, or any 
eligible task is found to have TIF_MEMDIE.

In other words, for this conditional to be true in oom_kill_process(), 
then p must be current and so it cannot be the dead group leader as 
specified in your changelog unless PF_EXITING gets set between 
select_bad_process() and the oom_kill_process() call: we don't care about 
that since it's in the exit path and we therefore want to give it access 
to memory reserves to quickly exit anyway and the check for PF_EXITING in 
select_bad_process() prevents any infinite loop of that task getting 
constantly reselected if it's dead.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting	2011-03-14 17:53:05.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:51:49.000000000 +0100
> @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING && p->mm) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		boost_dying_task_prio(p, mem);
>  		return 0;
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 20:22                         ` David Rientjes
@ 2011-03-15 18:53                           ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
> >
>
> This is true only for the oom_kill_allocating_task sysctl where it is
> required in all cases to kill current; current won't be triggering the oom
> killer if it's dead.
>
> oom_kill_process() is called with the thread selected by
> select_bad_process() and that function will not return any thread if any
> eligible task is found to be PF_EXITING and is not current, or any
> eligible task is found to have TIF_MEMDIE.
>
> In other words, for this conditional to be true in oom_kill_process(),
> then p must be current and so it cannot be the dead group leader as
> specified in your changelog unless PF_EXITING gets set between
> select_bad_process() and the oom_kill_process() call: we don't care about
> that since it's in the exit path and we therefore want to give it access
> to memory reserves to quickly exit anyway and the check for PF_EXITING in
> select_bad_process() prevents any infinite loop of that task getting
> constantly reselected if it's dead.

Confused. I sent the test-case. OK, may be you meant the code in -mm,
but I meant the current code.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 18:53                           ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
> >
>
> This is true only for the oom_kill_allocating_task sysctl where it is
> required in all cases to kill current; current won't be triggering the oom
> killer if it's dead.
>
> oom_kill_process() is called with the thread selected by
> select_bad_process() and that function will not return any thread if any
> eligible task is found to be PF_EXITING and is not current, or any
> eligible task is found to have TIF_MEMDIE.
>
> In other words, for this conditional to be true in oom_kill_process(),
> then p must be current and so it cannot be the dead group leader as
> specified in your changelog unless PF_EXITING gets set between
> select_bad_process() and the oom_kill_process() call: we don't care about
> that since it's in the exit path and we therefore want to give it access
> to memory reserves to quickly exit anyway and the check for PF_EXITING in
> select_bad_process() prevents any infinite loop of that task getting
> constantly reselected if it's dead.

Confused. I sent the test-case. OK, may be you meant the code in -mm,
but I meant the current code.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-15 18:53                           ` Oleg Nesterov
@ 2011-03-15 19:54                             ` David Rientjes
  -1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Tue, 15 Mar 2011, Oleg Nesterov wrote:

> Confused. I sent the test-case. OK, may be you meant the code in -mm,
> but I meant the current code.
> 

This entire discussion, and your involvement in it, originated from these 
two patches being merged into -mm:

	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
	oom-skip-zombies-when-iterating-tasklist.patch

So naturally I'm going to challenge your testcases with the latest -mm.  
If you wanted to suggest pushing these to 2.6.38 earlier, I don't think 
anyone would have disputed that -- I certainly wouldn't have since the 
first fixes a quite obvious panic that we've faced on a lot of our 
machines.  It's not that big of a deal, though, since Andrew has targeted 
them for -stable and they're on schedule to be pushed to 2.6.38.x

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 19:54                             ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Tue, 15 Mar 2011, Oleg Nesterov wrote:

> Confused. I sent the test-case. OK, may be you meant the code in -mm,
> but I meant the current code.
> 

This entire discussion, and your involvement in it, originated from these 
two patches being merged into -mm:

	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
	oom-skip-zombies-when-iterating-tasklist.patch

So naturally I'm going to challenge your testcases with the latest -mm.  
If you wanted to suggest pushing these to 2.6.38 earlier, I don't think 
anyone would have disputed that -- I certainly wouldn't have since the 
first fixes a quite obvious panic that we've faced on a lot of our 
machines.  It's not that big of a deal, though, since Andrew has targeted 
them for -stable and they're on schedule to be pushed to 2.6.38.x

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-15 19:54                             ` David Rientjes
@ 2011-03-15 21:16                               ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/15, David Rientjes wrote:
>
> On Tue, 15 Mar 2011, Oleg Nesterov wrote:
>
> > Confused. I sent the test-case. OK, may be you meant the code in -mm,
> > but I meant the current code.
> >
>
> This entire discussion, and your involvement in it, originated from these
> two patches being merged into -mm:
>
> 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> 	oom-skip-zombies-when-iterating-tasklist.patch

Yes. This motivated me to look at the current code, and I was unpleasantly
surprised.

> So naturally I'm going to challenge your testcases with the latest -mm.

Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot
about the second patch.

> If you wanted to suggest pushing these to 2.6.38 earlier,

Yes. It was too late for 2.6.38. I thought we have more time before
release.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 21:16                               ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/15, David Rientjes wrote:
>
> On Tue, 15 Mar 2011, Oleg Nesterov wrote:
>
> > Confused. I sent the test-case. OK, may be you meant the code in -mm,
> > but I meant the current code.
> >
>
> This entire discussion, and your involvement in it, originated from these
> two patches being merged into -mm:
>
> 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> 	oom-skip-zombies-when-iterating-tasklist.patch

Yes. This motivated me to look at the current code, and I was unpleasantly
surprised.

> So naturally I'm going to challenge your testcases with the latest -mm.

Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot
about the second patch.

> If you wanted to suggest pushing these to 2.6.38 earlier,

Yes. It was too late for 2.6.38. I thought we have more time before
release.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
  2011-03-14 19:04                     ` Oleg Nesterov
@ 2011-03-14 19:05                       ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away it its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		    !p->exit_state && thread_group_empty(p))
 			return ERR_PTR(-1UL);
 
 		/*


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
@ 2011-03-14 19:05                       ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away it its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		    !p->exit_state && thread_group_empty(p))
 			return ERR_PTR(-1UL);
 
 		/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
  2011-03-14 19:05                       ` Oleg Nesterov
@ 2011-03-14 20:50                         ` David Rientjes
  -1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away it its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
> 

The equivalent of this change would be to set TIF_MEMDIE for all threads 
in a thread group when choosing a process to kill; as we've already 
discussed in your first series of patches, that has the risk of fully 
depleting memory reserves and causing the kernel the deadlock.  We want to 
limit TIF_MEMDIE to an oom killed task or to current when it is responding 
to a SIGKILL or already in the exit path because we know it's exiting and 
without memory reserves it may never exit.

This patch is even more concerning, however, because select_bad_process() 
isn't even guaranteed to select a thread from the same thread group this 
time.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> +		    !p->exit_state && thread_group_empty(p))
>  			return ERR_PTR(-1UL);
>  
>  		/*
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
@ 2011-03-14 20:50                         ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away it its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
> 

The equivalent of this change would be to set TIF_MEMDIE for all threads 
in a thread group when choosing a process to kill; as we've already 
discussed in your first series of patches, that has the risk of fully 
depleting memory reserves and causing the kernel the deadlock.  We want to 
limit TIF_MEMDIE to an oom killed task or to current when it is responding 
to a SIGKILL or already in the exit path because we know it's exiting and 
without memory reserves it may never exit.

This patch is even more concerning, however, because select_bad_process() 
isn't even guaranteed to select a thread from the same thread group this 
time.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> +		    !p->exit_state && thread_group_empty(p))
>  			return ERR_PTR(-1UL);
>  
>  		/*
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
  2011-03-14 19:04                     ` Oleg Nesterov
@ 2011-03-14 19:05                       ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.

Also, "children has a different mm" doesn't match the reality, we
should check child->mm != t->mm. This check is not exactly correct
if t->mm == NULL but this doesn't really matter, oom_kill_task()
will kill them anyway.

Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~3_fix_kill_chld	2011-03-14 18:52:39.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 19:36:01.000000000 +0100
@@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim = p;
+	struct task_struct *victim;
 	struct task_struct *child;
-	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	struct task_struct *t;
+	unsigned int victim_points;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	victim_points = oom_badness(p, mem, nodemask, totalpages);
+	victim = p;
+	t = p;
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
+			if (child->mm == t->mm)
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
@ 2011-03-14 19:05                       ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.

Also, "children has a different mm" doesn't match the reality, we
should check child->mm != t->mm. This check is not exactly correct
if t->mm == NULL but this doesn't really matter, oom_kill_task()
will kill them anyway.

Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~3_fix_kill_chld	2011-03-14 18:52:39.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 19:36:01.000000000 +0100
@@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim = p;
+	struct task_struct *victim;
 	struct task_struct *child;
-	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	struct task_struct *t;
+	unsigned int victim_points;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	victim_points = oom_badness(p, mem, nodemask, totalpages);
+	victim = p;
+	t = p;
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
+			if (child->mm == t->mm)
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
  2011-03-14 19:05                       ` Oleg Nesterov
@ 2011-03-14 20:41                         ` David Rientjes
  -1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
> 
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
> 
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
> 

There're two issues you're addressing in this patch.  It only kills a 
child in place of its selected parent when:

 - the child has a higher badness score, and

 - it has a different ->mm.

In the former case, NACK, we always want to sacrifice children regardless 
of their badness score (as long as it is non-zero) if it has a separate 
->mm in place of its parent, otherwise webservers will be killed instead 
of one of their children serving a client, sshd could be killed instead of 
bash, etc.  The behavior of the oom killer has always been to try to kill 
a child with its own ->mm first to avoid losing a large amount of work 
being done or unnecessarily killing a job scheduler, for example, when 
sacrificing a child would be satisfactory.  It'll kill additional tasks, 
and perhaps even the parent later if it has no more children, if the oom 
condition persists.

In the latter case, I agree, we should be testing if the child has a 
different ->mm before sacrificing it for its parent as the comment 
indicates it will.  I proposed that exact change in "oom: avoid deferring 
oom killer if exiting task is being traced" posted to -mm a couple days 
ago.

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~3_fix_kill_chld	2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
>  			    struct mem_cgroup *mem, nodemask_t *nodemask,
>  			    const char *message)
>  {
> -	struct task_struct *victim = p;
> +	struct task_struct *victim;
>  	struct task_struct *child;
> -	struct task_struct *t = p;
> -	unsigned int victim_points = 0;
> +	struct task_struct *t;
> +	unsigned int victim_points;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> +	victim_points = oom_badness(p, mem, nodemask, totalpages);
> +	victim = p;
> +	t = p;
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
>  
> +			if (child->mm == t->mm)
> +				continue;
>  			/*
>  			 * oom_badness() returns 0 if the thread is unkillable
>  			 */
> 
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
@ 2011-03-14 20:41                         ` David Rientjes
  0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
> 
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
> 
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
> 

There're two issues you're addressing in this patch.  It only kills a 
child in place of its selected parent when:

 - the child has a higher badness score, and

 - it has a different ->mm.

In the former case, NACK, we always want to sacrifice children regardless 
of their badness score (as long as it is non-zero) if it has a separate 
->mm in place of its parent, otherwise webservers will be killed instead 
of one of their children serving a client, sshd could be killed instead of 
bash, etc.  The behavior of the oom killer has always been to try to kill 
a child with its own ->mm first to avoid losing a large amount of work 
being done or unnecessarily killing a job scheduler, for example, when 
sacrificing a child would be satisfactory.  It'll kill additional tasks, 
and perhaps even the parent later if it has no more children, if the oom 
condition persists.

In the latter case, I agree, we should be testing if the child has a 
different ->mm before sacrificing it for its parent as the comment 
indicates it will.  I proposed that exact change in "oom: avoid deferring 
oom killer if exiting task is being traced" posted to -mm a couple days 
ago.

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~3_fix_kill_chld	2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
>  			    struct mem_cgroup *mem, nodemask_t *nodemask,
>  			    const char *message)
>  {
> -	struct task_struct *victim = p;
> +	struct task_struct *victim;
>  	struct task_struct *child;
> -	struct task_struct *t = p;
> -	unsigned int victim_points = 0;
> +	struct task_struct *t;
> +	unsigned int victim_points;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> +	victim_points = oom_badness(p, mem, nodemask, totalpages);
> +	victim = p;
> +	t = p;
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
>  
> +			if (child->mm == t->mm)
> +				continue;
>  			/*
>  			 * oom_badness() returns 0 if the thread is unkillable
>  			 */
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
  2011-03-14 20:41                         ` David Rientjes
@ 2011-03-15 19:21                           ` Oleg Nesterov
  -1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() starts with victim_points == 0. This means that
> > (most likely) any child has more points and can be killed erroneously.
> >
> > Also, "children has a different mm" doesn't match the reality, we
> > should check child->mm != t->mm. This check is not exactly correct
> > if t->mm == NULL but this doesn't really matter, oom_kill_task()
> > will kill them anyway.
> >
> > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> > too.
> >
>
> There're two issues you're addressing in this patch.  It only kills a 
> child in place of its selected parent when:
>
>  - the child has a higher badness score, and
>
>  - it has a different ->mm.
>
> In the former case, NACK, we always want to sacrifice children regardless
> of their badness score (as long as it is non-zero) if it has a separate
> ->mm in place of its parent,

Ah. So this was intentional?

OK. I was hypnotized by the security implications, and this looked so
"obviously wrong" to me.

But, of course I can't judge when it comes to oom's heuristic, and you
certainly know better.

So, thanks for correcting me.


Just a question... what about oom_kill_allocating_task? Probably
Documentation/sysctl/vm.txt should be updated.

Oleg.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
@ 2011-03-15 19:21                           ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() starts with victim_points == 0. This means that
> > (most likely) any child has more points and can be killed erroneously.
> >
> > Also, "children has a different mm" doesn't match the reality, we
> > should check child->mm != t->mm. This check is not exactly correct
> > if t->mm == NULL but this doesn't really matter, oom_kill_task()
> > will kill them anyway.
> >
> > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> > too.
> >
>
> There're two issues you're addressing in this patch.  It only kills a 
> child in place of its selected parent when:
>
>  - the child has a higher badness score, and
>
>  - it has a different ->mm.
>
> In the former case, NACK, we always want to sacrifice children regardless
> of their badness score (as long as it is non-zero) if it has a separate
> ->mm in place of its parent,

Ah. So this was intentional?

OK. I was hypnotized by the security implications, and this looked so
"obviously wrong" to me.

But, of course I can't judge when it comes to oom's heuristic, and you
certainly know better.

So, thanks for correcting me.


Just a question... what about oom_kill_allocating_task? Probably
Documentation/sysctl/vm.txt should be updated.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
                                 ` (3 preceding siblings ...)
  2011-03-12 19:40               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
@ 2011-03-13 11:36               ` KOSAKI Motohiro
  4 siblings, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-13 11:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin, David Rientjes

> > Note also the note about "p == current" check. it should be fixed too.
> 
> I am resending the fixes above plus the new one.
> 
> David, Kosaki, what do you think?

Oleg, could you please give me some testing time? Now TIF_MEMDIE has
really unclear and nasty meanings. mainly 1) to prevent multiple oom-killer
(see select_bad_process) and 2) to allow to use zone's last resort reserved
memory (see gfp_to_alloc_flags).  The latter has really problematic even if
apply or not apply your patch.

If no apply your patch, multi thread application have a lot of risk
to fail to exit when oom killed. because sub threads can't use reserved
memory and may get stucked exiting path.

if apply your patch, multi thread application also have a lot of risk
to fail to exit when oom killed. because if all thread used a little
reserved memory but it is not enough successful exit. It become deadlock.

The optimal way is, take a bonus one thread and successfull thread exiting
pass a bonus to sibling thread. But, who want to put performance overhead
into thread exiting path for only really really rare oom events? this is
the problem. therefore, I can't put ack until I've finished some test.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch] oom: prevent unnecessary oom kills or kernel panics
  2011-03-12 12:34           ` Oleg Nesterov
  2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
@ 2011-03-13  1:11             ` David Rientjes
  2011-03-13  1:15               ` [patch -mm] oom: avoid deferring oom killer if exiting task is being traced David Rientjes
  1 sibling, 1 reply; 69+ messages in thread
From: David Rientjes @ 2011-03-13  1:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On Sat, 12 Mar 2011, Oleg Nesterov wrote:

> > It's a problem, but not because of
> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.
> 
> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm
> thread is going to free the memory.
> 

We can add a check to see if a PF_EXITING thread will stall in the exit 
path as in your testcase, we do not need to filter threads that are still 
running that results in panics if nothing else is eligible in cpusets.

> > but its other threads do not and they trigger oom kills
> > themselves.  for_each_process() does not iterate over these threads and so
> > it finds no eligible threads to kill and then panics
> 
> Could you explain what do you mean? No need to kill these threads, they
> are already killed, we should wait until they all exit.
> 

Yes, and the check for PF_EXITING is intended to do exactly that (and 
give the thread access to memory reserves if it is trying to allocate 
memory itself).  The problem with your testcase is that the thread will 
indefinitely stall, so the appropriate fix is to detect that possibility 
and avoid the deferral if its possible.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [patch -mm] oom: avoid deferring oom killer if exiting task is being traced
  2011-03-13  1:11             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
@ 2011-03-13  1:15               ` David Rientjes
  2011-03-14 17:40                 ` Oleg Nesterov
  0 siblings, 1 reply; 69+ messages in thread
From: David Rientjes @ 2011-03-13  1:15 UTC (permalink / raw)
  To: Andrew Morton, Oleg Nesterov
  Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
	Andrey Vagin

The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm.  This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed.  This is
detected by PF_EXITING since threads that have already detached its ->mm
are no longer considered at all.

The problem with always deferring when a thread is PF_EXITING, however,
is that it may never actually exit when being traced, specifically if
another task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does
not want to defer in this case since there is no guarantee that thread
will ever exit without intervention.

This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path.  It also
ensures that a child is sacrificed for the chosen parent only if it has
a different ->mm as the comment implies: this ensures that the thread
group leader is always targeted appropriately.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/oom_kill.c |   40 +++++++++++++++++++++++++---------------
 1 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -31,6 +31,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mempolicy.h>
 #include <linux/security.h>
+#include <linux/ptrace.h>
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
@@ -316,22 +317,29 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 		if (test_tsk_thread_flag(p, TIF_MEMDIE))
 			return ERR_PTR(-1UL);
 
-		/*
-		 * This is in the process of releasing memory so wait for it
-		 * to finish before killing some other task by mistake.
-		 *
-		 * However, if p is the current task, we allow the 'kill' to
-		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
-		 * which will allow it to gain access to memory reserves in
-		 * the process of exiting and releasing its resources.
-		 * Otherwise we could get an easy OOM deadlock.
-		 */
 		if (p->flags & PF_EXITING) {
-			if (p != current)
-				return ERR_PTR(-1UL);
-
-			chosen = p;
-			*ppoints = 1000;
+			/*
+			 * If p is the current task and is in the process of
+			 * releasing memory, we allow the "kill" to set
+			 * TIF_MEMDIE, which will allow it to gain access to
+			 * memory reserves.  Otherwise, it may stall forever.
+			 *
+			 * The loop isn't broken here, however, in case other
+			 * threads are found to have already been oom killed.
+			 */
+			if (p == current) {
+				chosen = p;
+				*ppoints = 1000;
+			} else {
+				/*
+				 * If this task is not being ptraced on exit,
+				 * then wait for it to finish before killing
+				 * some other task unnecessarily.
+				 */
+				if (!(task_ptrace(p->group_leader) &
+							PT_TRACE_EXIT))
+					return ERR_PTR(-1UL);
+			}
 		}
 
 		points = oom_badness(p, mem, nodemask, totalpages);
@@ -493,6 +501,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
+			if (child->mm == p->mm)
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [patch -mm] oom: avoid deferring oom killer if exiting task is being traced
  2011-03-13  1:15               ` [patch -mm] oom: avoid deferring oom killer if exiting task is being traced David Rientjes
@ 2011-03-14 17:40                 ` Oleg Nesterov
  0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 17:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
	linux-mm, Andrey Vagin

On 03/12, David Rientjes wrote:
>
> The oom killer naturally defers killing anything if it finds an eligible
> task that is already exiting and has yet to detach its ->mm.  This avoids
> unnecessarily killing tasks when one is already in the exit path and may
> free enough memory that the oom killer is no longer needed.  This is
> detected by PF_EXITING since threads that have already detached its ->mm
> are no longer considered at all.

So, this is on top of oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch

I still can't understand which problems that patch tries to solve. Could
you explain why this patch helps in details?

Speaking of "already exiting and has yet to detach its ->mm", did you look
at "[PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check" ?

> The problem with always deferring when a thread is PF_EXITING, however,
> is that it may never actually exit when being traced, specifically if
> another task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does
> not want to defer in this case since there is no guarantee that thread
> will ever exit without intervention.

IOW, you are trying to fight with the test-case I sent,

> +			} else {
> +				/*
> +				 * If this task is not being ptraced on exit,
> +				 * then wait for it to finish before killing
> +				 * some other task unnecessarily.
> +				 */
> +				if (!(task_ptrace(p->group_leader) &
> +							PT_TRACE_EXIT))
> +					return ERR_PTR(-1UL);

No, this can't help afaics. It is trivial to change the exploit and
get the same result.

Perhaps I missed something, I didn't read this patch carefully. Will
try to do later.

However. could you please answer my question above?


Also. We have the serious and easily exploitable bugs, I think we
should fix them first. I'll report more details later today.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2011-03-15 21:26 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-01 19:09 [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
2011-03-03  1:20 ` KOSAKI Motohiro
2011-03-03 19:53   ` David Rientjes
2011-03-06 11:14     ` KOSAKI Motohiro
2011-03-06 22:06       ` David Rientjes
2011-03-08  0:24         ` KOSAKI Motohiro
2011-03-08  2:01         ` KOSAKI Motohiro
2011-03-08 13:42   ` Oleg Nesterov
2011-03-08 23:57     ` David Rientjes
2011-03-09 10:36       ` KOSAKI Motohiro
2011-03-09 11:06       ` Oleg Nesterov
2011-03-09 20:32         ` David Rientjes
2011-03-10 12:05           ` Oleg Nesterov
2011-03-10 15:40             ` [PATCH 0/1] Was: " Oleg Nesterov
2011-03-10 15:41               ` [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
2011-03-13  1:08                 ` David Rientjes
2011-03-10 16:36               ` [PATCH 0/1] select_bad_process: improve the PF_EXITING check Oleg Nesterov
2011-03-10 16:37                 ` [PATCH 1/1] " Oleg Nesterov
2011-03-10 16:40                 ` [PATCH 0/1] " Oleg Nesterov
2011-03-10 17:18                   ` [PATCH v2 " Oleg Nesterov
2011-03-10 17:19                     ` [PATCH v2 1/1] " Oleg Nesterov
2011-03-13  1:06             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
2011-03-09 23:19       ` Andrew Morton
2011-03-11 19:45         ` David Rientjes
2011-03-12 12:34           ` Oleg Nesterov
2011-03-12 13:43             ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
2011-03-12 13:44               ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
2011-03-13  1:14                 ` David Rientjes
2011-03-12 13:44               ` [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check Oleg Nesterov
2011-03-12 13:44               ` [PATCH 3/3] oom: select_bad_process: use same_thread_group() Oleg Nesterov
2011-03-12 19:40               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
2011-03-13  8:53                 ` KOSAKI Motohiro
2011-03-13 21:27                 ` Oleg Nesterov
2011-03-14 19:04                   ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:04                     ` Oleg Nesterov
2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
2011-03-14 19:04                       ` Oleg Nesterov
2011-03-14 19:35                       ` Linus Torvalds
2011-03-14 19:35                         ` Linus Torvalds
2011-03-14 20:31                         ` Oleg Nesterov
2011-03-14 20:31                           ` Oleg Nesterov
2011-03-14 20:32                         ` David Rientjes
2011-03-14 20:32                           ` David Rientjes
2011-03-15 19:12                           ` Oleg Nesterov
2011-03-15 19:12                             ` Oleg Nesterov
2011-03-15 19:51                             ` David Rientjes
2011-03-15 19:51                               ` David Rientjes
2011-03-14 20:22                       ` David Rientjes
2011-03-14 20:22                         ` David Rientjes
2011-03-15 18:53                         ` Oleg Nesterov
2011-03-15 18:53                           ` Oleg Nesterov
2011-03-15 19:54                           ` David Rientjes
2011-03-15 19:54                             ` David Rientjes
2011-03-15 21:16                             ` Oleg Nesterov
2011-03-15 21:16                               ` Oleg Nesterov
2011-03-14 19:05                     ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-14 19:05                       ` Oleg Nesterov
2011-03-14 20:50                       ` David Rientjes
2011-03-14 20:50                         ` David Rientjes
2011-03-14 19:05                     ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
2011-03-14 19:05                       ` Oleg Nesterov
2011-03-14 20:41                       ` David Rientjes
2011-03-14 20:41                         ` David Rientjes
2011-03-15 19:21                         ` Oleg Nesterov
2011-03-15 19:21                           ` Oleg Nesterov
2011-03-13 11:36               ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes KOSAKI Motohiro
2011-03-13  1:11             ` [patch] oom: prevent unnecessary oom kills or kernel panics David Rientjes
2011-03-13  1:15               ` [patch -mm] oom: avoid deferring oom killer if exiting task is being traced David Rientjes
2011-03-14 17:40                 ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.