[PATCH 0/3 for 2.6.38] oom: fixes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3 for 2.6.38] oom: fixes
       [not found]                 ` <20110313212726.GA24530@redhat.com>
@ 2011-03-14 19:04                   ` Oleg Nesterov
  2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
                                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.

I am going to give up, I do not understand the changes in -mm ;)

But I think we have other problems which should be fixed first.
Note that each fix has the "this is _not_ enough" comment. I was
going to _try_ to do something more complete later, but since we
are going to add more hacks^W subtle changes... lets fix the obvious
bugs first.

Add Linus. Hopefully the changes are simple enough. But once again,
we need more.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:04                   ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
@ 2011-03-14 19:04                     ` Oleg Nesterov
  2011-03-14 19:35                       ` Linus Torvalds
  2011-03-14 20:22                       ` David Rientjes
  2011-03-14 19:05                     ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
  2011-03-14 19:05                     ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
  2 siblings, 2 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
This is very wrong by many reasons. In particular, this thread can
be the dead group leader. Check p->mm != NULL.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 38/mm/oom_kill.c~1_kill_fix_pf_exiting	2011-03-14 17:53:05.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:51:49.000000000 +0100
@@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
 	 * If the task is already exiting, don't alarm the sysadmin or kill
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
-	if (p->flags & PF_EXITING) {
+	if (p->flags & PF_EXITING && p->mm) {
 		set_tsk_thread_flag(p, TIF_MEMDIE);
 		boost_dying_task_prio(p, mem);
 		return 0;


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
@ 2011-03-14 19:35                       ` Linus Torvalds
  2011-03-14 20:31                         ` Oleg Nesterov
  2011-03-14 20:32                         ` David Rientjes
  2011-03-14 20:22                       ` David Rientjes
  1 sibling, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.

Explain more, please. Maybe I'm missing some context because I wasn't
cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
and exit_mm() is called almost immediately afterwards which will set
p->mm to NULL.

So afaik, this will basically just remove the whole point of the code
entirely - so why not remove it then?

The combination of testing PF_EXITING and p->mm just doesn't seem to
make any sense.

                      Linus

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:35                       ` Linus Torvalds
@ 2011-03-14 20:31                         ` Oleg Nesterov
  2011-03-14 20:32                         ` David Rientjes
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
	Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, Linus Torvalds wrote:
>
> On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
>
> Explain more, please. Maybe I'm missing some context because I wasn't
> cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
> and exit_mm() is called almost immediately afterwards which will set
> p->mm to NULL.
>
> So afaik, this will basically just remove the whole point of the code
> entirely - so why not remove it then?

I am afraid I am going to lie... But iirc I tried to remove this code
before. Can't find the previous discussion, probably I am wrong.

Anyway. I never understood why do we have this special case.

> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.

To me, it doesn't make too much sense even if we do not check ->mm.

But. I _think_ the intent was to wait until this "exiting" process
does exit_mm() and frees the memory. This is like the
"the process of releasing memory " code in select_bad_process(). Once
again, this is only my speculation.

In any case, this patch doesn't pretend to be the right fix.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:35                       ` Linus Torvalds
  2011-03-14 20:31                         ` Oleg Nesterov
@ 2011-03-14 20:32                         ` David Rientjes
  2011-03-15 19:12                           ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Linus Torvalds wrote:

> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
> 

Right, it doesn't (and I recently removed testing the combination from 
select_bad_process() in -mm).  The check for PF_EXITING in the oom killer 
is purely to avoid needlessly killing tasks when something is already 
exiting and will (hopefully) be freeing its memory soon.  If an eligible 
thread is found to be PF_EXITING, the oom killer will defer indefinitely 
unless it happens to be current.  If it happens to be current, then it is 
automatically selected so it gets access to the needed memory reserves.

We do need to ensure that behavior doesn't preempt any task from being 
killed if there's an eligible thread other than current that never 
actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch 
in -mm filters all threads without an ->mm).  That can happen if 
mm->mmap_sem never gets released by a thread and that's why an earlier 
change that is already in your tree automatically gives current access to 
memory reserves immediately upon calling the oom killer if it has a 
pending SIGKILL.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 20:32                         ` David Rientjes
@ 2011-03-15 19:12                           ` Oleg Nesterov
  2011-03-15 19:51                             ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Linus Torvalds wrote:
>
> > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > make any sense.
> >
>
> Right, it doesn't (and I recently removed testing the combination from
> select_bad_process() in -mm).  The check for PF_EXITING in the oom killer
> is purely to avoid needlessly killing tasks when something is already
> exiting

Maybe 0/3 wasn't clear enough. This patches does not try to fix things,
it only tries to close the hole in 2.6.38. But it was already released
today.

> and will (hopefully) be freeing its memory soon.

This is not clear to me.

When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
and the changelog says:

	__oom_kill_task() is called to elevate the task's timeslice and give it
	access to memory reserves so that it may quickly exit.

	This privilege is unnecessary, however, if the task has already detached
	its mm.

Now you are saing this is pointless.

OK. I already said I do not understand this special case. Perhaps I'll ask
the questions later.

> If an eligible
> thread is found to be PF_EXITING,

The problem is, we can't trust per-thread PF_EXITING checks. But I guess
we will discuss this more anyway.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-15 19:12                           ` Oleg Nesterov
@ 2011-03-15 19:51                             ` David Rientjes
  0 siblings, 0 replies; 17+ messages in thread
From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Tue, 15 Mar 2011, Oleg Nesterov wrote:

> When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
> and the changelog says:
> 
> 	__oom_kill_task() is called to elevate the task's timeslice and give it
> 	access to memory reserves so that it may quickly exit.
> 
> 	This privilege is unnecessary, however, if the task has already detached
> 	its mm.
> 
> Now you are saing this is pointless.
> 

If you have the commit id, do a "git blame 8123681022", because I see a:

5081dde3 (Nick Piggin        2006-09-25 23:31:32 -0700 222)             if (!p->mm)
5081dde3 (Nick Piggin        2006-09-25 23:31:32 -0700 223)                     continue;

in select_bad_process() and it's also iterating over every thread:

a49335cc (Paul Jackson       2005-09-06 15:18:09 -0700 215)     do_each_thread(g, p) {

It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in 
-mm reintroduced the filter for !p->mm in select_bad_process() which was 
still there when 81236810 was merged; it's a small optimization, though, 
to avoid races where the mm becomes detached between the process' 
selection in select_bad_process() and its kill in oom_kill_process().

> The problem is, we can't trust per-thread PF_EXITING checks. But I guess
> we will discuss this more anyway.
> 

My approach, as you saw with 
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in 
-mm is to add exceptions to the oom killer when we can't trust that 
PF_EXITING will soon be exiting.  I think that's a much more long-term 
maintainable solution instead of inferring the status of a thread based on 
external circumstances (such as number of threads in the thread group) 
that could easily change out from under us and once again break the oom 
killer.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
  2011-03-14 19:35                       ` Linus Torvalds
@ 2011-03-14 20:22                       ` David Rientjes
  2011-03-15 18:53                         ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
> 

This is true only for the oom_kill_allocating_task sysctl where it is 
required in all cases to kill current; current won't be triggering the oom 
killer if it's dead.

oom_kill_process() is called with the thread selected by 
select_bad_process() and that function will not return any thread if any 
eligible task is found to be PF_EXITING and is not current, or any 
eligible task is found to have TIF_MEMDIE.

In other words, for this conditional to be true in oom_kill_process(), 
then p must be current and so it cannot be the dead group leader as 
specified in your changelog unless PF_EXITING gets set between 
select_bad_process() and the oom_kill_process() call: we don't care about 
that since it's in the exit path and we therefore want to give it access 
to memory reserves to quickly exit anyway and the check for PF_EXITING in 
select_bad_process() prevents any infinite loop of that task getting 
constantly reselected if it's dead.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting	2011-03-14 17:53:05.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:51:49.000000000 +0100
> @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
> -	if (p->flags & PF_EXITING) {
> +	if (p->flags & PF_EXITING && p->mm) {
>  		set_tsk_thread_flag(p, TIF_MEMDIE);
>  		boost_dying_task_prio(p, mem);
>  		return 0;
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-14 20:22                       ` David Rientjes
@ 2011-03-15 18:53                         ` Oleg Nesterov
  2011-03-15 19:54                           ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
> >
>
> This is true only for the oom_kill_allocating_task sysctl where it is
> required in all cases to kill current; current won't be triggering the oom
> killer if it's dead.
>
> oom_kill_process() is called with the thread selected by
> select_bad_process() and that function will not return any thread if any
> eligible task is found to be PF_EXITING and is not current, or any
> eligible task is found to have TIF_MEMDIE.
>
> In other words, for this conditional to be true in oom_kill_process(),
> then p must be current and so it cannot be the dead group leader as
> specified in your changelog unless PF_EXITING gets set between
> select_bad_process() and the oom_kill_process() call: we don't care about
> that since it's in the exit path and we therefore want to give it access
> to memory reserves to quickly exit anyway and the check for PF_EXITING in
> select_bad_process() prevents any infinite loop of that task getting
> constantly reselected if it's dead.

Confused. I sent the test-case. OK, may be you meant the code in -mm,
but I meant the current code.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-15 18:53                         ` Oleg Nesterov
@ 2011-03-15 19:54                           ` David Rientjes
  2011-03-15 21:16                             ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Tue, 15 Mar 2011, Oleg Nesterov wrote:

> Confused. I sent the test-case. OK, may be you meant the code in -mm,
> but I meant the current code.
> 

This entire discussion, and your involvement in it, originated from these 
two patches being merged into -mm:

	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
	oom-skip-zombies-when-iterating-tasklist.patch

So naturally I'm going to challenge your testcases with the latest -mm.  
If you wanted to suggest pushing these to 2.6.38 earlier, I don't think 
anyone would have disputed that -- I certainly wouldn't have since the 
first fixes a quite obvious panic that we've faced on a lot of our 
machines.  It's not that big of a deal, though, since Andrew has targeted 
them for -stable and they're on schedule to be pushed to 2.6.38.x

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
  2011-03-15 19:54                           ` David Rientjes
@ 2011-03-15 21:16                             ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/15, David Rientjes wrote:
>
> On Tue, 15 Mar 2011, Oleg Nesterov wrote:
>
> > Confused. I sent the test-case. OK, may be you meant the code in -mm,
> > but I meant the current code.
> >
>
> This entire discussion, and your involvement in it, originated from these
> two patches being merged into -mm:
>
> 	oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> 	oom-skip-zombies-when-iterating-tasklist.patch

Yes. This motivated me to look at the current code, and I was unpleasantly
surprised.

> So naturally I'm going to challenge your testcases with the latest -mm.

Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot
about the second patch.

> If you wanted to suggest pushing these to 2.6.38 earlier,

Yes. It was too late for 2.6.38. I thought we have more time before
release.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
  2011-03-14 19:04                   ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
  2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
@ 2011-03-14 19:05                     ` Oleg Nesterov
  2011-03-14 20:50                       ` David Rientjes
  2011-03-14 19:05                     ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
  2 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away it its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		    !p->exit_state && thread_group_empty(p))
 			return ERR_PTR(-1UL);
 
 		/*


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
  2011-03-14 19:05                     ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
@ 2011-03-14 20:50                       ` David Rientjes
  0 siblings, 0 replies; 17+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away it its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
> 

The equivalent of this change would be to set TIF_MEMDIE for all threads 
in a thread group when choosing a process to kill; as we've already 
discussed in your first series of patches, that has the risk of fully 
depleting memory reserves and causing the kernel the deadlock.  We want to 
limit TIF_MEMDIE to an oom killed task or to current when it is responding 
to a SIGKILL or already in the exit path because we know it's exiting and 
without memory reserves it may never exit.

This patch is even more concerning, however, because select_bad_process() 
isn't even guaranteed to select a thread from the same thread group this 
time.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> +		    !p->exit_state && thread_group_empty(p))
>  			return ERR_PTR(-1UL);
>  
>  		/*
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
  2011-03-14 19:04                   ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
  2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
  2011-03-14 19:05                     ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
@ 2011-03-14 19:05                     ` Oleg Nesterov
  2011-03-14 20:41                       ` David Rientjes
  2 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.

Also, "children has a different mm" doesn't match the reality, we
should check child->mm != t->mm. This check is not exactly correct
if t->mm == NULL but this doesn't really matter, oom_kill_task()
will kill them anyway.

Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~3_fix_kill_chld	2011-03-14 18:52:39.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 19:36:01.000000000 +0100
@@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
 			    struct mem_cgroup *mem, nodemask_t *nodemask,
 			    const char *message)
 {
-	struct task_struct *victim = p;
+	struct task_struct *victim;
 	struct task_struct *child;
-	struct task_struct *t = p;
-	unsigned int victim_points = 0;
+	struct task_struct *t;
+	unsigned int victim_points;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem, nodemask);
@@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
 	 * parent.  This attempts to lose the minimal amount of work done while
 	 * still freeing memory.
 	 */
+	victim_points = oom_badness(p, mem, nodemask, totalpages);
+	victim = p;
+	t = p;
 	do {
 		list_for_each_entry(child, &t->children, sibling) {
 			unsigned int child_points;
 
+			if (child->mm == t->mm)
+				continue;
 			/*
 			 * oom_badness() returns 0 if the thread is unkillable
 			 */


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
  2011-03-14 19:05                     ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
@ 2011-03-14 20:41                       ` David Rientjes
  2011-03-15 19:21                         ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
> 
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
> 
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
> 

There're two issues you're addressing in this patch.  It only kills a 
child in place of its selected parent when:

 - the child has a higher badness score, and

 - it has a different ->mm.

In the former case, NACK, we always want to sacrifice children regardless 
of their badness score (as long as it is non-zero) if it has a separate 
->mm in place of its parent, otherwise webservers will be killed instead 
of one of their children serving a client, sshd could be killed instead of 
bash, etc.  The behavior of the oom killer has always been to try to kill 
a child with its own ->mm first to avoid losing a large amount of work 
being done or unnecessarily killing a job scheduler, for example, when 
sacrificing a child would be satisfactory.  It'll kill additional tasks, 
and perhaps even the parent later if it has no more children, if the oom 
condition persists.

In the latter case, I agree, we should be testing if the child has a 
different ->mm before sacrificing it for its parent as the comment 
indicates it will.  I proposed that exact change in "oom: avoid deferring 
oom killer if exiting task is being traced" posted to -mm a couple days 
ago.

> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~3_fix_kill_chld	2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
>  			    struct mem_cgroup *mem, nodemask_t *nodemask,
>  			    const char *message)
>  {
> -	struct task_struct *victim = p;
> +	struct task_struct *victim;
>  	struct task_struct *child;
> -	struct task_struct *t = p;
> -	unsigned int victim_points = 0;
> +	struct task_struct *t;
> +	unsigned int victim_points;
>  
>  	if (printk_ratelimit())
>  		dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
>  	 * parent.  This attempts to lose the minimal amount of work done while
>  	 * still freeing memory.
>  	 */
> +	victim_points = oom_badness(p, mem, nodemask, totalpages);
> +	victim = p;
> +	t = p;
>  	do {
>  		list_for_each_entry(child, &t->children, sibling) {
>  			unsigned int child_points;
>  
> +			if (child->mm == t->mm)
> +				continue;
>  			/*
>  			 * oom_badness() returns 0 if the thread is unkillable
>  			 */
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
  2011-03-14 20:41                       ` David Rientjes
@ 2011-03-15 19:21                         ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() starts with victim_points == 0. This means that
> > (most likely) any child has more points and can be killed erroneously.
> >
> > Also, "children has a different mm" doesn't match the reality, we
> > should check child->mm != t->mm. This check is not exactly correct
> > if t->mm == NULL but this doesn't really matter, oom_kill_task()
> > will kill them anyway.
> >
> > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> > too.
> >
>
> There're two issues you're addressing in this patch.  It only kills a 
> child in place of its selected parent when:
>
>  - the child has a higher badness score, and
>
>  - it has a different ->mm.
>
> In the former case, NACK, we always want to sacrifice children regardless
> of their badness score (as long as it is non-zero) if it has a separate
> ->mm in place of its parent,

Ah. So this was intentional?

OK. I was hypnotized by the security implications, and this looked so
"obviously wrong" to me.

But, of course I can't judge when it comes to oom's heuristic, and you
certainly know better.

So, thanks for correcting me.


Just a question... what about oom_kill_allocating_task? Probably
Documentation/sysctl/vm.txt should be updated.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

[parent not found: <AANLkTikOdG7iTKDKq5mCYhcVz-rgZ_F2Ja78oBCOCQ91@mail.gmail.com>]

[parent not found: <alpine.DEB.2.00.1103141512310.4425@chino.kir.corp.google.com>]

[parent not found: <20110315194737.GE21640@redhat.com>]

[parent not found: <alpine.DEB.2.00.1103151259380.558@chino.kir.corp.google.com>]

[parent not found: <20110315212754.GB28117@redhat.com>]

[parent not found: <alpine.DEB.2.00.1103151530200.5099@chino.kir.corp.google.com>]

[parent not found: <20110316155310.GA9797@redhat.com>]

[parent not found: <alpine.DEB.2.00.1103161220110.9710@chino.kir.corp.google.com>]

[parent not found: <20110316202131.GA20790@redhat.com>]

[parent not found: <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>]

* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
       [not found]                 ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>
@ 2011-03-18 18:32                   ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2011-03-18 18:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-kernel

Sorry for delay...

Remove security, this has nothing to do with the released code.
But please see the question at the end...

On 03/16, David Rientjes wrote:
>
> On Wed, 16 Mar 2011, Oleg Nesterov wrote:
>
> > > > 	do {
> > > > 		list_for_each_entry(child, &t->children, sibling) {
> > > > 			unsigned int child_points;
> > > >
> > > > 			/*
> > > > 			 * oom_badness() returns 0 if the thread is unkillable
> > > > 			 */
> > > > 			child_points = oom_badness(child, mem, nodemask,
> > > > 								totalpages);
> > > >
> > > > child->mm can be NULL.
> > > >
> > >
> > > So child_points would be 0 here.
> >
> > Why? oom_badness() checks the whole group. group_leader can exit and
> > pass exit_mm(). But it still the leader and "represents" the whole
> > group even if it exits as thread.
> >
>
> If there are still child threads that have valid mm's, then they are
> eligible for oom kill and all threads sharing that mm will be killed once
> passed to oom_kill_task().  That may be the same as the selected task, p,
> passed to oom_kill_process() but all threads that share the mm would have
> to be killed anyway to free memory.

Not sure I understand... Yes, oom_kill_task() kills all processes that
share the same ->mm. (but to remind, "q->mm == mm" is not right for the
same reason, q->mm can be NULL). But the code above should filter out
the tasks with the same ->mm. It can't.

OK, this is really minor. CLONE_VM processes with the dead leader, this
is really exotics.

> > > > 	if (p->flags & PF_EXITING) {
> > > > 		set_tsk_thread_flag(p, TIF_MEMDIE);
> > > > 		boost_dying_task_prio(p, mem);
> > > > 		return 0;
> > > > 	}
> > > >
> > > > in oom_kill_process() whith -mm patches?
> > > >
> > > > We know that this thread (not process) was chosen by select_bad_process()
> > > > and p->mm != NULL. As Linus rightly pointed, this means this code can only
> > > > work in the small window between exit_signals() and exit_mm().
> > > >
> > > > So, what is the point?
> > > >
> > >
> > > Because there's no need to SIGKILL the task or emit anything to the kernel
> > > log.  We don't want anybody thinking that the oom killer killed it when it
> > > was already exiting on its own.
> >
> > OK. But this case is very unlikely. And I am still trying to understand
> > why this special case is important. But I can't.
> >
>
> It's actually not unlikely at all if mm->mmap_sem is held.

Do you mean OOM from with down_write(mmap_sem) ? OK, in this case we can
see a lot of PF_EXITING && mm threads. But this means they are likely
sleeping in exit_mm()->down_read(), how the code above can help?

> > 	> The combination of testing PF_EXITING and p->mm just doesn't seem to
> > 	> make any sense.
> > 	>
> >
> > 	Right, it doesn't (and I recently removed testing the combination from
> > 	select_bad_process() in -mm).
> >
> > How so? This is what we have now, no?
> >
>
> It's not required functionally for the oom killer,

OK, thanks.

> If any other threads can't actually exit yet,
> then they will automatically be selected when they invoke the oom killer
> (we automatically select current if it is PF_EXITING and the oom killer
> iterates over all threads in -mm) so we don't need to be concerned about
> them stalling at this point.

Again, it is unlikely that another thread triggers oom between exit_signals()
and exit_mm().

And what "other threads" actually mean? If you mean that we already killed
this process (iow, oom_kill_task() sent SIGKILL to any sub-thread in this
group) then yes, this thread probably needs TIF_MEMDIE.

But. In this case current won't call select_bad_process() at all. We have
the fatal_signal_pending() check at the top of out_of_memory(), and this
is the "special" case in oom_kill.c I can understand. I hope ;)

Btw. fatal_signal_pending() is not really good... it can be false negative.
signal_group_exit() looks better.

> In the quote above, Linus was referring to testing PF_EXITING and p->mm in
> oom_kill_process().  It doesn't make any sense if we have already filtered
> p->mm in select_bad_process()

No, I don't think this was the point.

This was discussed assuming the current code, select_bad_process() doesn't
filter !mm threads, and it is not per-thread.

> and we don't want to needlessly kill any
> children because p has executed exit_mm() between its selection and its
> kill: it's on the exit path and will probably be freeing memory soon.

OK, this is reasonable. And this is what I can understand. But this
case looks unlikely, and I am not sure it is right, please see below.

> While this code inspection is interesting, what would probably be more
> interesting is if you have any test cases that are problematic on the
> latest -mm tree

I sent one. it wasn't tested, but should be problematic. Doesn't really
matter, we can fix this.

I am just trying to understand the new "per-thread" direction. I can't.

OK. For example. Two threads T1 and T2. This process uses a lot of memory.

	1. T2 does, say, do_brk() and triggers OOM

	2. T2 calls out_of_memory->select_bad_process() and starts the
	   main do_each_thread() loop.

	   It finds T1, then T2. oom_badness() returns the same value,
	   so select_bad_process() returns T1.

	4. T1 exits, calls exit_mm() and sleeps on down_read().

	5. T2 calls oom_kill_process(), sees PF_EXITING, does
	   set_tsk_thread_flag(T1, TIF_MEMDIE) and returns.

Now. out_of_memory() will be called again, but select_bad_process()
is fooled. It will see T1 before T2 and return ERR_PTR() because of
T1 has TIF_MEMDIE.

And T2 can't access the memory reserves because it lacks TIF_MEMDIE.

No?

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-03-18 18:41 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <alpine.DEB.2.00.1103011108400.28110@chino.kir.corp.google.com>
     [not found] ` <20110303100030.B936.A69D9226@jp.fujitsu.com>
     [not found]   ` <20110308134233.GA26884@redhat.com>
     [not found]     ` <alpine.DEB.2.00.1103081549530.27910@chino.kir.corp.google.com>
     [not found]       ` <20110309151946.dea51cde.akpm@linux-foundation.org>
     [not found]         ` <alpine.DEB.2.00.1103111142260.30699@chino.kir.corp.google.com>
     [not found]           ` <20110312123413.GA18351@redhat.com>
     [not found]             ` <20110312134341.GA27275@redhat.com>
     [not found]               ` <AANLkTinHGSb2_jfkwx=Wjv96phzPCjBROfCTFCKi4Wey@mail.gmail.com>
     [not found]                 ` <20110313212726.GA24530@redhat.com>
2011-03-14 19:04                   ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:04                     ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
2011-03-14 19:35                       ` Linus Torvalds
2011-03-14 20:31                         ` Oleg Nesterov
2011-03-14 20:32                         ` David Rientjes
2011-03-15 19:12                           ` Oleg Nesterov
2011-03-15 19:51                             ` David Rientjes
2011-03-14 20:22                       ` David Rientjes
2011-03-15 18:53                         ` Oleg Nesterov
2011-03-15 19:54                           ` David Rientjes
2011-03-15 21:16                             ` Oleg Nesterov
2011-03-14 19:05                     ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-14 20:50                       ` David Rientjes
2011-03-14 19:05                     ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
2011-03-14 20:41                       ` David Rientjes
2011-03-15 19:21                         ` Oleg Nesterov
     [not found] <AANLkTikOdG7iTKDKq5mCYhcVz-rgZ_F2Ja78oBCOCQ91@mail.gmail.com>
     [not found] ` <alpine.DEB.2.00.1103141512310.4425@chino.kir.corp.google.com>
     [not found]   ` <20110315194737.GE21640@redhat.com>
     [not found]     ` <alpine.DEB.2.00.1103151259380.558@chino.kir.corp.google.com>
     [not found]       ` <20110315212754.GB28117@redhat.com>
     [not found]         ` <alpine.DEB.2.00.1103151530200.5099@chino.kir.corp.google.com>
     [not found]           ` <20110316155310.GA9797@redhat.com>
     [not found]             ` <alpine.DEB.2.00.1103161220110.9710@chino.kir.corp.google.com>
     [not found]               ` <20110316202131.GA20790@redhat.com>
     [not found]                 ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>
2011-03-18 18:32                   ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox