public inbox for linux-kernel@vger.kernel.org
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
  2011-03-14 19:04                 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
@ 2011-03-14 19:05                   ` Oleg Nesterov
  2011-03-14 20:50                     ` David Rientjes
  0 siblings, 1 reply; 3+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
  To: Hugh Dickins, Linus Torvalds
  Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
	David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away if its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		    !p->exit_state && thread_group_empty(p))
 			return ERR_PTR(-1UL);
 
 		/*
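
The effect of the changed condition can be sketched in userspace. This is a
minimal model, not kernel code: struct model_task and both helper names are
hypothetical, and each field only stands in for the kernel expression named
in its comment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the task_struct state the check reads. */
struct model_task {
	bool tif_memdie;   /* models test_tsk_thread_flag(p, TIF_MEMDIE) */
	int exit_state;    /* models p->exit_state, nonzero once the task is a zombie */
	bool group_empty;  /* models thread_group_empty(p) */
};

/* Check before the patch: any TIF_MEMDIE task makes the scan give up. */
static bool blocks_scan_old(const struct model_task *p)
{
	return p->tif_memdie;
}

/* Check as written in the patch: the task must also have no exit_state
 * and must be the only thread left in its group. */
static bool blocks_scan_new(const struct model_task *p)
{
	return p->tif_memdie && !p->exit_state && p->group_empty;
}
```

With this model, a TIF_MEMDIE zombie stops the old scan but not the new one,
while a live single-threaded TIF_MEMDIE task stops both.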



* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
  2011-03-14 19:05                   ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
@ 2011-03-14 20:50                     ` David Rientjes
  0 siblings, 0 replies; 3+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
	linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away if its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
> 

The equivalent of this change would be to set TIF_MEMDIE for all threads 
in a thread group when choosing a process to kill; as we've already 
discussed in your first series of patches, that has the risk of fully 
depleting memory reserves and causing the kernel to deadlock.  We want to 
limit TIF_MEMDIE to an oom killed task or to current when it is responding 
to a SIGKILL or already in the exit path because we know it's exiting and 
without memory reserves it may never exit.

This patch is even more concerning, however, because select_bad_process() 
isn't even guaranteed to select a thread from the same thread group this 
time.

> Note: this is _not_ enough. Just a minimal fix.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
> 
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> +		    !p->exit_state && thread_group_empty(p))
>  			return ERR_PTR(-1UL);
>  
>  		/*
> 
> 


* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
       [not found]                 ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>
@ 2011-03-18 18:32                   ` Oleg Nesterov
  0 siblings, 0 replies; 3+ messages in thread
From: Oleg Nesterov @ 2011-03-18 18:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
	KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-kernel

Sorry for delay...

Remove security, this has nothing to do with the released code.
But please see the question at the end...

On 03/16, David Rientjes wrote:
>
> On Wed, 16 Mar 2011, Oleg Nesterov wrote:
>
> > > > 	do {
> > > > 		list_for_each_entry(child, &t->children, sibling) {
> > > > 			unsigned int child_points;
> > > >
> > > > 			/*
> > > > 			 * oom_badness() returns 0 if the thread is unkillable
> > > > 			 */
> > > > 			child_points = oom_badness(child, mem, nodemask,
> > > > 								totalpages);
> > > >
> > > > child->mm can be NULL.
> > > >
> > >
> > > So child_points would be 0 here.
> >
> > Why? oom_badness() checks the whole group. group_leader can exit and
> > pass exit_mm(). But it is still the leader and "represents" the whole
> > group even if it exits as thread.
> >
>
> If there are still child threads that have valid mm's, then they are
> eligible for oom kill and all threads sharing that mm will be killed once
> passed to oom_kill_task().  That may be the same as the selected task, p,
> passed to oom_kill_process() but all threads that share the mm would have
> to be killed anyway to free memory.

Not sure I understand... Yes, oom_kill_task() kills all processes that
share the same ->mm. (but to remind, "q->mm == mm" is not right for the
same reason, q->mm can be NULL). But the code above should filter out
the tasks with the same ->mm. It can't.

OK, this is really minor. CLONE_VM processes with a dead leader are
really exotic.

> > > > 	if (p->flags & PF_EXITING) {
> > > > 		set_tsk_thread_flag(p, TIF_MEMDIE);
> > > > 		boost_dying_task_prio(p, mem);
> > > > 		return 0;
> > > > 	}
> > > >
> > > > in oom_kill_process() with -mm patches?
> > > >
> > > > We know that this thread (not process) was chosen by select_bad_process()
> > > > and p->mm != NULL. As Linus rightly pointed, this means this code can only
> > > > work in the small window between exit_signals() and exit_mm().
> > > >
> > > > So, what is the point?
> > > >
> > >
> > > Because there's no need to SIGKILL the task or emit anything to the kernel
> > > log.  We don't want anybody thinking that the oom killer killed it when it
> > > was already exiting on its own.
> >
> > OK. But this case is very unlikely. And I am still trying to understand
> > why this special case is important. But I can't.
> >
>
> It's actually not unlikely at all if mm->mmap_sem is held.

Do you mean OOM from within down_write(mmap_sem)? OK, in this case we can
see a lot of PF_EXITING && mm threads. But this means they are likely
sleeping in exit_mm()->down_read(), so how can the code above help?

> > 	> The combination of testing PF_EXITING and p->mm just doesn't seem to
> > 	> make any sense.
> > 	>
> >
> > 	Right, it doesn't (and I recently removed testing the combination from
> > 	select_bad_process() in -mm).
> >
> > How so? This is what we have now, no?
> >
>
> It's not required functionally for the oom killer,

OK, thanks.

> If any other threads can't actually exit yet,
> then they will automatically be selected when they invoke the oom killer
> (we automatically select current if it is PF_EXITING and the oom killer
> iterates over all threads in -mm) so we don't need to be concerned about
> them stalling at this point.

Again, it is unlikely that another thread triggers oom between exit_signals()
and exit_mm().

And what does "other threads" actually mean? If you mean that we already killed
this process (iow, oom_kill_task() sent SIGKILL to any sub-thread in this
group) then yes, this thread probably needs TIF_MEMDIE.

But. In this case current won't call select_bad_process() at all. We have
the fatal_signal_pending() check at the top of out_of_memory(), and this
is the "special" case in oom_kill.c I can understand. I hope ;)

Btw. fatal_signal_pending() is not really good... it can give a false negative.
signal_group_exit() looks better.
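
One possible reading of the false negative, which the thread does not spell
out: once a thread has dequeued SIGKILL and entered the group-exit path, its
own pending state is clear while the group is still marked as exiting. The
userspace model below is a sketch under that assumption. struct
model_sig_task and the model_* helpers are hypothetical names; the kernel's
signal_group_exit() takes the signal_struct, which is flattened into the task
here for brevity.

```c
#include <stdbool.h>

/* Hypothetical, flattened stand-in for the state the two helpers read. */
struct model_sig_task {
	bool tif_sigpending;      /* models signal_pending(p) */
	bool sigkill_in_pending;  /* models SIGKILL being set in p->pending.signal */
	bool group_exit_flag;     /* models sig->flags & SIGNAL_GROUP_EXIT */
	bool group_exit_task_set; /* models sig->group_exit_task != NULL */
};

/* Models fatal_signal_pending(): true only while SIGKILL is still queued. */
static bool model_fatal_signal_pending(const struct model_sig_task *p)
{
	return p->tif_sigpending && p->sigkill_in_pending;
}

/* Models signal_group_exit(): true for as long as the group is exiting. */
static bool model_signal_group_exit(const struct model_sig_task *p)
{
	return p->group_exit_flag || p->group_exit_task_set;
}
```

In this model a thread that has already dequeued SIGKILL reports false from
the first helper and true from the second.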

> In the quote above, Linus was referring to testing PF_EXITING and p->mm in
> oom_kill_process().  It doesn't make any sense if we have already filtered
> p->mm in select_bad_process()

No, I don't think this was the point.

This was discussed assuming the current code, select_bad_process() doesn't
filter !mm threads, and it is not per-thread.

> and we don't want to needlessly kill any
> children because p has executed exit_mm() between its selection and its
> kill: it's on the exit path and will probably be freeing memory soon.

OK, this is reasonable. And this is what I can understand. But this
case looks unlikely, and I am not sure it is right, please see below.

> While this code inspection is interesting, what would probably be more
> interesting is if you have any test cases that are problematic on the
> latest -mm tree

I sent one. It wasn't tested, but should be problematic. Doesn't really
matter, we can fix this.

I am just trying to understand the new "per-thread" direction. I can't.

OK. For example. Two threads T1 and T2. This process uses a lot of memory.

	1. T2 does, say, do_brk() and triggers OOM

	2. T2 calls out_of_memory->select_bad_process() and starts the
	   main do_each_thread() loop.

	3. It finds T1, then T2. oom_badness() returns the same value,
	   so select_bad_process() returns T1.

	4. T1 exits, calls exit_mm() and sleeps on down_read().

	5. T2 calls oom_kill_process(), sees PF_EXITING, does
	   set_tsk_thread_flag(T1, TIF_MEMDIE) and returns.

Now. out_of_memory() will be called again, but select_bad_process()
is fooled. It will see T1 before T2 and return ERR_PTR() because
T1 has TIF_MEMDIE.

And T2 can't access the memory reserves because it lacks TIF_MEMDIE.

No?
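
The T1/T2 sequence above can be replayed with a small userspace model. This
is only a sketch under simplifying assumptions: struct model_thread,
model_select() and model_kill() are hypothetical names, the scan keeps the
first thread with the highest score, and the only special case modelled in
the kill step is the PF_EXITING shortcut quoted earlier.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread state used in the scenario. */
struct model_thread {
	const char *name;
	bool tif_memdie;     /* models TIF_MEMDIE */
	bool pf_exiting;     /* models p->flags & PF_EXITING */
	unsigned int points; /* models the oom_badness() result */
};

enum scan_result { SCAN_NONE, SCAN_ABORT, SCAN_PICKED };

/* Simplified per-thread scan: give up as soon as a TIF_MEMDIE thread is
 * seen (the ERR_PTR(-1UL) case), otherwise keep the first thread with the
 * highest score. */
static enum scan_result model_select(struct model_thread *t, size_t n,
				     struct model_thread **chosen)
{
	unsigned int best = 0;
	size_t i;

	*chosen = NULL;
	for (i = 0; i < n; i++) {
		if (t[i].tif_memdie) {
			*chosen = NULL;
			return SCAN_ABORT;
		}
		if (t[i].points > best) {
			best = t[i].points;
			*chosen = &t[i];
		}
	}
	return *chosen ? SCAN_PICKED : SCAN_NONE;
}

/* Models the PF_EXITING shortcut: mark the thread and return without
 * killing anything. Returns true if the shortcut was taken. */
static bool model_kill(struct model_thread *p)
{
	if (p->pf_exiting) {
		p->tif_memdie = true;
		return true;
	}
	return false;
}
```

Running the steps in order (select, mark T1 as exiting, kill, select again)
leaves T1 with the flag, T2 without it, and the second scan aborting at T1.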

Oleg.


