* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
From: Oleg Nesterov @ 2011-03-18 18:32 UTC
To: David Rientjes
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
    KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-kernel

Sorry for the delay...

Remove security, this has nothing to do with the released code. But please
see the question at the end...

On 03/16, David Rientjes wrote:
>
> On Wed, 16 Mar 2011, Oleg Nesterov wrote:
>
> > > > 	do {
> > > > 		list_for_each_entry(child, &t->children, sibling) {
> > > > 			unsigned int child_points;
> > > >
> > > > 			/*
> > > > 			 * oom_badness() returns 0 if the thread is unkillable
> > > > 			 */
> > > > 			child_points = oom_badness(child, mem, nodemask,
> > > > 								totalpages);
> > > >
> > > > child->mm can be NULL.
> > > >
> > > > So child_points would be 0 here.
> >
> > Why? oom_badness() checks the whole group. group_leader can exit and
> > pass exit_mm(). But it is still the leader and "represents" the whole
> > group even if it exits as thread.
> >
>
> If there are still child threads that have valid mm's, then they are
> eligible for oom kill and all threads sharing that mm will be killed once
> passed to oom_kill_task(). That may be the same as the selected task, p,
> passed to oom_kill_process() but all threads that share the mm would have
> to be killed anyway to free memory.

Not sure I understand...

Yes, oom_kill_task() kills all processes that share the same ->mm.
(but to remind, "q->mm == mm" is not right for the same reason, q->mm
can be NULL).

But the code above should filter out the tasks with the same ->mm.
It can't.

OK, this is really minor. CLONE_VM processes with the dead leader,
this is really exotic.

> > > > 	if (p->flags & PF_EXITING) {
> > > > 		set_tsk_thread_flag(p, TIF_MEMDIE);
> > > > 		boost_dying_task_prio(p, mem);
> > > > 		return 0;
> > > > 	}
> > > >
> > > > in oom_kill_process() with -mm patches?
> > > >
> > > > We know that this thread (not process) was chosen by select_bad_process()
> > > > and p->mm != NULL. As Linus rightly pointed out, this means this code can only
> > > > work in the small window between exit_signals() and exit_mm().
> > > >
> > > > So, what is the point?
> > > >
> > >
> > > Because there's no need to SIGKILL the task or emit anything to the kernel
> > > log. We don't want anybody thinking that the oom killer killed it when it
> > > was already exiting on its own.
> >
> > OK. But this case is very unlikely. And I am still trying to understand
> > why this special case is important. But I can't.
> >
>
> It's actually not unlikely at all if mm->mmap_sem is held.

Do you mean OOM from with down_write(mmap_sem) ? OK, in this case
we can see a lot of PF_EXITING && mm threads. But this means they
are likely sleeping in exit_mm()->down_read(), how can the code above
help?

> > > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > > make any sense.
> > >
> >
> > Right, it doesn't (and I recently removed testing the combination from
> > select_bad_process() in -mm).
> >
> > How so? This is what we have now, no?
> >
>
> It's not required functionally for the oom killer,

OK, thanks.

> If any other threads can't actually exit yet,
> then they will automatically be selected when they invoke the oom killer
> (we automatically select current if it is PF_EXITING and the oom killer
> iterates over all threads in -mm) so we don't need to be concerned about
> them stalling at this point.

Again, it is unlikely that another thread triggers oom between
exit_signals() and exit_mm().

And what does "other threads" actually mean?

If you mean that we already killed this process (iow, oom_kill_task()
sent SIGKILL to any sub-thread in this group) then yes, this thread
probably needs TIF_MEMDIE.

But. In this case current won't call select_bad_process() at all.
We have the fatal_signal_pending() check at the top of out_of_memory(),
and this is the "special" case in oom_kill.c I can understand. I hope ;)

Btw. fatal_signal_pending() is not really good... it can give a false
negative. signal_group_exit() looks better.

> In the quote above, Linus was referring to testing PF_EXITING and p->mm in
> oom_kill_process(). It doesn't make any sense if we have already filtered
> p->mm in select_bad_process()

No, I don't think this was the point. This was discussed assuming the
current code, select_bad_process() doesn't filter !mm threads, and it
is not per-thread.

> and we don't want to needlessly kill any
> children because p has executed exit_mm() between its selection and its
> kill: it's on the exit path and will probably be freeing memory soon.

OK, this is reasonable. And this is what I can understand. But this
case looks unlikely, and I am not sure it is right, please see below.

> While this code inspection is interesting, what would probably be more
> interesting is if you have any test cases that are problematic on the
> latest -mm tree

I sent one. It wasn't tested, but should be problematic. Doesn't really
matter, we can fix this. I am just trying to understand the new
"per-thread" direction. I can't.

OK. For example. Two threads T1 and T2. This process uses a lot of memory.

	1. T2 does, say, do_brk() and triggers OOM

	2. T2 calls out_of_memory->select_bad_process() and starts
	   the main do_each_thread() loop.

	   It finds T1, then T2. oom_badness() returns the same value,
	   so select_bad_process() returns T1.

	4. T1 exits, calls exit_mm() and sleeps on down_read().

	5. T2 calls oom_kill_process(), sees PF_EXITING, does
	   set_tsk_thread_flag(T1, TIF_MEMDIE) and returns.

Now. out_of_memory() will be called again, but select_bad_process()
is fooled. It will see T1 before T2 and return ERR_PTR() because T1
has TIF_MEMDIE. And T2 can't access the memory reserves because it
lacks TIF_MEMDIE.

No?

Oleg.

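The T1/T2 sequence above can be modelled in user space. What follows is only
a minimal sketch under stated assumptions, not kernel code: struct sim_task,
sim_select_bad_process(), sim_kill_exiting() and SIM_ABORT are hypothetical
stand-ins for task_struct, select_bad_process(), the PF_EXITING branch of
oom_kill_process() and ERR_PTR(-1UL). The loop keeps only the TIF_MEMDIE
check and the badness comparison and leaves out everything else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for task_struct; not kernel code. */
struct sim_task {
	const char *name;
	bool memdie;		/* models TIF_MEMDIE */
	unsigned int points;	/* models the oom_badness() result */
};

/* Stand-in for ERR_PTR(-1UL): "a victim is already dying, do nothing". */
static struct sim_task sim_abort_marker;
#define SIM_ABORT (&sim_abort_marker)

/*
 * Simplified model of the selection loop: the first thread found with
 * TIF_MEMDIE aborts the scan, otherwise the thread with the highest
 * score wins and ties keep the earlier thread.
 */
static struct sim_task *sim_select_bad_process(struct sim_task *tasks, size_t n)
{
	struct sim_task *chosen = NULL;
	unsigned int best = 0;
	size_t i;

	assert(tasks != NULL);
	for (i = 0; i < n; i++) {
		struct sim_task *p = &tasks[i];

		if (p->memdie)
			return SIM_ABORT;
		if (p->points > best) {
			chosen = p;
			best = p->points;
		}
	}
	return chosen;
}

/* Models the PF_EXITING shortcut: flag the victim, send no SIGKILL. */
static void sim_kill_exiting(struct sim_task *victim)
{
	assert(victim != NULL && victim != SIM_ABORT);
	victim->memdie = true;
}
```

With T1 and T2 both scoring the same, the first call picks T1. Once
sim_kill_exiting() has flagged T1, every later call returns SIM_ABORT while
T2 never gets the flag, which is the stall the scenario describes.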
* [PATCH 0/3 for 2.6.38] oom: fixes
From: Oleg Nesterov @ 2011-03-14 19:04 UTC
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
    David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.

I am going to give up, I do not understand the changes in -mm ;)
But I think we have other problems which should be fixed first.

Note that each fix has the "this is _not_ enough" comment. I was going
to _try_ to do something more complete later, but since we are going
to add more hacks^W subtle changes... let's fix the obvious bugs first.

Add Linus. Hopefully the changes are simple enough. But once again,
we need more.

Oleg.

* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
From: Oleg Nesterov @ 2011-03-14 19:05 UTC
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
    David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel

select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away if its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.

Note: this is _not_ enough. Just a minimal fix.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 mm/oom_kill.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
 		 * blocked waiting for another task which itself is waiting
 		 * for memory. Is there a better alternative?
 		 */
-		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+		    !p->exit_state && thread_group_empty(p))
 			return ERR_PTR(-1UL);
 
 		/*

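To compare the check before and after this patch outside the kernel, the two
conditions can be written as plain predicates. This is a sketch only: struct
memdie_task and both function names are hypothetical, and the three fields
merely model test_tsk_thread_flag(p, TIF_MEMDIE), p->exit_state and
thread_group_empty(p) as they are used in the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task_struct state the check reads. */
struct memdie_task {
	bool memdie;		/* models test_tsk_thread_flag(p, TIF_MEMDIE) */
	int exit_state;		/* models p->exit_state, nonzero once exited */
	bool group_empty;	/* models thread_group_empty(p) */
};

/* Check before the patch: any TIF_MEMDIE task aborts the selection. */
static bool memdie_blocks_old(const struct memdie_task *p)
{
	assert(p != NULL);
	return p->memdie;
}

/* Check after the patch: a task with nonzero exit_state no longer aborts it. */
static bool memdie_blocks_patched(const struct memdie_task *p)
{
	assert(p != NULL);
	return p->memdie && !p->exit_state && p->group_empty;
}
```

For a TIF_MEMDIE task whose exit_state is nonzero, the old predicate returns
true and the patched one returns false, so such a zombie no longer makes the
selection loop bail out.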
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
From: David Rientjes @ 2011-03-14 20:50 UTC
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
    KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel

On Mon, 14 Mar 2011, Oleg Nesterov wrote:

> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away if its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
>

The equivalent of this change would be to set TIF_MEMDIE for all threads
in a thread group when choosing a process to kill; as we've already
discussed in your first series of patches, that has the risk of fully
depleting memory reserves and causing the kernel to deadlock. We want to
limit TIF_MEMDIE to an oom killed task or to current when it is responding
to a SIGKILL or already in the exit path because we know it's exiting and
without memory reserves it may never exit.

This patch is even more concerning, however, because select_bad_process()
isn't even guaranteed to select a thread from the same thread group this
time.

> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
>  mm/oom_kill.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie	2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c	2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
>  		 * blocked waiting for another task which itself is waiting
>  		 * for memory. Is there a better alternative?
>  		 */
> -		if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +		if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> +		    !p->exit_state && thread_group_empty(p))
>  			return ERR_PTR(-1UL);
>
>  		/*
>