* [PATCH 0/3 for 2.6.38] oom: fixes [not found] ` <20110313212726.GA24530@redhat.com> @ 2011-03-14 19:04 ` Oleg Nesterov 2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel On 03/13, Oleg Nesterov wrote: > > The _trivial_ exploit (distinct from this one) can kill the system. > ... > I'll return to this tomorrow. I am going to give up, I do not understand the changes in -mm ;) But I think we have other problems which should be fixed first. Note that each fix has the "this is _not_ enough" comment. I was going to _try_ to do something more complete later, but since we are going to add more hacks^W subtle changes... lets fix the obvious bugs first. Add Linus. Hopefully the changes are simple enough. But once again, we need more. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov @ 2011-03-14 19:04 ` Oleg Nesterov 2011-03-14 19:35 ` Linus Torvalds 2011-03-14 20:22 ` David Rientjes 2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov 2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov 2 siblings, 2 replies; 17+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING. This is very wrong by many reasons. In particular, this thread can be the dead group leader. Check p->mm != NULL. Note: this is _not_ enough. Just a minimal fix. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/oom_kill.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100 +++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100 @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_ * If the task is already exiting, don't alarm the sysadmin or kill * its children or threads, just set TIF_MEMDIE so it can die quickly */ - if (p->flags & PF_EXITING) { + if (p->flags & PF_EXITING && p->mm) { set_tsk_thread_flag(p, TIF_MEMDIE); boost_dying_task_prio(p, mem); return 0; ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov @ 2011-03-14 19:35 ` Linus Torvalds 2011-03-14 20:31 ` Oleg Nesterov 2011-03-14 20:32 ` David Rientjes 2011-03-14 20:22 ` David Rientjes 1 sibling, 2 replies; 17+ messages in thread From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote: > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING. > This is very wrong by many reasons. In particular, this thread can > be the dead group leader. Check p->mm != NULL. Explain more, please. Maybe I'm missing some context because I wasn't cc'd on the original thread, but PF_EXITING gets set by exit_signal(), and exit_mm() is called almost immediately afterwards which will set p->mm to NULL. So afaik, this will basically just remove the whole point of the code entirely - so why not remove it then? The combination of testing PF_EXITING and p->mm just doesn't seem to make any sense. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 19:35 ` Linus Torvalds @ 2011-03-14 20:31 ` Oleg Nesterov 2011-03-14 20:32 ` David Rientjes 1 sibling, 0 replies; 17+ messages in thread From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw) To: Linus Torvalds Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel On 03/14, Linus Torvalds wrote: > > On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote: > > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING. > > This is very wrong by many reasons. In particular, this thread can > > be the dead group leader. Check p->mm != NULL. > > Explain more, please. Maybe I'm missing some context because I wasn't > cc'd on the original thread, but PF_EXITING gets set by exit_signal(), > and exit_mm() is called almost immediately afterwards which will set > p->mm to NULL. > > So afaik, this will basically just remove the whole point of the code > entirely - so why not remove it then? I am afraid I am going to lie... But iirc I tried to remove this code before. Can't find the previous discussion, probably I am wrong. Anyway. I never understood why do we have this special case. > The combination of testing PF_EXITING and p->mm just doesn't seem to > make any sense. To me, it doesn't make too much sense even if we do not check ->mm. But. I _think_ the intent was to wait until this "exiting" process does exit_mm() and frees the memory. This is like the "the process of releasing memory " code in select_bad_process(). Once again, this is only my speculation. In any case, this patch doesn't pretend to be the right fix. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 19:35 ` Linus Torvalds 2011-03-14 20:31 ` Oleg Nesterov @ 2011-03-14 20:32 ` David Rientjes 2011-03-15 19:12 ` Oleg Nesterov 1 sibling, 1 reply; 17+ messages in thread From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw) To: Linus Torvalds Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Mon, 14 Mar 2011, Linus Torvalds wrote: > The combination of testing PF_EXITING and p->mm just doesn't seem to > make any sense. > Right, it doesn't (and I recently removed testing the combination from select_bad_process() in -mm). The check for PF_EXITING in the oom killer is purely to avoid needlessly killing tasks when something is already exiting and will (hopefully) be freeing its memory soon. If an eligible thread is found to be PF_EXITING, the oom killer will defer indefinitely unless it happens to be current. If it happens to be current, then it is automatically selected so it gets access to the needed memory reserves. We do need to ensure that behavior doesn't preempt any task from being killed if there's an eligible thread other than current that never actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch in -mm filters all threads without an ->mm). That can happen if mm->mmap_sem never gets released by a thread and that's why an earlier change that is already in your tree automatically gives current access to memory reserves immediately upon calling the oom killer if it has a pending SIGKILL. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 20:32 ` David Rientjes @ 2011-03-15 19:12 ` Oleg Nesterov 2011-03-15 19:51 ` David Rientjes 0 siblings, 1 reply; 17+ messages in thread From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On 03/14, David Rientjes wrote: > > On Mon, 14 Mar 2011, Linus Torvalds wrote: > > > The combination of testing PF_EXITING and p->mm just doesn't seem to > > make any sense. > > > > Right, it doesn't (and I recently removed testing the combination from > select_bad_process() in -mm). The check for PF_EXITING in the oom killer > is purely to avoid needlessly killing tasks when something is already > exiting Maybe 0/3 wasn't clear enough. This patches does not try to fix things, it only tries to close the hole in 2.6.38. But it was already released today. > and will (hopefully) be freeing its memory soon. This is not clear to me. When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61 and the changelog says: __oom_kill_task() is called to elevate the task's timeslice and give it access to memory reserves so that it may quickly exit. This privilege is unnecessary, however, if the task has already detached its mm. Now you are saing this is pointless. OK. I already said I do not understand this special case. Perhaps I'll ask the questions later. > If an eligible > thread is found to be PF_EXITING, The problem is, we can't trust per-thread PF_EXITING checks. But I guess we will discuss this more anyway. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-15 19:12 ` Oleg Nesterov @ 2011-03-15 19:51 ` David Rientjes 0 siblings, 0 replies; 17+ messages in thread From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Tue, 15 Mar 2011, Oleg Nesterov wrote: > When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61 > and the changelog says: > > __oom_kill_task() is called to elevate the task's timeslice and give it > access to memory reserves so that it may quickly exit. > > This privilege is unnecessary, however, if the task has already detached > its mm. > > Now you are saing this is pointless. > If you have the commit id, do a "git blame 8123681022", because I see a: 5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 222) if (!p->mm) 5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 223) continue; in select_bad_process() and it's also iterating over every thread: a49335cc (Paul Jackson 2005-09-06 15:18:09 -0700 215) do_each_thread(g, p) { It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in -mm reintroduced the filter for !p->mm in select_bad_process() which was still there when 81236810 was merged; it's a small optimization, though, to avoid races where the mm becomes detached between the process' selection in select_bad_process() and its kill in oom_kill_process(). > The problem is, we can't trust per-thread PF_EXITING checks. But I guess > we will discuss this more anyway. > My approach, as you saw with oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in -mm is to add exceptions to the oom killer when we can't trust that PF_EXITING will soon be exiting. I think that's a much more long-term maintainable solution instead of inferring the status of a thread based on external circumstances (such as number of threads in the thread group) that could easily change out from under us and once again break the oom killer. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov 2011-03-14 19:35 ` Linus Torvalds @ 2011-03-14 20:22 ` David Rientjes 2011-03-15 18:53 ` Oleg Nesterov 1 sibling, 1 reply; 17+ messages in thread From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Mon, 14 Mar 2011, Oleg Nesterov wrote: > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING. > This is very wrong by many reasons. In particular, this thread can > be the dead group leader. Check p->mm != NULL. > This is true only for the oom_kill_allocating_task sysctl where it is required in all cases to kill current; current won't be triggering the oom killer if it's dead. oom_kill_process() is called with the thread selected by select_bad_process() and that function will not return any thread if any eligible task is found to be PF_EXITING and is not current, or any eligible task is found to have TIF_MEMDIE. In other words, for this conditional to be true in oom_kill_process(), then p must be current and so it cannot be the dead group leader as specified in your changelog unless PF_EXITING gets set between select_bad_process() and the oom_kill_process() call: we don't care about that since it's in the exit path and we therefore want to give it access to memory reserves to quickly exit anyway and the check for PF_EXITING in select_bad_process() prevents any infinite loop of that task getting constantly reselected if it's dead. > Note: this is _not_ enough. Just a minimal fix. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > > mm/oom_kill.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100 > +++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100 > @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_ > * If the task is already exiting, don't alarm the sysadmin or kill > * its children or threads, just set TIF_MEMDIE so it can die quickly > */ > - if (p->flags & PF_EXITING) { > + if (p->flags & PF_EXITING && p->mm) { > set_tsk_thread_flag(p, TIF_MEMDIE); > boost_dying_task_prio(p, mem); > return 0; > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-14 20:22 ` David Rientjes @ 2011-03-15 18:53 ` Oleg Nesterov 2011-03-15 19:54 ` David Rientjes 0 siblings, 1 reply; 17+ messages in thread From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw) To: David Rientjes Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On 03/14, David Rientjes wrote: > > On Mon, 14 Mar 2011, Oleg Nesterov wrote: > > > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING. > > This is very wrong by many reasons. In particular, this thread can > > be the dead group leader. Check p->mm != NULL. > > > > This is true only for the oom_kill_allocating_task sysctl where it is > required in all cases to kill current; current won't be triggering the oom > killer if it's dead. > > oom_kill_process() is called with the thread selected by > select_bad_process() and that function will not return any thread if any > eligible task is found to be PF_EXITING and is not current, or any > eligible task is found to have TIF_MEMDIE. > > In other words, for this conditional to be true in oom_kill_process(), > then p must be current and so it cannot be the dead group leader as > specified in your changelog unless PF_EXITING gets set between > select_bad_process() and the oom_kill_process() call: we don't care about > that since it's in the exit path and we therefore want to give it access > to memory reserves to quickly exit anyway and the check for PF_EXITING in > select_bad_process() prevents any infinite loop of that task getting > constantly reselected if it's dead. Confused. I sent the test-case. OK, may be you meant the code in -mm, but I meant the current code. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-15 18:53 ` Oleg Nesterov @ 2011-03-15 19:54 ` David Rientjes 2011-03-15 21:16 ` Oleg Nesterov 0 siblings, 1 reply; 17+ messages in thread From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Tue, 15 Mar 2011, Oleg Nesterov wrote: > Confused. I sent the test-case. OK, may be you meant the code in -mm, > but I meant the current code. > This entire discussion, and your involvement in it, originated from these two patches being merged into -mm: oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch oom-skip-zombies-when-iterating-tasklist.patch So naturally I'm going to challenge your testcases with the latest -mm. If you wanted to suggest pushing these to 2.6.38 earlier, I don't think anyone would have disputed that -- I certainly wouldn't have since the first fixes a quite obvious panic that we've faced on a lot of our machines. It's not that big of a deal, though, since Andrew has targeted them for -stable and they're on schedule to be pushed to 2.6.38.x ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm 2011-03-15 19:54 ` David Rientjes @ 2011-03-15 21:16 ` Oleg Nesterov 0 siblings, 0 replies; 17+ messages in thread From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw) To: David Rientjes Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On 03/15, David Rientjes wrote: > > On Tue, 15 Mar 2011, Oleg Nesterov wrote: > > > Confused. I sent the test-case. OK, may be you meant the code in -mm, > > but I meant the current code. > > > > This entire discussion, and your involvement in it, originated from these > two patches being merged into -mm: > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch > oom-skip-zombies-when-iterating-tasklist.patch Yes. This motivated me to look at the current code, and I was unpleasantly surprised. > So naturally I'm going to challenge your testcases with the latest -mm. Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot about the second patch. > If you wanted to suggest pushing these to 2.6.38 earlier, Yes. It was too late for 2.6.38. I thought we have more time before release. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies 2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov 2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov @ 2011-03-14 19:05 ` Oleg Nesterov 2011-03-14 20:50 ` David Rientjes 2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov 2 siblings, 1 reply; 17+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel select_bad_process() assumes that a TIF_MEMDIE process should go away. But it can only go away it its parent does wait(). Change this check to ignore the TIF_MEMDIE zombies. Note: this is _not_ enough. Just a minimal fix. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/oom_kill.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100 +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100 @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr * blocked waiting for another task which itself is waiting * for memory. Is there a better alternative? */ - if (test_tsk_thread_flag(p, TIF_MEMDIE)) + if (test_tsk_thread_flag(p, TIF_MEMDIE) && + !p->exit_state && thread_group_empty(p)) return ERR_PTR(-1UL); /* ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies 2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov @ 2011-03-14 20:50 ` David Rientjes 0 siblings, 0 replies; 17+ messages in thread From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Mon, 14 Mar 2011, Oleg Nesterov wrote: > select_bad_process() assumes that a TIF_MEMDIE process should go away. > But it can only go away it its parent does wait(). Change this check to > ignore the TIF_MEMDIE zombies. > The equivalent of this change would be to set TIF_MEMDIE for all threads in a thread group when choosing a process to kill; as we've already discussed in your first series of patches, that has the risk of fully depleting memory reserves and causing the kernel the deadlock. We want to limit TIF_MEMDIE to an oom killed task or to current when it is responding to a SIGKILL or already in the exit path because we know it's exiting and without memory reserves it may never exit. This patch is even more concerning, however, because select_bad_process() isn't even guaranteed to select a thread from the same thread group this time. > Note: this is _not_ enough. Just a minimal fix. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > > mm/oom_kill.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100 > +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100 > @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr > * blocked waiting for another task which itself is waiting > * for memory. Is there a better alternative? > */ > - if (test_tsk_thread_flag(p, TIF_MEMDIE)) > + if (test_tsk_thread_flag(p, TIF_MEMDIE) && > + !p->exit_state && thread_group_empty(p)) > return ERR_PTR(-1UL); > > /* > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic 2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov 2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov 2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov @ 2011-03-14 19:05 ` Oleg Nesterov 2011-03-14 20:41 ` David Rientjes 2 siblings, 1 reply; 17+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, "children has a different mm" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong too. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/oom_kill.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100 +++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100 @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_ struct mem_cgroup *mem, nodemask_t *nodemask, const char *message) { - struct task_struct *victim = p; + struct task_struct *victim; struct task_struct *child; - struct task_struct *t = p; - unsigned int victim_points = 0; + struct task_struct *t; + unsigned int victim_points; if (printk_ratelimit()) dump_header(p, gfp_mask, order, mem, nodemask); @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_ * parent. This attempts to lose the minimal amount of work done while * still freeing memory. */ + victim_points = oom_badness(p, mem, nodemask, totalpages); + victim = p; + t = p; do { list_for_each_entry(child, &t->children, sibling) { unsigned int child_points; + if (child->mm == t->mm) + continue; /* * oom_badness() returns 0 if the thread is unkillable */ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic 2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov @ 2011-03-14 20:41 ` David Rientjes 2011-03-15 19:21 ` Oleg Nesterov 0 siblings, 1 reply; 17+ messages in thread From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Mon, 14 Mar 2011, Oleg Nesterov wrote: > oom_kill_process() starts with victim_points == 0. This means that > (most likely) any child has more points and can be killed erroneously. > > Also, "children has a different mm" doesn't match the reality, we > should check child->mm != t->mm. This check is not exactly correct > if t->mm == NULL but this doesn't really matter, oom_kill_task() > will kill them anyway. > > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong > too. > There're two issues you're addressing in this patch. It only kills a child in place of its selected parent when: - the child has a higher badness score, and - it has a different ->mm. In the former case, NACK, we always want to sacrifice children regardless of their badness score (as long as it is non-zero) if it has a separate ->mm in place of its parent, otherwise webservers will be killed instead of one of their children serving a client, sshd could be killed instead of bash, etc. The behavior of the oom killer has always been to try to kill a child with its own ->mm first to avoid losing a large amount of work being done or unnecessarily killing a job scheduler, for example, when sacrificing a child would be satisfactory. It'll kill additional tasks, and perhaps even the parent later if it has no more children, if the oom condition persists. In the latter case, I agree, we should be testing if the child has a different ->mm before sacrificing it for its parent as the comment indicates it will. I proposed that exact change in "oom: avoid deferring oom killer if exiting task is being traced" posted to -mm a couple days ago. > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > > mm/oom_kill.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > --- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100 > +++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100 > @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_ > struct mem_cgroup *mem, nodemask_t *nodemask, > const char *message) > { > - struct task_struct *victim = p; > + struct task_struct *victim; > struct task_struct *child; > - struct task_struct *t = p; > - unsigned int victim_points = 0; > + struct task_struct *t; > + unsigned int victim_points; > > if (printk_ratelimit()) > dump_header(p, gfp_mask, order, mem, nodemask); > @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_ > * parent. This attempts to lose the minimal amount of work done while > * still freeing memory. > */ > + victim_points = oom_badness(p, mem, nodemask, totalpages); > + victim = p; > + t = p; > do { > list_for_each_entry(child, &t->children, sibling) { > unsigned int child_points; > > + if (child->mm == t->mm) > + continue; > /* > * oom_badness() returns 0 if the thread is unkillable > */ > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic 2011-03-14 20:41 ` David Rientjes @ 2011-03-15 19:21 ` Oleg Nesterov 0 siblings, 0 replies; 17+ messages in thread From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw) To: David Rientjes Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On 03/14, David Rientjes wrote: > > On Mon, 14 Mar 2011, Oleg Nesterov wrote: > > > oom_kill_process() starts with victim_points == 0. This means that > > (most likely) any child has more points and can be killed erroneously. > > > > Also, "children has a different mm" doesn't match the reality, we > > should check child->mm != t->mm. This check is not exactly correct > > if t->mm == NULL but this doesn't really matter, oom_kill_task() > > will kill them anyway. > > > > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong > > too. > > > > There're two issues you're addressing in this patch. It only kills a > child in place of its selected parent when: > > - the child has a higher badness score, and > > - it has a different ->mm. > > In the former case, NACK, we always want to sacrifice children regardless > of their badness score (as long as it is non-zero) if it has a separate > ->mm in place of its parent, Ah. So this was intentional? OK. I was hypnotized by the security implications, and this looked so "obviously wrong" to me. But, of course I can't judge when it comes to oom's heuristic, and you certainly know better. So, thanks for correcting me. Just a question... what about oom_kill_allocating_task? Probably Documentation/sysctl/vm.txt should be updated. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
[parent not found: <AANLkTikOdG7iTKDKq5mCYhcVz-rgZ_F2Ja78oBCOCQ91@mail.gmail.com>]
[parent not found: <alpine.DEB.2.00.1103141512310.4425@chino.kir.corp.google.com>]
[parent not found: <20110315194737.GE21640@redhat.com>]
[parent not found: <alpine.DEB.2.00.1103151259380.558@chino.kir.corp.google.com>]
[parent not found: <20110315212754.GB28117@redhat.com>]
[parent not found: <alpine.DEB.2.00.1103151530200.5099@chino.kir.corp.google.com>]
[parent not found: <20110316155310.GA9797@redhat.com>]
[parent not found: <alpine.DEB.2.00.1103161220110.9710@chino.kir.corp.google.com>]
[parent not found: <20110316202131.GA20790@redhat.com>]
[parent not found: <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>]
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies [not found] ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com> @ 2011-03-18 18:32 ` Oleg Nesterov 0 siblings, 0 replies; 17+ messages in thread From: Oleg Nesterov @ 2011-03-18 18:32 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-kernel Sorry for delay... Remove security, this has nothing to do with the released code. But please see the question at the end... On 03/16, David Rientjes wrote: > > On Wed, 16 Mar 2011, Oleg Nesterov wrote: > > > > > do { > > > > list_for_each_entry(child, &t->children, sibling) { > > > > unsigned int child_points; > > > > > > > > /* > > > > * oom_badness() returns 0 if the thread is unkillable > > > > */ > > > > child_points = oom_badness(child, mem, nodemask, > > > > totalpages); > > > > > > > > child->mm can be NULL. > > > > > > > > > > So child_points would be 0 here. > > > > Why? oom_badness() checks the whole group. group_leader can exit and > > pass exit_mm(). But it still the leader and "represents" the whole > > group even if it exits as thread. > > > > If there are still child threads that have valid mm's, then they are > eligible for oom kill and all threads sharing that mm will be killed once > passed to oom_kill_task(). That may be the same as the selected task, p, > passed to oom_kill_process() but all threads that share the mm would have > to be killed anyway to free memory. Not sure I understand... Yes, oom_kill_task() kills all processes that share the same ->mm. (but to remind, "q->mm == mm" is not right for the same reason, q->mm can be NULL). But the code above should filter out the tasks with the same ->mm. It can't. OK, this is really minor. CLONE_VM processes with the dead leader, this is really exotics. > > > > if (p->flags & PF_EXITING) { > > > > set_tsk_thread_flag(p, TIF_MEMDIE); > > > > boost_dying_task_prio(p, mem); > > > > return 0; > > > > } > > > > > > > > in oom_kill_process() whith -mm patches? > > > > > > > > We know that this thread (not process) was chosen by select_bad_process() > > > > and p->mm != NULL. As Linus rightly pointed, this means this code can only > > > > work in the small window between exit_signals() and exit_mm(). > > > > > > > > So, what is the point? > > > > > > > > > > Because there's no need to SIGKILL the task or emit anything to the kernel > > > log. We don't want anybody thinking that the oom killer killed it when it > > > was already exiting on its own. > > > > OK. But this case is very unlikely. And I am still trying to understand > > why this special case is important. But I can't. > > > > It's actually not unlikely at all if mm->mmap_sem is held. Do you mean OOM from with down_write(mmap_sem) ? OK, in this case we can see a lot of PF_EXITING && mm threads. But this means they are likely sleeping in exit_mm()->down_read(), how the code above can help? > > > The combination of testing PF_EXITING and p->mm just doesn't seem to > > > make any sense. > > > > > > > Right, it doesn't (and I recently removed testing the combination from > > select_bad_process() in -mm). > > > > How so? This is what we have now, no? > > > > It's not required functionally for the oom killer, OK, thanks. > If any other threads can't actually exit yet, > then they will automatically be selected when they invoke the oom killer > (we automatically select current if it is PF_EXITING and the oom killer > iterates over all threads in -mm) so we don't need to be concerned about > them stalling at this point. Again, it is unlikely that another thread triggers oom between exit_signals() and exit_mm(). And what "other threads" actually mean? If you mean that we already killed this process (iow, oom_kill_task() sent SIGKILL to any sub-thread in this group) then yes, this thread probably needs TIF_MEMDIE. But. In this case current won't call select_bad_process() at all. We have the fatal_signal_pending() check at the top of out_of_memory(), and this is the "special" case in oom_kill.c I can understand. I hope ;) Btw. fatal_signal_pending() is not really good... it can be false negative. signal_group_exit() looks better. > In the quote above, Linus was referring to testing PF_EXITING and p->mm in > oom_kill_process(). It doesn't make any sense if we have already filtered > p->mm in select_bad_process() No, I don't think this was the point. This was discussed assuming the current code, select_bad_process() doesn't filter !mm threads, and it is not per-thread. > and we don't want to needlessly kill any > children because p has executed exit_mm() between its selection and its > kill: it's on the exit path and will probably be freeing memory soon. OK, this is reasonable. And this is what I can understand. But this case looks unlikely, and I am not sure it is right, please see below. > While this code inspection is interesting, what would probably be more > interesting is if you have any test cases that are problematic on the > latest -mm tree I sent one. it wasn't tested, but should be problematic. Doesn't really matter, we can fix this. I am just trying to understand the new "per-thread" direction. I can't. OK. For example. Two threads T1 and T2. This process uses a lot of memory. 1. T2 does, say, do_brk() and triggers OOM 2. T2 calls out_of_memory->select_bad_process() and starts the main do_each_thread() loop. It finds T1, then T2. oom_badness() returns the same value, so select_bad_process() returns T1. 4. T1 exits, calls exit_mm() and sleeps on down_read(). 5. T2 calls oom_kill_process(), sees PF_EXITING, does set_tsk_thread_flag(T1, TIF_MEMDIE) and returns. Now. out_of_memory() will be called again, but select_bad_process() is fooled. It will see T1 before T2 and return ERR_PTR() because of T1 has TIF_MEMDIE. And T2 can't access the memory reserves because it lacks TIF_MEMDIE. No? Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-03-18 18:41 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <alpine.DEB.2.00.1103011108400.28110@chino.kir.corp.google.com>
[not found] ` <20110303100030.B936.A69D9226@jp.fujitsu.com>
[not found] ` <20110308134233.GA26884@redhat.com>
[not found] ` <alpine.DEB.2.00.1103081549530.27910@chino.kir.corp.google.com>
[not found] ` <20110309151946.dea51cde.akpm@linux-foundation.org>
[not found] ` <alpine.DEB.2.00.1103111142260.30699@chino.kir.corp.google.com>
[not found] ` <20110312123413.GA18351@redhat.com>
[not found] ` <20110312134341.GA27275@redhat.com>
[not found] ` <AANLkTinHGSb2_jfkwx=Wjv96phzPCjBROfCTFCKi4Wey@mail.gmail.com>
[not found] ` <20110313212726.GA24530@redhat.com>
2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
2011-03-14 19:35 ` Linus Torvalds
2011-03-14 20:31 ` Oleg Nesterov
2011-03-14 20:32 ` David Rientjes
2011-03-15 19:12 ` Oleg Nesterov
2011-03-15 19:51 ` David Rientjes
2011-03-14 20:22 ` David Rientjes
2011-03-15 18:53 ` Oleg Nesterov
2011-03-15 19:54 ` David Rientjes
2011-03-15 21:16 ` Oleg Nesterov
2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-14 20:50 ` David Rientjes
2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
2011-03-14 20:41 ` David Rientjes
2011-03-15 19:21 ` Oleg Nesterov
[not found] <AANLkTikOdG7iTKDKq5mCYhcVz-rgZ_F2Ja78oBCOCQ91@mail.gmail.com>
[not found] ` <alpine.DEB.2.00.1103141512310.4425@chino.kir.corp.google.com>
[not found] ` <20110315194737.GE21640@redhat.com>
[not found] ` <alpine.DEB.2.00.1103151259380.558@chino.kir.corp.google.com>
[not found] ` <20110315212754.GB28117@redhat.com>
[not found] ` <alpine.DEB.2.00.1103151530200.5099@chino.kir.corp.google.com>
[not found] ` <20110316155310.GA9797@redhat.com>
[not found] ` <alpine.DEB.2.00.1103161220110.9710@chino.kir.corp.google.com>
[not found] ` <20110316202131.GA20790@redhat.com>
[not found] ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>
2011-03-18 18:32 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox