* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies [not found] ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com> @ 2011-03-18 18:32 ` Oleg Nesterov 0 siblings, 0 replies; 5+ messages in thread From: Oleg Nesterov @ 2011-03-18 18:32 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-kernel Sorry for delay... Remove security, this has nothing to do with the released code. But please see the question at the end... On 03/16, David Rientjes wrote: > > On Wed, 16 Mar 2011, Oleg Nesterov wrote: > > > > > do { > > > > list_for_each_entry(child, &t->children, sibling) { > > > > unsigned int child_points; > > > > > > > > /* > > > > * oom_badness() returns 0 if the thread is unkillable > > > > */ > > > > child_points = oom_badness(child, mem, nodemask, > > > > totalpages); > > > > > > > > child->mm can be NULL. > > > > > > > > > > So child_points would be 0 here. > > > > Why? oom_badness() checks the whole group. group_leader can exit and > > pass exit_mm(). But it still the leader and "represents" the whole > > group even if it exits as thread. > > > > If there are still child threads that have valid mm's, then they are > eligible for oom kill and all threads sharing that mm will be killed once > passed to oom_kill_task(). That may be the same as the selected task, p, > passed to oom_kill_process() but all threads that share the mm would have > to be killed anyway to free memory. Not sure I understand... Yes, oom_kill_task() kills all processes that share the same ->mm. (but to remind, "q->mm == mm" is not right for the same reason, q->mm can be NULL). But the code above should filter out the tasks with the same ->mm. It can't. OK, this is really minor. CLONE_VM processes with the dead leader, this is really exotics. > > > > if (p->flags & PF_EXITING) { > > > > set_tsk_thread_flag(p, TIF_MEMDIE); > > > > boost_dying_task_prio(p, mem); > > > > return 0; > > > > } > > > > > > > > in oom_kill_process() whith -mm patches? > > > > > > > > We know that this thread (not process) was chosen by select_bad_process() > > > > and p->mm != NULL. As Linus rightly pointed, this means this code can only > > > > work in the small window between exit_signals() and exit_mm(). > > > > > > > > So, what is the point? > > > > > > > > > > Because there's no need to SIGKILL the task or emit anything to the kernel > > > log. We don't want anybody thinking that the oom killer killed it when it > > > was already exiting on its own. > > > > OK. But this case is very unlikely. And I am still trying to understand > > why this special case is important. But I can't. > > > > It's actually not unlikely at all if mm->mmap_sem is held. Do you mean OOM from with down_write(mmap_sem) ? OK, in this case we can see a lot of PF_EXITING && mm threads. But this means they are likely sleeping in exit_mm()->down_read(), how the code above can help? > > > The combination of testing PF_EXITING and p->mm just doesn't seem to > > > make any sense. > > > > > > > Right, it doesn't (and I recently removed testing the combination from > > select_bad_process() in -mm). > > > > How so? This is what we have now, no? > > > > It's not required functionally for the oom killer, OK, thanks. > If any other threads can't actually exit yet, > then they will automatically be selected when they invoke the oom killer > (we automatically select current if it is PF_EXITING and the oom killer > iterates over all threads in -mm) so we don't need to be concerned about > them stalling at this point. Again, it is unlikely that another thread triggers oom between exit_signals() and exit_mm(). And what "other threads" actually mean? If you mean that we already killed this process (iow, oom_kill_task() sent SIGKILL to any sub-thread in this group) then yes, this thread probably needs TIF_MEMDIE. But. In this case current won't call select_bad_process() at all. We have the fatal_signal_pending() check at the top of out_of_memory(), and this is the "special" case in oom_kill.c I can understand. I hope ;) Btw. fatal_signal_pending() is not really good... it can be false negative. signal_group_exit() looks better. > In the quote above, Linus was referring to testing PF_EXITING and p->mm in > oom_kill_process(). It doesn't make any sense if we have already filtered > p->mm in select_bad_process() No, I don't think this was the point. This was discussed assuming the current code, select_bad_process() doesn't filter !mm threads, and it is not per-thread. > and we don't want to needlessly kill any > children because p has executed exit_mm() between its selection and its > kill: it's on the exit path and will probably be freeing memory soon. OK, this is reasonable. And this is what I can understand. But this case looks unlikely, and I am not sure it is right, please see below. > While this code inspection is interesting, what would probably be more > interesting is if you have any test cases that are problematic on the > latest -mm tree I sent one. it wasn't tested, but should be problematic. Doesn't really matter, we can fix this. I am just trying to understand the new "per-thread" direction. I can't. OK. For example. Two threads T1 and T2. This process uses a lot of memory. 1. T2 does, say, do_brk() and triggers OOM 2. T2 calls out_of_memory->select_bad_process() and starts the main do_each_thread() loop. It finds T1, then T2. oom_badness() returns the same value, so select_bad_process() returns T1. 4. T1 exits, calls exit_mm() and sleeps on down_read(). 5. T2 calls oom_kill_process(), sees PF_EXITING, does set_tsk_thread_flag(T1, TIF_MEMDIE) and returns. Now. out_of_memory() will be called again, but select_bad_process() is fooled. It will see T1 before T2 and return ERR_PTR() because of T1 has TIF_MEMDIE. And T2 can't access the memory reserves because it lacks TIF_MEMDIE. No? Oleg. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] oom: prevent unnecessary oom kills or kernel panics @ 2011-03-03 1:20 KOSAKI Motohiro 2011-03-08 13:42 ` Oleg Nesterov 0 siblings, 1 reply; 5+ messages in thread From: KOSAKI Motohiro @ 2011-03-03 1:20 UTC (permalink / raw) To: David Rientjes Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Oleg Nesterov, Hugh Dickins, linux-mm Hi > This patch revents unnecessary oom kills or kernel panics by reverting > two commits: > > 495789a5 (oom: make oom_score to per-process value) > cef1d352 (oom: multi threaded process coredump don't make deadlock) > > First, 495789a5 (oom: make oom_score to per-process value) ignores the > fact that all threads in a thread group do not necessarily exit at the > same time. > > It is imperative that select_bad_process() detect threads that are in the > exit path, specifically those with PF_EXITING set, to prevent needlessly > killing additional tasks. to prevent? No, it is not a reason of PF_EXITING exist. > If a process is oom killed and the thread > group leader exits, select_bad_process() cannot detect the other threads > that are PF_EXITING by iterating over only processes. Thus, it currently > chooses another task unnecessarily for oom kill or panics the machine > when nothing else is eligible. > > By iterating over threads instead, it is possible to detect threads that > are exiting and nominate them for oom kill so they get access to memory > reserves. In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore PF_EXITING is not a sign of memory freeing in nearly future. If other CPUs don't try to free memory, prevent oom and waiting makes deadlock. Thus, I suggest to remove PF_EXITING check completely. > > Second, cef1d352 (oom: multi threaded process coredump don't make > deadlock) erroneously avoids making the oom killer a no-op when an > eligible thread other than current isfound to be exiting. We want to > detect this situation so that we may allow that exiting thread time to > exit and free its memory; if it is able to exit on its own, that should > free memory so current is no loner oom. If it is not able to exit on its > own, the oom killer will nominate it for oom kill which, in this case, > only means it will get access to memory reserves. > > Without this change, it is easy for the oom killer to unnecessarily > target tasks when all threads of a victim don't exit before the thread > group leader or, in the worst case, panic the machine. > You missed deadlock is more worse than panic. And again, task overkill is a part of OOM killer design. it is necessary to avoid deadlock. If you want to change this spec, you need to remove deadlock change at first. > Signed-off-by: David Rientjes <rientjes@google.com> > --- > mm/oom_kill.c | 8 ++++---- > 1 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -292,11 +292,11 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > unsigned long totalpages, struct mem_cgroup *mem, > const nodemask_t *nodemask) > { > - struct task_struct *p; > + struct task_struct *g, *p; > struct task_struct *chosen = NULL; > *ppoints = 0; > > - for_each_process(p) { > + do_each_thread(g, p) { > unsigned int points; > > if (oom_unkillable_task(p, mem, nodemask)) > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > * the process of exiting and releasing its resources. > * Otherwise we could get an easy OOM deadlock. > */ > - if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) { > + if ((p->flags & PF_EXITING) && p->mm) { > > if (p != current) > return ERR_PTR(-1UL); > > @@ -337,7 +337,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > chosen = p; > *ppoints = points; > } > - } > + } while_each_thread(g, p); > > return chosen; > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] oom: prevent unnecessary oom kills or kernel panics 2011-03-03 1:20 [patch] oom: prevent unnecessary oom kills or kernel panics KOSAKI Motohiro @ 2011-03-08 13:42 ` Oleg Nesterov 2011-03-08 23:57 ` David Rientjes 0 siblings, 1 reply; 5+ messages in thread From: Oleg Nesterov @ 2011-03-08 13:42 UTC (permalink / raw) To: KOSAKI Motohiro Cc: David Rientjes, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, Andrey Vagin On 03/03, KOSAKI Motohiro wrote: > > > By iterating over threads instead, it is possible to detect threads that > > are exiting and nominate them for oom kill so they get access to memory > > reserves. > > In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore > PF_EXITING is not a sign of memory freeing in nearly future. If other > CPUs don't try to free memory, prevent oom and waiting makes deadlock. I agree. I don't understand this patch. And. Instead of moving to for_each_mm() this patch moves the logic back, to for_each_thread(). > Thus, I suggest to remove PF_EXITING check completely. Again, this seems better to me but I do not really understand oom killer's heuristic. Perhaps this check helps with some workloads. I tried to avoid this discussion because I have nothing new to add, and the previous discussion was painful. But since this patch was merged into -mm, > > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > > * the process of exiting and releasing its resources. > > * Otherwise we could get an easy OOM deadlock. > > */ > > - if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) { > > + if ((p->flags & PF_EXITING) && p->mm) { The previous check was not perfect, we know this. But with this patch applied, the simple program below disables oom-killer completely. select_bad_process() can never succeed. I think this patch should dropped. And another one, oom-skip-zombies-when-iterating-tasklist.patch should be dropped as well. Add Andrey. Oleg. #include <unistd.h> #include <signal.h> #include <pthread.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <assert.h> #include <stdio.h> void *tfunc(void* arg) { pause(); } int main(void) { int pid = fork(); if (!pid) { pthread_t thread; pthread_create(&thread, NULL, tfunc, NULL); pthread_create(&thread, NULL, tfunc, NULL); ptrace(PTRACE_TRACEME, 0,0,0); kill(getpid(), SIGSTOP); pthread_kill(thread, SIGQUIT); pause(); return 0; } assert(wait(NULL) == pid); assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEEXIT) == 0); assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0); wait(NULL); pause(); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] oom: prevent unnecessary oom kills or kernel panics 2011-03-08 13:42 ` Oleg Nesterov @ 2011-03-08 23:57 ` David Rientjes 2011-03-09 23:19 ` Andrew Morton 0 siblings, 1 reply; 5+ messages in thread From: David Rientjes @ 2011-03-08 23:57 UTC (permalink / raw) To: Oleg Nesterov Cc: KOSAKI Motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, Andrey Vagin On Tue, 8 Mar 2011, Oleg Nesterov wrote: > > > By iterating over threads instead, it is possible to detect threads that > > > are exiting and nominate them for oom kill so they get access to memory > > > reserves. > > > > In fact, PF_EXITING is a sing of *THREAD* exiting, not process. Therefore > > PF_EXITING is not a sign of memory freeing in nearly future. If other > > CPUs don't try to free memory, prevent oom and waiting makes deadlock. > > I agree. I don't understand this patch. > Using for_each_process() does not consider threads that have failed to exit after the oom killed parent and, thus, we select another innocent task to kill when we're really just waiting for those threads to exit (and perhaps they need memory reserves in the exit path) or, in the worst case, panic if there is nothing else eligible. The end result is that without this patch, we sometimes unnecessarily panic (and "sometimes" is defined as "many machines" for us) when nothing else is eligible for kill within an oom cpuset yet doing a do_each_thread() over that cpuset shows threads of previously oom killed parent that have yet to exit. > > > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > > > * the process of exiting and releasing its resources. > > > * Otherwise we could get an easy OOM deadlock. > > > */ > > > - if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) { > > > + if ((p->flags & PF_EXITING) && p->mm) { > > The previous check was not perfect, we know this. > > But with this patch applied, the simple program below disables oom-killer > completely. select_bad_process() can never succeed. > The program illustrates a problem that shouldn't be fixed in select_bad_process() but rather in oom_kill_process() when choosing an eligible child of the selected task to kill in place of its parent. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] oom: prevent unnecessary oom kills or kernel panics 2011-03-08 23:57 ` David Rientjes @ 2011-03-09 23:19 ` Andrew Morton 2011-03-11 19:45 ` David Rientjes 0 siblings, 1 reply; 5+ messages in thread From: Andrew Morton @ 2011-03-09 23:19 UTC (permalink / raw) To: David Rientjes Cc: Oleg Nesterov, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, Andrey Vagin On Tue, 8 Mar 2011 15:57:36 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > > > > @@ -324,7 +324,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, > > > > * the process of exiting and releasing its resources. > > > > * Otherwise we could get an easy OOM deadlock. > > > > */ > > > > - if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) { > > > > + if ((p->flags & PF_EXITING) && p->mm) { > > > > The previous check was not perfect, we know this. > > > > But with this patch applied, the simple program below disables oom-killer > > completely. select_bad_process() can never succeed. > > > > The program illustrates a problem that shouldn't be fixed in > select_bad_process() but rather in oom_kill_process() when choosing an > eligible child of the selected task to kill in place of its parent. If Oleg's test program cause a hang with oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't cause a hang without oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a big problem for oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] oom: prevent unnecessary oom kills or kernel panics 2011-03-09 23:19 ` Andrew Morton @ 2011-03-11 19:45 ` David Rientjes 2011-03-12 12:34 ` Oleg Nesterov 0 siblings, 1 reply; 5+ messages in thread From: David Rientjes @ 2011-03-11 19:45 UTC (permalink / raw) To: Andrew Morton Cc: Oleg Nesterov, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, Andrey Vagin On Wed, 9 Mar 2011, Andrew Morton wrote: > If Oleg's test program cause a hang with > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't > cause a hang without > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a > big problem for > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no? > It's a problem, but not because of oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch. If we don't have this patch, then we have a trivial panic when an oom kill occurs in a cpuset with no other eligible processes, the oom killed thread group leader exits but its other threads do not and they trigger oom kills themselves. for_each_process() does not iterate over these threads and so it finds no eligible threads to kill and then panics (and we have many examples of that happening in production). I'll look at Oleg's test case and see what can be done to fix that condition, but the answer isn't to ignore eligible threads that can be killed. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [patch] oom: prevent unnecessary oom kills or kernel panics 2011-03-11 19:45 ` David Rientjes @ 2011-03-12 12:34 ` Oleg Nesterov 2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov 0 siblings, 1 reply; 5+ messages in thread From: Oleg Nesterov @ 2011-03-12 12:34 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, Andrey Vagin On 03/11, David Rientjes wrote: > > On Wed, 9 Mar 2011, Andrew Morton wrote: > > > If Oleg's test program cause a hang with > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't > > cause a hang without > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a > > big problem for > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no? > > > > It's a problem, but not because of > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch. It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm thread is going to free the memory. > If we don't > have this patch, then we have a trivial panic when an oom kill occurs in a > cpuset with no other eligible processes, the oom killed thread group > leader exits It is not clear what "leader exits" actually mean. OK, perhaps you mean its ->mm == NULL. > but its other threads do not and they trigger oom kills > themselves. for_each_process() does not iterate over these threads and so > it finds no eligible threads to kill and then panics Could you explain what do you mean? No need to kill these threads, they are already killed, we should wait until they all exit. > I'll look at Oleg's test case > and see what can be done to fix that condition, but the answer isn't to > ignore eligible threads that can be killed. Once again, they are already killed. Or I do not understand what you meant. Could you please explain the problem in more details? Also. Could you please look at the patches I sent? [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check Note also the note about "p == current" check. it should be fixed too. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes 2011-03-12 12:34 ` Oleg Nesterov @ 2011-03-12 13:43 ` Oleg Nesterov 2011-03-12 19:40 ` Hugh Dickins 0 siblings, 1 reply; 5+ messages in thread From: Oleg Nesterov @ 2011-03-12 13:43 UTC (permalink / raw) To: Andrew Morton Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, Andrey Vagin, David Rientjes On 03/12, Oleg Nesterov wrote: > > On 03/11, David Rientjes wrote: > > > > On Wed, 9 Mar 2011, Andrew Morton wrote: > > > > > If Oleg's test program cause a hang with > > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't > > > cause a hang without > > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a > > > big problem for > > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no? > > > > > > > It's a problem, but not because of > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch. > > It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm > thread is going to free the memory. > > > If we don't > > have this patch, then we have a trivial panic when an oom kill occurs in a > > cpuset with no other eligible processes, the oom killed thread group > > leader exits > > It is not clear what "leader exits" actually mean. OK, perhaps you mean > its ->mm == NULL. > > > but its other threads do not and they trigger oom kills > > themselves. for_each_process() does not iterate over these threads and so > > it finds no eligible threads to kill and then panics > > Could you explain what do you mean? No need to kill these threads, they > are already killed, we should wait until they all exit. > > > I'll look at Oleg's test case > > and see what can be done to fix that condition, but the answer isn't to > > ignore eligible threads that can be killed. > > Once again, they are already killed. Or I do not understand what you meant. > > Could you please explain the problem in more details? > > > Also. Could you please look at the patches I sent? > > [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE > [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check Cough. And both were not right, while_each_thread(p, t) needs the properly initialized "t". At least I warned they were not tested ;) > Note also the note about "p == current" check. it should be fixed too. I am resending the fixes above plus the new one. David, Kosaki, what do you think? Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes 2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov @ 2011-03-12 19:40 ` Hugh Dickins 2011-03-13 21:27 ` Oleg Nesterov 0 siblings, 1 reply; 5+ messages in thread From: Hugh Dickins @ 2011-03-12 19:40 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm, Andrey Vagin, David Rientjes Hi Oleg, On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > On 03/12, Oleg Nesterov wrote: >> >> On 03/11, David Rientjes wrote: >> > >> > On Wed, 9 Mar 2011, Andrew Morton wrote: >> > >> > > If Oleg's test program cause a hang with >> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't >> > > cause a hang without >> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a >> > > big problem for >> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no? >> > > >> > >> > It's a problem, but not because of >> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch. >> >> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm >> thread is going to free the memory. >> >> > If we don't >> > have this patch, then we have a trivial panic when an oom kill occurs in a >> > cpuset with no other eligible processes, the oom killed thread group >> > leader exits >> >> It is not clear what "leader exits" actually mean. OK, perhaps you mean >> its ->mm == NULL. >> >> > but its other threads do not and they trigger oom kills >> > themselves. for_each_process() does not iterate over these threads and so >> > it finds no eligible threads to kill and then panics >> >> Could you explain what do you mean? No need to kill these threads, they >> are already killed, we should wait until they all exit. >> >> > I'll look at Oleg's test case >> > and see what can be done to fix that condition, but the answer isn't to >> > ignore eligible threads that can be killed. >> >> Once again, they are already killed. Or I do not understand what you meant. >> >> Could you please explain the problem in more details? >> >> >> Also. Could you please look at the patches I sent? >> >> [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE >> [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check > > Cough. And both were not right, while_each_thread(p, t) needs the properly > initialized "t". At least I warned they were not tested ;) > >> Note also the note about "p == current" check. it should be fixed too. > > I am resending the fixes above plus the new one. I've spent much of the week building up to join in, but the more I look around, the more I find to say or investigate, and therefore never quite get to write the mail. Let this be a placeholder, that I probably disagree (in an amicable way!) with all of you, and maybe I'll finally manage to collect my thoughts into mail later today. I guess my main point will be that TIF_MEMDIE serves a number of slightly different, even conflicting, purposes; and one of those purposes, which present company seems to ignore repeatedly, is to serialize access to final reserves of memory - as a comment by Nick in select_bad_process() makes clear. (This is distinct from the serialization to avoid OOM-killing rampage.) We _might_ choose to abandon that, but if so, it should be a decision, not an oversight. So I cannot blindly agree with just setting TIF_MEMDIE on more and more tasks, even if they share the same mm. I wonder if use of your find_lock_task_mm() in select_bad_process() might bring together my wish to continue serialization, David's wish to avoid stupid panics, and your wish to avoid deadlocks. Though any serialization has a risk of deadlock: we probably need to weigh up how realistic different cases are. Which brings me neatly to your little pthread_create, ptrace proggy... I dare say you and Kosaki and David know exactly what it's doing and why it's a problem, but even after repeated skims of the ptrace manpage, I'll admit to not having a clue, nor the inclination to run and then debug it to find out the answer. Please, Oleg, would you mind very much explaining it to me? I don't even know if the double pthread_create is a vital part of the scheme, or just a typo. I see it doesn't even allocate any special memory, so I assume it leaves a PF_EXITING around forever, but I couldn't quite see how (with PF_EXITING being set after the tracehook_report_exit). And I wonder if a similar case can be constructed to deadlock the for_each_process version of select_bad_process(). I worry more about someone holding a reference to the mm via /proc (I see memory allocations after getting the mm). Thanks; until later, Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes 2011-03-12 19:40 ` Hugh Dickins @ 2011-03-13 21:27 ` Oleg Nesterov 2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov 0 siblings, 1 reply; 5+ messages in thread From: Oleg Nesterov @ 2011-03-13 21:27 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm, Andrey Vagin, David Rientjes Hi Hugh, On 03/12, Hugh Dickins wrote: > > On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@redhat.com> wrote: > >> > >> Also. Could you please look at the patches I sent? > >> > >> [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE > >> [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check > > > > Cough. And both were not right, while_each_thread(p, t) needs the properly > > initialized "t". At least I warned they were not tested ;) > > > >> Note also the note about "p == current" check. it should be fixed too. > > > > I am resending the fixes above plus the new one. > > I've spent much of the week building up to join in, but the more I > look around, the more I find to say or investigate, and therefore > never quite get to write the mail. Let this be a placeholder, that I > probably disagree (in an amicable way!) with all of you, and maybe > I'll finally manage to collect my thoughts into mail later today. Thanks for looking into this! Before I say anything else, I'd like to repeat that I am not going to argue with the nack. The only reason I sent these patches is that I can not understand David's patch at all. I mean, I can not even understand the problems it should address. Yet I think the patch is not right. So I just tried to guess why this change helps, and suggest the alternatives for review/testing. > I guess my main point will be that TIF_MEMDIE serves a number of > slightly different, even conflicting, purposes; Yes. And let me repeat, I do not pretend understand it. However, I bet the usage of TIF_MEMDIE is wrong. > and one of those > purposes, which present company seems to ignore repeatedly, is to > serialize access to final reserves of memory I don't quite understand "serialize" above. I hope you didn't mean that only one thread can have TIF_MEMDIE to avoid the races... But yes, I understand that "a lot" of TIF_MEMDIE thread can abuse __alloc_pages_high_priority/etc. At least, I hope I understand ;) - as a comment by Nick in > select_bad_process() makes clear. Oh. This comment is not clear at all. * This task already has access to memory reserves and is * being killed. which task? it is quite possible that this task is already dead/released. Or it is just exiting without memory allocations. This is group leader. All we know is that it has task_struct, no more. * Don't allow any other task access to the * memory reserve. Which other tasks? What about the sub-threads of the killed process? It is quite possible that another thread triggered OOM and needs TIF_MEMDIE to access the memory reserves. And even this is not consistent. sysctl_oom_kill_allocating_task does oom_kill_process(current) which may be non-leader. oom_kill_process()->set_tsk_thread_flag(p, TIF_MEMDIE) is simply wrong. The _trivial_ exploit (distinct from this one) can kill the system. And worse! I showed this exploit many times (most probably off-list but all were cc'ed). This was already fixed (iirc), and know I see we have it again. This because almost every PF_EXITING check in oom_kill.c is wrong. I'll return to this tomorrow. > We _might_ choose to abandon that, but if so, it should be a decision, > not an oversight. So I cannot blindly agree with just setting > TIF_MEMDIE on more and more tasks, OK, lets not do this. > wonder if use of your find_lock_task_mm() in select_bad_process() > might bring together my wish to continue serialization, David's wish > to avoid stupid panics, and your wish to avoid deadlocks. Hmm. Could you explain? > but even after repeated skims of the ptrace manpage, I'll admit to > not having a clue, nor the inclination to run and then debug it to > find out the answer. Ahh, sorry. I didn't explain what this test-case does... Because we discussed this many times before, iirc. > I don't even know if the double pthread_create is > a vital part of the scheme it is, > so I assume it leaves a PF_EXITING around > forever, More precisely, it creates the PF_EXITING thread with ->mm != NULL. It never goes away (unless you kill the tracer, of course). > but I couldn't quite see how (with PF_EXITING being set after > the tracehook_report_exit). Yes. But note that it creates 2 threads, and the second one hangs in exit_mm() before clearing ->mm while the 3rd thread can waits for the 1st one which should enter exit_mm() and participate. But this is not important, you can ignore the actual details. > And I wonder if a similar case can be > constructed to deadlock the for_each_process version of > select_bad_process(). Unfortunately yes. This is documented in the changelog of 2/2. We should fix the problems with the coredumps. Hmm, we already had some patches, but they were forgotten/ignored. But at least we shouldn't move back and assume that "PF_EXITING && mm" means "we will have more memory soon". I'll return tomorrow. At first glance, the new patch from David has the same problem, but I am not sure. Oleg. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 0/3 for 2.6.38] oom: fixes 2011-03-13 21:27 ` Oleg Nesterov @ 2011-03-14 19:04 ` Oleg Nesterov 2011-03-14 19:05 ` Oleg Nesterov 0 siblings, 1 reply; 5+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel On 03/13, Oleg Nesterov wrote: > > The _trivial_ exploit (distinct from this one) can kill the system. > ... > I'll return to this tomorrow. I am going to give up, I do not understand the changes in -mm ;) But I think we have other problems which should be fixed first. Note that each fix has the "this is _not_ enough" comment. I was going to _try_ to do something more complete later, but since we are going to add more hacks^W subtle changes... lets fix the obvious bugs first. Add Linus. Hopefully the changes are simple enough. But once again, we need more. Oleg. ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies 2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov @ 2011-03-14 19:05 ` Oleg Nesterov 0 siblings, 0 replies; 5+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel select_bad_process() assumes that a TIF_MEMDIE process should go away. But it can only go away it its parent does wait(). Change this check to ignore the TIF_MEMDIE zombies. Note: this is _not_ enough. Just a minimal fix. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/oom_kill.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100 +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100 @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr * blocked waiting for another task which itself is waiting * for memory. Is there a better alternative? */ - if (test_tsk_thread_flag(p, TIF_MEMDIE)) + if (test_tsk_thread_flag(p, TIF_MEMDIE) && + !p->exit_state && thread_group_empty(p)) return ERR_PTR(-1UL); /* ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies @ 2011-03-14 19:05 ` Oleg Nesterov 0 siblings, 0 replies; 5+ messages in thread From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw) To: Hugh Dickins, Linus Torvalds Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel select_bad_process() assumes that a TIF_MEMDIE process should go away. But it can only go away it its parent does wait(). Change this check to ignore the TIF_MEMDIE zombies. Note: this is _not_ enough. Just a minimal fix. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- mm/oom_kill.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100 +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100 @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr * blocked waiting for another task which itself is waiting * for memory. Is there a better alternative? */ - if (test_tsk_thread_flag(p, TIF_MEMDIE)) + if (test_tsk_thread_flag(p, TIF_MEMDIE) && + !p->exit_state && thread_group_empty(p)) return ERR_PTR(-1UL); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies 2011-03-14 19:05 ` Oleg Nesterov @ 2011-03-14 20:50 ` David Rientjes -1 siblings, 0 replies; 5+ messages in thread From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Mon, 14 Mar 2011, Oleg Nesterov wrote: > select_bad_process() assumes that a TIF_MEMDIE process should go away. > But it can only go away it its parent does wait(). Change this check to > ignore the TIF_MEMDIE zombies. > The equivalent of this change would be to set TIF_MEMDIE for all threads in a thread group when choosing a process to kill; as we've already discussed in your first series of patches, that has the risk of fully depleting memory reserves and causing the kernel the deadlock. We want to limit TIF_MEMDIE to an oom killed task or to current when it is responding to a SIGKILL or already in the exit path because we know it's exiting and without memory reserves it may never exit. This patch is even more concerning, however, because select_bad_process() isn't even guaranteed to select a thread from the same thread group this time. > Note: this is _not_ enough. Just a minimal fix. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > > mm/oom_kill.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100 > +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100 > @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr > * blocked waiting for another task which itself is waiting > * for memory. Is there a better alternative? > */ > - if (test_tsk_thread_flag(p, TIF_MEMDIE)) > + if (test_tsk_thread_flag(p, TIF_MEMDIE) && > + !p->exit_state && thread_group_empty(p)) > return ERR_PTR(-1UL); > > /* > > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies @ 2011-03-14 20:50 ` David Rientjes 0 siblings, 0 replies; 5+ messages in thread From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw) To: Oleg Nesterov Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm, linux-kernel On Mon, 14 Mar 2011, Oleg Nesterov wrote: > select_bad_process() assumes that a TIF_MEMDIE process should go away. > But it can only go away it its parent does wait(). Change this check to > ignore the TIF_MEMDIE zombies. > The equivalent of this change would be to set TIF_MEMDIE for all threads in a thread group when choosing a process to kill; as we've already discussed in your first series of patches, that has the risk of fully depleting memory reserves and causing the kernel the deadlock. We want to limit TIF_MEMDIE to an oom killed task or to current when it is responding to a SIGKILL or already in the exit path because we know it's exiting and without memory reserves it may never exit. This patch is even more concerning, however, because select_bad_process() isn't even guaranteed to select a thread from the same thread group this time. > Note: this is _not_ enough. Just a minimal fix. > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > > mm/oom_kill.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100 > +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100 > @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr > * blocked waiting for another task which itself is waiting > * for memory. Is there a better alternative? > */ > - if (test_tsk_thread_flag(p, TIF_MEMDIE)) > + if (test_tsk_thread_flag(p, TIF_MEMDIE) && > + !p->exit_state && thread_group_empty(p)) > return ERR_PTR(-1UL); > > /* > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2011-03-18 18:41 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <AANLkTikOdG7iTKDKq5mCYhcVz-rgZ_F2Ja78oBCOCQ91@mail.gmail.com>
[not found] ` <alpine.DEB.2.00.1103141512310.4425@chino.kir.corp.google.com>
[not found] ` <20110315194737.GE21640@redhat.com>
[not found] ` <alpine.DEB.2.00.1103151259380.558@chino.kir.corp.google.com>
[not found] ` <20110315212754.GB28117@redhat.com>
[not found] ` <alpine.DEB.2.00.1103151530200.5099@chino.kir.corp.google.com>
[not found] ` <20110316155310.GA9797@redhat.com>
[not found] ` <alpine.DEB.2.00.1103161220110.9710@chino.kir.corp.google.com>
[not found] ` <20110316202131.GA20790@redhat.com>
[not found] ` <alpine.DEB.2.00.1103161332490.11002@chino.kir.corp.google.com>
2011-03-18 18:32 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-03 1:20 [patch] oom: prevent unnecessary oom kills or kernel panics KOSAKI Motohiro
2011-03-08 13:42 ` Oleg Nesterov
2011-03-08 23:57 ` David Rientjes
2011-03-09 23:19 ` Andrew Morton
2011-03-11 19:45 ` David Rientjes
2011-03-12 12:34 ` Oleg Nesterov
2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
2011-03-12 19:40 ` Hugh Dickins
2011-03-13 21:27 ` Oleg Nesterov
2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-14 19:05 ` Oleg Nesterov
2011-03-14 20:50 ` David Rientjes
2011-03-14 20:50 ` David Rientjes
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.