* [PATCH 0/3 for 2.6.38] oom: fixes
[not found] ` <20110313212726.GA24530@redhat.com>
@ 2011-03-14 19:04 ` Oleg Nesterov
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.
I am going to give up, I do not understand the changes in -mm ;)
But I think we have other problems which should be fixed first.
Note that each fix has the "this is _not_ enough" comment. I was
going to _try_ to do something more complete later, but since we
are going to add more hacks^W subtle changes... lets fix the obvious
bugs first.
Add Linus. Hopefully the changes are simple enough. But once again,
we need more.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
@ 2011-03-14 19:04 ` Oleg Nesterov
2011-03-14 19:35 ` Linus Torvalds
2011-03-14 20:22 ` David Rientjes
2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
2 siblings, 2 replies; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
This is very wrong by many reasons. In particular, this thread can
be the dead group leader. Check p->mm != NULL.
Note: this is _not_ enough. Just a minimal fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100
@@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
- if (p->flags & PF_EXITING) {
+ if (p->flags & PF_EXITING && p->mm) {
set_tsk_thread_flag(p, TIF_MEMDIE);
boost_dying_task_prio(p, mem);
return 0;
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
@ 2011-03-14 19:05 ` Oleg Nesterov
2011-03-14 20:50 ` David Rientjes
2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
2 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away it its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.
Note: this is _not_ enough. Just a minimal fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
* blocked waiting for another task which itself is waiting
* for memory. Is there a better alternative?
*/
- if (test_tsk_thread_flag(p, TIF_MEMDIE))
+ if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+ !p->exit_state && thread_group_empty(p))
return ERR_PTR(-1UL);
/*
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
@ 2011-03-14 19:05 ` Oleg Nesterov
2011-03-14 20:41 ` David Rientjes
2 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.
Also, "children has a different mm" doesn't match the reality, we
should check child->mm != t->mm. This check is not exactly correct
if t->mm == NULL but this doesn't really matter, oom_kill_task()
will kill them anyway.
Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
@@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim = p;
+ struct task_struct *victim;
struct task_struct *child;
- struct task_struct *t = p;
- unsigned int victim_points = 0;
+ struct task_struct *t;
+ unsigned int victim_points;
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
+ victim_points = oom_badness(p, mem, nodemask, totalpages);
+ victim = p;
+ t = p;
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
+ if (child->mm == t->mm)
+ continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
@ 2011-03-14 19:35 ` Linus Torvalds
2011-03-14 20:31 ` Oleg Nesterov
2011-03-14 20:32 ` David Rientjes
2011-03-14 20:22 ` David Rientjes
1 sibling, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
Explain more, please. Maybe I'm missing some context because I wasn't
cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
and exit_mm() is called almost immediately afterwards which will set
p->mm to NULL.
So afaik, this will basically just remove the whole point of the code
entirely - so why not remove it then?
The combination of testing PF_EXITING and p->mm just doesn't seem to
make any sense.
Linus
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
2011-03-14 19:35 ` Linus Torvalds
@ 2011-03-14 20:22 ` David Rientjes
2011-03-15 18:53 ` Oleg Nesterov
1 sibling, 1 reply; 16+ messages in thread
From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
>
This is true only for the oom_kill_allocating_task sysctl where it is
required in all cases to kill current; current won't be triggering the oom
killer if it's dead.
oom_kill_process() is called with the thread selected by
select_bad_process() and that function will not return any thread if any
eligible task is found to be PF_EXITING and is not current, or any
eligible task is found to have TIF_MEMDIE.
In other words, for this conditional to be true in oom_kill_process(),
then p must be current and so it cannot be the dead group leader as
specified in your changelog unless PF_EXITING gets set between
select_bad_process() and the oom_kill_process() call: we don't care about
that since it's in the exit path and we therefore want to give it access
to memory reserves to quickly exit anyway and the check for PF_EXITING in
select_bad_process() prevents any infinite loop of that task getting
constantly reselected if it's dead.
> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100
> @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
> * If the task is already exiting, don't alarm the sysadmin or kill
> * its children or threads, just set TIF_MEMDIE so it can die quickly
> */
> - if (p->flags & PF_EXITING) {
> + if (p->flags & PF_EXITING && p->mm) {
> set_tsk_thread_flag(p, TIF_MEMDIE);
> boost_dying_task_prio(p, mem);
> return 0;
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:35 ` Linus Torvalds
@ 2011-03-14 20:31 ` Oleg Nesterov
2011-03-14 20:32 ` David Rientjes
1 sibling, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, Linus Torvalds wrote:
>
> On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
>
> Explain more, please. Maybe I'm missing some context because I wasn't
> cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
> and exit_mm() is called almost immediately afterwards which will set
> p->mm to NULL.
>
> So afaik, this will basically just remove the whole point of the code
> entirely - so why not remove it then?
I am afraid I am going to lie... But iirc I tried to remove this code
before. Can't find the previous discussion, probably I am wrong.
Anyway. I never understood why do we have this special case.
> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
To me, it doesn't make too much sense even if we do not check ->mm.
But. I _think_ the intent was to wait until this "exiting" process
does exit_mm() and frees the memory. This is like the
"the process of releasing memory " code in select_bad_process(). Once
again, this is only my speculation.
In any case, this patch doesn't pretend to be the right fix.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:35 ` Linus Torvalds
2011-03-14 20:31 ` Oleg Nesterov
@ 2011-03-14 20:32 ` David Rientjes
2011-03-15 19:12 ` Oleg Nesterov
1 sibling, 1 reply; 16+ messages in thread
From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Linus Torvalds wrote:
> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
>
Right, it doesn't (and I recently removed testing the combination from
select_bad_process() in -mm). The check for PF_EXITING in the oom killer
is purely to avoid needlessly killing tasks when something is already
exiting and will (hopefully) be freeing its memory soon. If an eligible
thread is found to be PF_EXITING, the oom killer will defer indefinitely
unless it happens to be current. If it happens to be current, then it is
automatically selected so it gets access to the needed memory reserves.
We do need to ensure that behavior doesn't preempt any task from being
killed if there's an eligible thread other than current that never
actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch
in -mm filters all threads without an ->mm). That can happen if
mm->mmap_sem never gets released by a thread and that's why an earlier
change that is already in your tree automatically gives current access to
memory reserves immediately upon calling the oom killer if it has a
pending SIGKILL.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
@ 2011-03-14 20:41 ` David Rientjes
2011-03-15 19:21 ` Oleg Nesterov
0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
>
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
>
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
>
There're two issues you're addressing in this patch. It only kills a
child in place of its selected parent when:
- the child has a higher badness score, and
- it has a different ->mm.
In the former case, NACK, we always want to sacrifice children regardless
of their badness score (as long as it is non-zero) if it has a separate
->mm in place of its parent, otherwise webservers will be killed instead
of one of their children serving a client, sshd could be killed instead of
bash, etc. The behavior of the oom killer has always been to try to kill
a child with its own ->mm first to avoid losing a large amount of work
being done or unnecessarily killing a job scheduler, for example, when
sacrificing a child would be satisfactory. It'll kill additional tasks,
and perhaps even the parent later if it has no more children, if the oom
condition persists.
In the latter case, I agree, we should be testing if the child has a
different ->mm before sacrificing it for its parent as the comment
indicates it will. I proposed that exact change in "oom: avoid deferring
oom killer if exiting task is being traced" posted to -mm a couple days
ago.
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
> struct mem_cgroup *mem, nodemask_t *nodemask,
> const char *message)
> {
> - struct task_struct *victim = p;
> + struct task_struct *victim;
> struct task_struct *child;
> - struct task_struct *t = p;
> - unsigned int victim_points = 0;
> + struct task_struct *t;
> + unsigned int victim_points;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
> * parent. This attempts to lose the minimal amount of work done while
> * still freeing memory.
> */
> + victim_points = oom_badness(p, mem, nodemask, totalpages);
> + victim = p;
> + t = p;
> do {
> list_for_each_entry(child, &t->children, sibling) {
> unsigned int child_points;
>
> + if (child->mm == t->mm)
> + continue;
> /*
> * oom_badness() returns 0 if the thread is unkillable
> */
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
@ 2011-03-14 20:50 ` David Rientjes
0 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away it its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
>
The equivalent of this change would be to set TIF_MEMDIE for all threads
in a thread group when choosing a process to kill; as we've already
discussed in your first series of patches, that has the risk of fully
depleting memory reserves and causing the kernel the deadlock. We want to
limit TIF_MEMDIE to an oom killed task or to current when it is responding
to a SIGKILL or already in the exit path because we know it's exiting and
without memory reserves it may never exit.
This patch is even more concerning, however, because select_bad_process()
isn't even guaranteed to select a thread from the same thread group this
time.
> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
> * blocked waiting for another task which itself is waiting
> * for memory. Is there a better alternative?
> */
> - if (test_tsk_thread_flag(p, TIF_MEMDIE))
> + if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> + !p->exit_state && thread_group_empty(p))
> return ERR_PTR(-1UL);
>
> /*
>
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 20:22 ` David Rientjes
@ 2011-03-15 18:53 ` Oleg Nesterov
2011-03-15 19:54 ` David Rientjes
0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
> >
>
> This is true only for the oom_kill_allocating_task sysctl where it is
> required in all cases to kill current; current won't be triggering the oom
> killer if it's dead.
>
> oom_kill_process() is called with the thread selected by
> select_bad_process() and that function will not return any thread if any
> eligible task is found to be PF_EXITING and is not current, or any
> eligible task is found to have TIF_MEMDIE.
>
> In other words, for this conditional to be true in oom_kill_process(),
> then p must be current and so it cannot be the dead group leader as
> specified in your changelog unless PF_EXITING gets set between
> select_bad_process() and the oom_kill_process() call: we don't care about
> that since it's in the exit path and we therefore want to give it access
> to memory reserves to quickly exit anyway and the check for PF_EXITING in
> select_bad_process() prevents any infinite loop of that task getting
> constantly reselected if it's dead.
Confused. I sent the test-case. OK, may be you meant the code in -mm,
but I meant the current code.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 20:32 ` David Rientjes
@ 2011-03-15 19:12 ` Oleg Nesterov
2011-03-15 19:51 ` David Rientjes
0 siblings, 1 reply; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw)
To: David Rientjes
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Linus Torvalds wrote:
>
> > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > make any sense.
> >
>
> Right, it doesn't (and I recently removed testing the combination from
> select_bad_process() in -mm). The check for PF_EXITING in the oom killer
> is purely to avoid needlessly killing tasks when something is already
> exiting
Maybe 0/3 wasn't clear enough. This patches does not try to fix things,
it only tries to close the hole in 2.6.38. But it was already released
today.
> and will (hopefully) be freeing its memory soon.
This is not clear to me.
When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
and the changelog says:
__oom_kill_task() is called to elevate the task's timeslice and give it
access to memory reserves so that it may quickly exit.
This privilege is unnecessary, however, if the task has already detached
its mm.
Now you are saing this is pointless.
OK. I already said I do not understand this special case. Perhaps I'll ask
the questions later.
> If an eligible
> thread is found to be PF_EXITING,
The problem is, we can't trust per-thread PF_EXITING checks. But I guess
we will discuss this more anyway.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
2011-03-14 20:41 ` David Rientjes
@ 2011-03-15 19:21 ` Oleg Nesterov
0 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() starts with victim_points == 0. This means that
> > (most likely) any child has more points and can be killed erroneously.
> >
> > Also, "children has a different mm" doesn't match the reality, we
> > should check child->mm != t->mm. This check is not exactly correct
> > if t->mm == NULL but this doesn't really matter, oom_kill_task()
> > will kill them anyway.
> >
> > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> > too.
> >
>
> There're two issues you're addressing in this patch. It only kills a
> child in place of its selected parent when:
>
> - the child has a higher badness score, and
>
> - it has a different ->mm.
>
> In the former case, NACK, we always want to sacrifice children regardless
> of their badness score (as long as it is non-zero) if it has a separate
> ->mm in place of its parent,
Ah. So this was intentional?
OK. I was hypnotized by the security implications, and this looked so
"obviously wrong" to me.
But, of course I can't judge when it comes to oom's heuristic, and you
certainly know better.
So, thanks for correcting me.
Just a question... what about oom_kill_allocating_task? Probably
Documentation/sysctl/vm.txt should be updated.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-15 19:12 ` Oleg Nesterov
@ 2011-03-15 19:51 ` David Rientjes
0 siblings, 0 replies; 16+ messages in thread
From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Tue, 15 Mar 2011, Oleg Nesterov wrote:
> When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
> and the changelog says:
>
> __oom_kill_task() is called to elevate the task's timeslice and give it
> access to memory reserves so that it may quickly exit.
>
> This privilege is unnecessary, however, if the task has already detached
> its mm.
>
> Now you are saing this is pointless.
>
If you have the commit id, do a "git blame 8123681022", because I see a:
5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 222) if (!p->mm)
5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 223) continue;
in select_bad_process() and it's also iterating over every thread:
a49335cc (Paul Jackson 2005-09-06 15:18:09 -0700 215) do_each_thread(g, p) {
It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in
-mm reintroduced the filter for !p->mm in select_bad_process() which was
still there when 81236810 was merged; it's a small optimization, though,
to avoid races where the mm becomes detached between the process'
selection in select_bad_process() and its kill in oom_kill_process().
> The problem is, we can't trust per-thread PF_EXITING checks. But I guess
> we will discuss this more anyway.
>
My approach, as you saw with
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in
-mm is to add exceptions to the oom killer when we can't trust that
PF_EXITING will soon be exiting. I think that's a much more long-term
maintainable solution instead of inferring the status of a thread based on
external circumstances (such as number of threads in the thread group)
that could easily change out from under us and once again break the oom
killer.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-15 18:53 ` Oleg Nesterov
@ 2011-03-15 19:54 ` David Rientjes
2011-03-15 21:16 ` Oleg Nesterov
0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Tue, 15 Mar 2011, Oleg Nesterov wrote:
> Confused. I sent the test-case. OK, may be you meant the code in -mm,
> but I meant the current code.
>
This entire discussion, and your involvement in it, originated from these
two patches being merged into -mm:
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
So naturally I'm going to challenge your testcases with the latest -mm.
If you wanted to suggest pushing these to 2.6.38 earlier, I don't think
anyone would have disputed that -- I certainly wouldn't have since the
first fixes a quite obvious panic that we've faced on a lot of our
machines. It's not that big of a deal, though, since Andrew has targeted
them for -stable and they're on schedule to be pushed to 2.6.38.x
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-15 19:54 ` David Rientjes
@ 2011-03-15 21:16 ` Oleg Nesterov
0 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/15, David Rientjes wrote:
>
> On Tue, 15 Mar 2011, Oleg Nesterov wrote:
>
> > Confused. I sent the test-case. OK, may be you meant the code in -mm,
> > but I meant the current code.
> >
>
> This entire discussion, and your involvement in it, originated from these
> two patches being merged into -mm:
>
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
Yes. This motivated me to look at the current code, and I was unpleasantly
surprised.
> So naturally I'm going to challenge your testcases with the latest -mm.
Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot
about the second patch.
> If you wanted to suggest pushing these to 2.6.38 earlier,
Yes. It was too late for 2.6.38. I thought we have more time before
release.
Oleg.
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2011-03-15 21:26 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <alpine.DEB.2.00.1103011108400.28110@chino.kir.corp.google.com>
[not found] ` <20110303100030.B936.A69D9226@jp.fujitsu.com>
[not found] ` <20110308134233.GA26884@redhat.com>
[not found] ` <alpine.DEB.2.00.1103081549530.27910@chino.kir.corp.google.com>
[not found] ` <20110309151946.dea51cde.akpm@linux-foundation.org>
[not found] ` <alpine.DEB.2.00.1103111142260.30699@chino.kir.corp.google.com>
[not found] ` <20110312123413.GA18351@redhat.com>
[not found] ` <20110312134341.GA27275@redhat.com>
[not found] ` <AANLkTinHGSb2_jfkwx=Wjv96phzPCjBROfCTFCKi4Wey@mail.gmail.com>
[not found] ` <20110313212726.GA24530@redhat.com>
2011-03-14 19:04 ` [PATCH 0/3 for 2.6.38] oom: fixes Oleg Nesterov
2011-03-14 19:04 ` [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm Oleg Nesterov
2011-03-14 19:35 ` Linus Torvalds
2011-03-14 20:31 ` Oleg Nesterov
2011-03-14 20:32 ` David Rientjes
2011-03-15 19:12 ` Oleg Nesterov
2011-03-15 19:51 ` David Rientjes
2011-03-14 20:22 ` David Rientjes
2011-03-15 18:53 ` Oleg Nesterov
2011-03-15 19:54 ` David Rientjes
2011-03-15 21:16 ` Oleg Nesterov
2011-03-14 19:05 ` [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies Oleg Nesterov
2011-03-14 20:50 ` David Rientjes
2011-03-14 19:05 ` [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic Oleg Nesterov
2011-03-14 20:41 ` David Rientjes
2011-03-15 19:21 ` Oleg Nesterov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox