* [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE
2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
@ 2011-03-12 13:44 ` Oleg Nesterov
2011-03-13 1:14 ` David Rientjes
2011-03-12 13:44 ` [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check Oleg Nesterov
` (3 subsequent siblings)
4 siblings, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:44 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
Andrey Vagin, David Rientjes
oom_kill_task() kills the whole thread group, but only one task gets
TIF_MEMDIE. This can't be right, every test_thread_flag(TIF_MEMDIE)
check is per-thread.
I think it should be replaced by MMF_, but as a first step let's change
oom_kill_task() to mark every sub-thread as TIF_MEMDIE. And change the
"Kill all processes sharing p->mm" code the same way.
This also fixes another problem. sysctl_oom_kill_allocating_task case
does oom_kill_process(current). If current is not the main thread, then
select_bad_process() won't see the TIF_MEMDIE task.
Note:
- oom_kill_task()->for_each_process() is wrong. It can't detect
all processes sharing p->mm. The fix is simple
- This patch doesn't change other callers of set_*(TIF_MEMDIE).
This needs a separate discussion, but oom_kill_process() is
simply wrong.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
--- 38/mm/oom_kill.c~oom_kill_spread_memdie 2011-03-12 14:19:36.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-12 14:20:42.000000000 +0100
@@ -401,6 +401,18 @@ static void dump_header(struct task_stru
dump_tasks(mem, nodemask);
}
+static void do_oom_kill(struct task_struct *p)
+{
+ struct task_struct *t;
+
+ t = p;
+ do {
+ set_tsk_thread_flag(t, TIF_MEMDIE);
+ } while_each_thread(p, t);
+
+ force_sig(SIGKILL, p);
+}
+
#define K(x) ((x) << (PAGE_SHIFT-10))
static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
@@ -436,12 +448,10 @@ static int oom_kill_task(struct task_str
pr_err("Kill process %d (%s) sharing same memory\n",
task_pid_nr(q), q->comm);
task_unlock(q);
- force_sig(SIGKILL, q);
+ do_oom_kill(q);
}
- set_tsk_thread_flag(p, TIF_MEMDIE);
- force_sig(SIGKILL, p);
-
+ do_oom_kill(p);
/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE
2011-03-12 13:44 ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
@ 2011-03-13 1:14 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-13 1:14 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins,
linux-mm, Andrey Vagin
On Sat, 12 Mar 2011, Oleg Nesterov wrote:
> --- 38/mm/oom_kill.c~oom_kill_spread_memdie 2011-03-12 14:19:36.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-12 14:20:42.000000000 +0100
> @@ -401,6 +401,18 @@ static void dump_header(struct task_stru
> dump_tasks(mem, nodemask);
> }
>
> +static void do_oom_kill(struct task_struct *p)
> +{
> + struct task_struct *t;
> +
> + t = p;
> + do {
> + set_tsk_thread_flag(t, TIF_MEMDIE);
> + } while_each_thread(p, t);
> +
> + force_sig(SIGKILL, p);
> +}
> +
> #define K(x) ((x) << (PAGE_SHIFT-10))
> static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> {
> @@ -436,12 +448,10 @@ static int oom_kill_task(struct task_str
> pr_err("Kill process %d (%s) sharing same memory\n",
> task_pid_nr(q), q->comm);
> task_unlock(q);
> - force_sig(SIGKILL, q);
> + do_oom_kill(q);
> }
>
> - set_tsk_thread_flag(p, TIF_MEMDIE);
> - force_sig(SIGKILL, p);
> -
> + do_oom_kill(p);
> /*
> * We give our sacrificial lamb high priority and access to
> * all the memory it needs. That way it should be able to
As mentioned in the first posting of this patch, this isn't appropraite:
we don't want to set TIF_MEMDIE to all threads unless they can't reclaim
memory in the page allocator and are still in an oom condition. The point
is to try to limit TIF_MEMDIE to only those threads that are guaranteed to
already be in the exit path or handling SIGKILLs, otherwise we risk
depleting memory reserves entirely and there's no sanity checking here
that would suggest that can't happen.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check
2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
2011-03-12 13:44 ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
@ 2011-03-12 13:44 ` Oleg Nesterov
2011-03-12 13:44 ` [PATCH 3/3] oom: select_bad_process: use same_thread_group() Oleg Nesterov
` (2 subsequent siblings)
4 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:44 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
Andrey Vagin, David Rientjes
The current PF_EXITING check in select_bad_process() is very limited,
it only works if the task is single-threaded.
Add the new helper which tries to handle the mt case. It is not exactly
clear what should we actually check in this case. This patch checks
PF_EXITING, but perhaps we can take signal_group_exit() into account.
In this case select_bad_process() could detect the exiting process even
before every thread calls exit().
Note:
- "if (p != current)" check is obviously wrong in mt case too.
- with or without this change, we should probably check
mm->core_state == NULL. We shouldn't assume that is is going
to exit "soon" otherwise. But this needs other changes anyway,
and in the common case when we do not share ->mm with another
process the false positive is not possible.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~detect_exiting 2011-03-12 14:20:42.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-12 14:22:19.000000000 +0100
@@ -282,6 +282,22 @@ static enum oom_constraint constrained_a
}
#endif
+static bool mm_is_exiting(struct task_struct *p)
+{
+ struct task_struct *t;
+ bool has_mm = false;
+
+ t = p;
+ do {
+ if (!(t->flags & PF_EXITING))
+ return false;
+ if (t->mm)
+ has_mm = true;
+ } while_each_thread(p, t);
+
+ return has_mm;
+}
+
/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. We expect the caller will lock the tasklist.
@@ -324,7 +340,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
+ if (mm_is_exiting(p)) {
if (p != current)
return ERR_PTR(-1UL);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* [PATCH 3/3] oom: select_bad_process: use same_thread_group()
2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
2011-03-12 13:44 ` [PATCH 1/3] oom: oom_kill_task: mark every thread as TIF_MEMDIE Oleg Nesterov
2011-03-12 13:44 ` [PATCH 2/3] oom: select_bad_process: improve the PF_EXITING check Oleg Nesterov
@ 2011-03-12 13:44 ` Oleg Nesterov
2011-03-12 19:40 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
2011-03-13 11:36 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes KOSAKI Motohiro
4 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-12 13:44 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm,
Andrey Vagin, David Rientjes
select_bad_process() checks if the exiting process is current. This
condition can never be true if the caller is not the main thread.
Use same_thread_group() instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- 38/mm/oom_kill.c~fix_ck_current 2011-03-12 14:22:19.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-12 14:27:14.000000000 +0100
@@ -341,7 +341,7 @@ static struct task_struct *select_bad_pr
* Otherwise we could get an easy OOM deadlock.
*/
if (mm_is_exiting(p)) {
- if (p != current)
+ if (!same_thread_group(p, current))
return ERR_PTR(-1UL);
chosen = p;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
` (2 preceding siblings ...)
2011-03-12 13:44 ` [PATCH 3/3] oom: select_bad_process: use same_thread_group() Oleg Nesterov
@ 2011-03-12 19:40 ` Hugh Dickins
2011-03-13 8:53 ` KOSAKI Motohiro
2011-03-13 21:27 ` Oleg Nesterov
2011-03-13 11:36 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes KOSAKI Motohiro
4 siblings, 2 replies; 69+ messages in thread
From: Hugh Dickins @ 2011-03-12 19:40 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm,
Andrey Vagin, David Rientjes
Hi Oleg,
On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 03/12, Oleg Nesterov wrote:
>>
>> On 03/11, David Rientjes wrote:
>> >
>> > On Wed, 9 Mar 2011, Andrew Morton wrote:
>> >
>> > > If Oleg's test program cause a hang with
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch and doesn't
>> > > cause a hang without
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch then that's a
>> > > big problem for
>> > > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch, no?
>> > >
>> >
>> > It's a problem, but not because of
>> > oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch.
>>
>> It is, afaics. oom-killer can't ussume that a single PF_EXITING && p->mm
>> thread is going to free the memory.
>>
>> > If we don't
>> > have this patch, then we have a trivial panic when an oom kill occurs in a
>> > cpuset with no other eligible processes, the oom killed thread group
>> > leader exits
>>
>> It is not clear what "leader exits" actually mean. OK, perhaps you mean
>> its ->mm == NULL.
>>
>> > but its other threads do not and they trigger oom kills
>> > themselves. for_each_process() does not iterate over these threads and so
>> > it finds no eligible threads to kill and then panics
>>
>> Could you explain what do you mean? No need to kill these threads, they
>> are already killed, we should wait until they all exit.
>>
>> > I'll look at Oleg's test case
>> > and see what can be done to fix that condition, but the answer isn't to
>> > ignore eligible threads that can be killed.
>>
>> Once again, they are already killed. Or I do not understand what you meant.
>>
>> Could you please explain the problem in more details?
>>
>>
>> Also. Could you please look at the patches I sent?
>>
>> [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
>> [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check
>
> Cough. And both were not right, while_each_thread(p, t) needs the properly
> initialized "t". At least I warned they were not tested ;)
>
>> Note also the note about "p == current" check. it should be fixed too.
>
> I am resending the fixes above plus the new one.
I've spent much of the week building up to join in, but the more I
look around, the more I find to say or investigate, and therefore
never quite get to write the mail. Let this be a placeholder, that I
probably disagree (in an amicable way!) with all of you, and maybe
I'll finally manage to collect my thoughts into mail later today.
I guess my main point will be that TIF_MEMDIE serves a number of
slightly different, even conflicting, purposes; and one of those
purposes, which present company seems to ignore repeatedly, is to
serialize access to final reserves of memory - as a comment by Nick in
select_bad_process() makes clear. (This is distinct from the
serialization to avoid OOM-killing rampage.)
We _might_ choose to abandon that, but if so, it should be a decision,
not an oversight. So I cannot blindly agree with just setting
TIF_MEMDIE on more and more tasks, even if they share the same mm. I
wonder if use of your find_lock_task_mm() in select_bad_process()
might bring together my wish to continue serialization, David's wish
to avoid stupid panics, and your wish to avoid deadlocks.
Though any serialization has a risk of deadlock: we probably need to
weigh up how realistic different cases are. Which brings me neatly
to your little pthread_create, ptrace proggy... I dare say you and
Kosaki and David know exactly what it's doing and why it's a problem,
but even after repeated skims of the ptrace manpage, I'll admit to
not having a clue, nor the inclination to run and then debug it to
find out the answer. Please, Oleg, would you mind very much
explaining it to me? I don't even know if the double pthread_create is
a vital part of the scheme, or just a typo. I see it doesn't even
allocate any special memory, so I assume it leaves a PF_EXITING around
forever, but I couldn't quite see how (with PF_EXITING being set after
the tracehook_report_exit). And I wonder if a similar case can be
constructed to deadlock the for_each_process version of
select_bad_process().
I worry more about someone holding a reference to the mm via /proc (I
see memory allocations after getting the mm).
Thanks; until later,
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
2011-03-12 19:40 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
@ 2011-03-13 8:53 ` KOSAKI Motohiro
2011-03-13 21:27 ` Oleg Nesterov
1 sibling, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-13 8:53 UTC (permalink / raw)
To: Hugh Dickins
Cc: kosaki.motohiro, Oleg Nesterov, Andrew Morton, KAMEZAWA Hiroyuki,
linux-mm, Andrey Vagin, David Rientjes
> I've spent much of the week building up to join in, but the more I
> look around, the more I find to say or investigate, and therefore
> never quite get to write the mail. Let this be a placeholder, that I
> probably disagree (in an amicable way!) with all of you, and maybe
> I'll finally manage to collect my thoughts into mail later today.
>
> I guess my main point will be that TIF_MEMDIE serves a number of
> slightly different, even conflicting, purposes; and one of those
> purposes, which present company seems to ignore repeatedly, is to
> serialize access to final reserves of memory - as a comment by Nick in
> select_bad_process() makes clear. (This is distinct from the
> serialization to avoid OOM-killing rampage.)
>
> We _might_ choose to abandon that, but if so, it should be a decision,
> not an oversight. So I cannot blindly agree with just setting
> TIF_MEMDIE on more and more tasks, even if they share the same mm. I
> wonder if use of your find_lock_task_mm() in select_bad_process()
> might bring together my wish to continue serialization, David's wish
> to avoid stupid panics, and your wish to avoid deadlocks.
>
> Though any serialization has a risk of deadlock: we probably need to
> weigh up how realistic different cases are. Which brings me neatly
> to your little pthread_create, ptrace proggy... I dare say you and
> Kosaki and David know exactly what it's doing and why it's a problem,
> but even after repeated skims of the ptrace manpage, I'll admit to
> not having a clue, nor the inclination to run and then debug it to
> find out the answer. Please, Oleg, would you mind very much
> explaining it to me? I don't even know if the double pthread_create is
> a vital part of the scheme, or just a typo. I see it doesn't even
> allocate any special memory, so I assume it leaves a PF_EXITING around
> forever, but I couldn't quite see how (with PF_EXITING being set after
> the tracehook_report_exit). And I wonder if a similar case can be
> constructed to deadlock the for_each_process version of
> select_bad_process().
>
> I worry more about someone holding a reference to the mm via /proc (I
> see memory allocations after getting the mm).
>
> Thanks; until later,
> Hugh
Hi Hugh,
Andrew Vagin showed us actual deadlock case in "[PATCH] mm: check
zone->all_unreclaimable in all_unreclaimable()" thread. and I'm
now digging it. Unfortunatelly the cause is not single. then, I
couln't explained how exact event was occur and what should we do.
However, I can say, at least, TIF_MEMDIE is one of root cause of the
andrey's deadlock. I could observed TIF_MEMDIE process never die and
all other process continue to wait. then, the system become hang up.
I'm still continue to investigate and I wish I find minimum negative
impact solution, but I'm not wonder I finally decide to abandon both
TIF_MEMDIE and boost_dying_task_prio. The hang-up can be reproduced
too easily.
Unfortunatelly, my country is under slightly rare natural disaster
and I don't think I'm going to concentrate a debug activity awhile.
I'm sorry for the lazy activity.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
2011-03-12 19:40 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
2011-03-13 8:53 ` KOSAKI Motohiro
@ 2011-03-13 21:27 ` Oleg Nesterov
2011-03-14 19:04 ` Oleg Nesterov
1 sibling, 1 reply; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-13 21:27 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, linux-mm,
Andrey Vagin, David Rientjes
Hi Hugh,
On 03/12, Hugh Dickins wrote:
>
> On Sat, Mar 12, 2011 at 5:43 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> Also. Could you please look at the patches I sent?
> >>
> >> [PATCH 1/1] oom_kill_task: mark every thread as TIF_MEMDIE
> >> [PATCH v2 1/1] select_bad_process: improve the PF_EXITING check
> >
> > Cough. And both were not right, while_each_thread(p, t) needs the properly
> > initialized "t". At least I warned they were not tested ;)
> >
> >> Note also the note about "p == current" check. it should be fixed too.
> >
> > I am resending the fixes above plus the new one.
>
> I've spent much of the week building up to join in, but the more I
> look around, the more I find to say or investigate, and therefore
> never quite get to write the mail. Let this be a placeholder, that I
> probably disagree (in an amicable way!) with all of you, and maybe
> I'll finally manage to collect my thoughts into mail later today.
Thanks for looking into this!
Before I say anything else, I'd like to repeat that I am not going
to argue with the nack. The only reason I sent these patches is that
I can not understand David's patch at all. I mean, I can not even
understand the problems it should address. Yet I think the patch is
not right.
So I just tried to guess why this change helps, and suggest the
alternatives for review/testing.
> I guess my main point will be that TIF_MEMDIE serves a number of
> slightly different, even conflicting, purposes;
Yes. And let me repeat, I do not pretend understand it. However,
I bet the usage of TIF_MEMDIE is wrong.
> and one of those
> purposes, which present company seems to ignore repeatedly, is to
> serialize access to final reserves of memory
I don't quite understand "serialize" above. I hope you didn't mean
that only one thread can have TIF_MEMDIE to avoid the races...
But yes, I understand that "a lot" of TIF_MEMDIE thread can abuse
__alloc_pages_high_priority/etc. At least, I hope I understand ;)
- as a comment by Nick in
> select_bad_process() makes clear.
Oh. This comment is not clear at all.
* This task already has access to memory reserves and is
* being killed.
which task? it is quite possible that this task is already dead/released.
Or it is just exiting without memory allocations. This is group leader.
All we know is that it has task_struct, no more.
* Don't allow any other task access to the
* memory reserve.
Which other tasks? What about the sub-threads of the killed process?
It is quite possible that another thread triggered OOM and needs
TIF_MEMDIE to access the memory reserves.
And even this is not consistent. sysctl_oom_kill_allocating_task
does oom_kill_process(current) which may be non-leader.
oom_kill_process()->set_tsk_thread_flag(p, TIF_MEMDIE) is simply wrong.
The _trivial_ exploit (distinct from this one) can kill the system. And
worse! I showed this exploit many times (most probably off-list but all
were cc'ed). This was already fixed (iirc), and know I see we have it
again. This because almost every PF_EXITING check in oom_kill.c is wrong.
I'll return to this tomorrow.
> We _might_ choose to abandon that, but if so, it should be a decision,
> not an oversight. So I cannot blindly agree with just setting
> TIF_MEMDIE on more and more tasks,
OK, lets not do this.
> wonder if use of your find_lock_task_mm() in select_bad_process()
> might bring together my wish to continue serialization, David's wish
> to avoid stupid panics, and your wish to avoid deadlocks.
Hmm. Could you explain?
> but even after repeated skims of the ptrace manpage, I'll admit to
> not having a clue, nor the inclination to run and then debug it to
> find out the answer.
Ahh, sorry. I didn't explain what this test-case does... Because we
discussed this many times before, iirc.
> I don't even know if the double pthread_create is
> a vital part of the scheme
it is,
> so I assume it leaves a PF_EXITING around
> forever,
More precisely, it creates the PF_EXITING thread with ->mm != NULL.
It never goes away (unless you kill the tracer, of course).
> but I couldn't quite see how (with PF_EXITING being set after
> the tracehook_report_exit).
Yes. But note that it creates 2 threads, and the second one hangs
in exit_mm() before clearing ->mm while the 3rd thread can waits
for the 1st one which should enter exit_mm() and participate.
But this is not important, you can ignore the actual details.
> And I wonder if a similar case can be
> constructed to deadlock the for_each_process version of
> select_bad_process().
Unfortunately yes. This is documented in the changelog of 2/2.
We should fix the problems with the coredumps. Hmm, we already
had some patches, but they were forgotten/ignored. But at least
we shouldn't move back and assume that "PF_EXITING && mm" means
"we will have more memory soon".
I'll return tomorrow. At first glance, the new patch from David
has the same problem, but I am not sure.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 0/3 for 2.6.38] oom: fixes
2011-03-13 21:27 ` Oleg Nesterov
@ 2011-03-14 19:04 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.
I am going to give up, I do not understand the changes in -mm ;)
But I think we have other problems which should be fixed first.
Note that each fix has the "this is _not_ enough" comment. I was
going to _try_ to do something more complete later, but since we
are going to add more hacks^W subtle changes... lets fix the obvious
bugs first.
Add Linus. Hopefully the changes are simple enough. But once again,
we need more.
Oleg.
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 0/3 for 2.6.38] oom: fixes
@ 2011-03-14 19:04 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
On 03/13, Oleg Nesterov wrote:
>
> The _trivial_ exploit (distinct from this one) can kill the system.
> ...
> I'll return to this tomorrow.
I am going to give up, I do not understand the changes in -mm ;)
But I think we have other problems which should be fixed first.
Note that each fix has the "this is _not_ enough" comment. I was
going to _try_ to do something more complete later, but since we
are going to add more hacks^W subtle changes... lets fix the obvious
bugs first.
Add Linus. Hopefully the changes are simple enough. But once again,
we need more.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:04 ` Oleg Nesterov
@ 2011-03-14 19:04 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
This is very wrong by many reasons. In particular, this thread can
be the dead group leader. Check p->mm != NULL.
Note: this is _not_ enough. Just a minimal fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100
@@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
- if (p->flags & PF_EXITING) {
+ if (p->flags & PF_EXITING && p->mm) {
set_tsk_thread_flag(p, TIF_MEMDIE);
boost_dying_task_prio(p, mem);
return 0;
^ permalink raw reply [flat|nested] 69+ messages in thread* [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 19:04 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:04 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
This is very wrong by many reasons. In particular, this thread can
be the dead group leader. Check p->mm != NULL.
Note: this is _not_ enough. Just a minimal fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100
@@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
- if (p->flags & PF_EXITING) {
+ if (p->flags & PF_EXITING && p->mm) {
set_tsk_thread_flag(p, TIF_MEMDIE);
boost_dying_task_prio(p, mem);
return 0;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:04 ` Oleg Nesterov
@ 2011-03-14 19:35 ` Linus Torvalds
-1 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
Explain more, please. Maybe I'm missing some context because I wasn't
cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
and exit_mm() is called almost immediately afterwards which will set
p->mm to NULL.
So afaik, this will basically just remove the whole point of the code
entirely - so why not remove it then?
The combination of testing PF_EXITING and p->mm just doesn't seem to
make any sense.
Linus
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 19:35 ` Linus Torvalds
0 siblings, 0 replies; 69+ messages in thread
From: Linus Torvalds @ 2011-03-14 19:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
Explain more, please. Maybe I'm missing some context because I wasn't
cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
and exit_mm() is called almost immediately afterwards which will set
p->mm to NULL.
So afaik, this will basically just remove the whole point of the code
entirely - so why not remove it then?
The combination of testing PF_EXITING and p->mm just doesn't seem to
make any sense.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:35 ` Linus Torvalds
@ 2011-03-14 20:31 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, Linus Torvalds wrote:
>
> On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
>
> Explain more, please. Maybe I'm missing some context because I wasn't
> cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
> and exit_mm() is called almost immediately afterwards which will set
> p->mm to NULL.
>
> So afaik, this will basically just remove the whole point of the code
> entirely - so why not remove it then?
I am afraid I am going to lie... But iirc I tried to remove this code
before. Can't find the previous discussion, probably I am wrong.
Anyway. I never understood why do we have this special case.
> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
To me, it doesn't make too much sense even if we do not check ->mm.
But. I _think_ the intent was to wait until this "exiting" process
does exit_mm() and frees the memory. This is like the
"the process of releasing memory " code in select_bad_process(). Once
again, this is only my speculation.
In any case, this patch doesn't pretend to be the right fix.
Oleg.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 20:31 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 20:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Hugh Dickins, Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki,
Andrey Vagin, David Rientjes, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, Linus Torvalds wrote:
>
> On Mon, Mar 14, 2011 at 12:04 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
>
> Explain more, please. Maybe I'm missing some context because I wasn't
> cc'd on the original thread, but PF_EXITING gets set by exit_signal(),
> and exit_mm() is called almost immediately afterwards which will set
> p->mm to NULL.
>
> So afaik, this will basically just remove the whole point of the code
> entirely - so why not remove it then?
I am afraid I am going to lie... But iirc I tried to remove this code
before. Can't find the previous discussion, probably I am wrong.
Anyway. I never understood why do we have this special case.
> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
To me, it doesn't make too much sense even if we do not check ->mm.
But. I _think_ the intent was to wait until this "exiting" process
does exit_mm() and frees the memory. This is like the
"the process of releasing memory " code in select_bad_process(). Once
again, this is only my speculation.
In any case, this patch doesn't pretend to be the right fix.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:35 ` Linus Torvalds
@ 2011-03-14 20:32 ` David Rientjes
-1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Linus Torvalds wrote:
> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
>
Right, it doesn't (and I recently removed testing the combination from
select_bad_process() in -mm). The check for PF_EXITING in the oom killer
is purely to avoid needlessly killing tasks when something is already
exiting and will (hopefully) be freeing its memory soon. If an eligible
thread is found to be PF_EXITING, the oom killer will defer indefinitely
unless it happens to be current. If it happens to be current, then it is
automatically selected so it gets access to the needed memory reserves.
We do need to ensure that behavior doesn't preempt any task from being
killed if there's an eligible thread other than current that never
actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch
in -mm filters all threads without an ->mm). That can happen if
mm->mmap_sem never gets released by a thread and that's why an earlier
change that is already in your tree automatically gives current access to
memory reserves immediately upon calling the oom killer if it has a
pending SIGKILL.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 20:32 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Linus Torvalds wrote:
> The combination of testing PF_EXITING and p->mm just doesn't seem to
> make any sense.
>
Right, it doesn't (and I recently removed testing the combination from
select_bad_process() in -mm). The check for PF_EXITING in the oom killer
is purely to avoid needlessly killing tasks when something is already
exiting and will (hopefully) be freeing its memory soon. If an eligible
thread is found to be PF_EXITING, the oom killer will defer indefinitely
unless it happens to be current. If it happens to be current, then it is
automatically selected so it gets access to the needed memory reserves.
We do need to ensure that behavior doesn't preempt any task from being
killed if there's an eligible thread other than current that never
actually detaches its ->mm (oom-skip-zombies-when-iterating-tasklist.patch
in -mm filters all threads without an ->mm). That can happen if
mm->mmap_sem never gets released by a thread and that's why an earlier
change that is already in your tree automatically gives current access to
memory reserves immediately upon calling the oom killer if it has a
pending SIGKILL.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 20:32 ` David Rientjes
@ 2011-03-15 19:12 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw)
To: David Rientjes
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Linus Torvalds wrote:
>
> > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > make any sense.
> >
>
> Right, it doesn't (and I recently removed testing the combination from
> select_bad_process() in -mm). The check for PF_EXITING in the oom killer
> is purely to avoid needlessly killing tasks when something is already
> exiting
Maybe 0/3 wasn't clear enough. This patches does not try to fix things,
it only tries to close the hole in 2.6.38. But it was already released
today.
> and will (hopefully) be freeing its memory soon.
This is not clear to me.
When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
and the changelog says:
__oom_kill_task() is called to elevate the task's timeslice and give it
access to memory reserves so that it may quickly exit.
This privilege is unnecessary, however, if the task has already detached
its mm.
Now you are saing this is pointless.
OK. I already said I do not understand this special case. Perhaps I'll ask
the questions later.
> If an eligible
> thread is found to be PF_EXITING,
The problem is, we can't trust per-thread PF_EXITING checks. But I guess
we will discuss this more anyway.
Oleg.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 19:12 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:12 UTC (permalink / raw)
To: David Rientjes
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Linus Torvalds wrote:
>
> > The combination of testing PF_EXITING and p->mm just doesn't seem to
> > make any sense.
> >
>
> Right, it doesn't (and I recently removed testing the combination from
> select_bad_process() in -mm). The check for PF_EXITING in the oom killer
> is purely to avoid needlessly killing tasks when something is already
> exiting
Maybe 0/3 wasn't clear enough. This patches does not try to fix things,
it only tries to close the hole in 2.6.38. But it was already released
today.
> and will (hopefully) be freeing its memory soon.
This is not clear to me.
When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
and the changelog says:
__oom_kill_task() is called to elevate the task's timeslice and give it
access to memory reserves so that it may quickly exit.
This privilege is unnecessary, however, if the task has already detached
its mm.
Now you are saing this is pointless.
OK. I already said I do not understand this special case. Perhaps I'll ask
the questions later.
> If an eligible
> thread is found to be PF_EXITING,
The problem is, we can't trust per-thread PF_EXITING checks. But I guess
we will discuss this more anyway.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-15 19:12 ` Oleg Nesterov
@ 2011-03-15 19:51 ` David Rientjes
-1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Tue, 15 Mar 2011, Oleg Nesterov wrote:
> When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
> and the changelog says:
>
> __oom_kill_task() is called to elevate the task's timeslice and give it
> access to memory reserves so that it may quickly exit.
>
> This privilege is unnecessary, however, if the task has already detached
> its mm.
>
> Now you are saing this is pointless.
>
If you have the commit id, do a "git blame 8123681022", because I see a:
5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 222) if (!p->mm)
5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 223) continue;
in select_bad_process() and it's also iterating over every thread:
a49335cc (Paul Jackson 2005-09-06 15:18:09 -0700 215) do_each_thread(g, p) {
It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in
-mm reintroduced the filter for !p->mm in select_bad_process() which was
still there when 81236810 was merged; it's a small optimization, though,
to avoid races where the mm becomes detached between the process'
selection in select_bad_process() and its kill in oom_kill_process().
> The problem is, we can't trust per-thread PF_EXITING checks. But I guess
> we will discuss this more anyway.
>
My approach, as you saw with
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in
-mm is to add exceptions to the oom killer when we can't trust that
PF_EXITING will soon be exiting. I think that's a much more long-term
maintainable solution instead of inferring the status of a thread based on
external circumstances (such as number of threads in the thread group)
that could easily change out from under us and once again break the oom
killer.
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 19:51 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:51 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Tue, 15 Mar 2011, Oleg Nesterov wrote:
> When I did this change I looked at 81236810226f71bd9ff77321c8e8276dae7efc61
> and the changelog says:
>
> __oom_kill_task() is called to elevate the task's timeslice and give it
> access to memory reserves so that it may quickly exit.
>
> This privilege is unnecessary, however, if the task has already detached
> its mm.
>
> Now you are saing this is pointless.
>
If you have the commit id, do a "git blame 8123681022", because I see a:
5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 222) if (!p->mm)
5081dde3 (Nick Piggin 2006-09-25 23:31:32 -0700 223) continue;
in select_bad_process() and it's also iterating over every thread:
a49335cc (Paul Jackson 2005-09-06 15:18:09 -0700 215) do_each_thread(g, p) {
It's pointless since oom-skip-zombies-when-iterating-tasklist.patch in
-mm reintroduced the filter for !p->mm in select_bad_process() which was
still there when 81236810 was merged; it's a small optimization, though,
to avoid races where the mm becomes detached between the process'
selection in select_bad_process() and its kill in oom_kill_process().
> The problem is, we can't trust per-thread PF_EXITING checks. But I guess
> we will discuss this more anyway.
>
My approach, as you saw with
oom-avoid-deferring-oom-killer-if-exiting-task-is-being-traced.patch in
-mm is to add exceptions to the oom killer when we can't trust that
PF_EXITING will soon be exiting. I think that's a much more long-term
maintainable solution instead of inferring the status of a thread based on
external circumstances (such as number of threads in the thread group)
that could easily change out from under us and once again break the oom
killer.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 19:04 ` Oleg Nesterov
@ 2011-03-14 20:22 ` David Rientjes
-1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
>
This is true only for the oom_kill_allocating_task sysctl where it is
required in all cases to kill current; current won't be triggering the oom
killer if it's dead.
oom_kill_process() is called with the thread selected by
select_bad_process() and that function will not return any thread if any
eligible task is found to be PF_EXITING and is not current, or any
eligible task is found to have TIF_MEMDIE.
In other words, for this conditional to be true in oom_kill_process(),
then p must be current and so it cannot be the dead group leader as
specified in your changelog unless PF_EXITING gets set between
select_bad_process() and the oom_kill_process() call: we don't care about
that since it's in the exit path and we therefore want to give it access
to memory reserves to quickly exit anyway and the check for PF_EXITING in
select_bad_process() prevents any infinite loop of that task getting
constantly reselected if it's dead.
> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100
> @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
> * If the task is already exiting, don't alarm the sysadmin or kill
> * its children or threads, just set TIF_MEMDIE so it can die quickly
> */
> - if (p->flags & PF_EXITING) {
> + if (p->flags & PF_EXITING && p->mm) {
> set_tsk_thread_flag(p, TIF_MEMDIE);
> boost_dying_task_prio(p, mem);
> return 0;
>
>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-14 20:22 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:22 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> This is very wrong by many reasons. In particular, this thread can
> be the dead group leader. Check p->mm != NULL.
>
This is true only for the oom_kill_allocating_task sysctl where it is
required in all cases to kill current; current won't be triggering the oom
killer if it's dead.
oom_kill_process() is called with the thread selected by
select_bad_process() and that function will not return any thread if any
eligible task is found to be PF_EXITING and is not current, or any
eligible task is found to have TIF_MEMDIE.
In other words, for this conditional to be true in oom_kill_process(),
then p must be current and so it cannot be the dead group leader as
specified in your changelog unless PF_EXITING gets set between
select_bad_process() and the oom_kill_process() call: we don't care about
that since it's in the exit path and we therefore want to give it access
to memory reserves to quickly exit anyway and the check for PF_EXITING in
select_bad_process() prevents any infinite loop of that task getting
constantly reselected if it's dead.
> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~1_kill_fix_pf_exiting 2011-03-14 17:53:05.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 18:51:49.000000000 +0100
> @@ -470,7 +470,7 @@ static int oom_kill_process(struct task_
> * If the task is already exiting, don't alarm the sysadmin or kill
> * its children or threads, just set TIF_MEMDIE so it can die quickly
> */
> - if (p->flags & PF_EXITING) {
> + if (p->flags & PF_EXITING && p->mm) {
> set_tsk_thread_flag(p, TIF_MEMDIE);
> boost_dying_task_prio(p, mem);
> return 0;
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-14 20:22 ` David Rientjes
@ 2011-03-15 18:53 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
> >
>
> This is true only for the oom_kill_allocating_task sysctl where it is
> required in all cases to kill current; current won't be triggering the oom
> killer if it's dead.
>
> oom_kill_process() is called with the thread selected by
> select_bad_process() and that function will not return any thread if any
> eligible task is found to be PF_EXITING and is not current, or any
> eligible task is found to have TIF_MEMDIE.
>
> In other words, for this conditional to be true in oom_kill_process(),
> then p must be current and so it cannot be the dead group leader as
> specified in your changelog unless PF_EXITING gets set between
> select_bad_process() and the oom_kill_process() call: we don't care about
> that since it's in the exit path and we therefore want to give it access
> to memory reserves to quickly exit anyway and the check for PF_EXITING in
> select_bad_process() prevents any infinite loop of that task getting
> constantly reselected if it's dead.
Confused. I sent the test-case. OK, may be you meant the code in -mm,
but I meant the current code.
Oleg.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 18:53 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 18:53 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() simply sets TIF_MEMDIE and returns if PF_EXITING.
> > This is very wrong by many reasons. In particular, this thread can
> > be the dead group leader. Check p->mm != NULL.
> >
>
> This is true only for the oom_kill_allocating_task sysctl where it is
> required in all cases to kill current; current won't be triggering the oom
> killer if it's dead.
>
> oom_kill_process() is called with the thread selected by
> select_bad_process() and that function will not return any thread if any
> eligible task is found to be PF_EXITING and is not current, or any
> eligible task is found to have TIF_MEMDIE.
>
> In other words, for this conditional to be true in oom_kill_process(),
> then p must be current and so it cannot be the dead group leader as
> specified in your changelog unless PF_EXITING gets set between
> select_bad_process() and the oom_kill_process() call: we don't care about
> that since it's in the exit path and we therefore want to give it access
> to memory reserves to quickly exit anyway and the check for PF_EXITING in
> select_bad_process() prevents any infinite loop of that task getting
> constantly reselected if it's dead.
Confused. I sent the test-case. OK, may be you meant the code in -mm,
but I meant the current code.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-15 18:53 ` Oleg Nesterov
@ 2011-03-15 19:54 ` David Rientjes
-1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Tue, 15 Mar 2011, Oleg Nesterov wrote:
> Confused. I sent the test-case. OK, may be you meant the code in -mm,
> but I meant the current code.
>
This entire discussion, and your involvement in it, originated from these
two patches being merged into -mm:
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
So naturally I'm going to challenge your testcases with the latest -mm.
If you wanted to suggest pushing these to 2.6.38 earlier, I don't think
anyone would have disputed that -- I certainly wouldn't have since the
first fixes a quite obvious panic that we've faced on a lot of our
machines. It's not that big of a deal, though, since Andrew has targeted
them for -stable and they're on schedule to be pushed to 2.6.38.x
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 19:54 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-15 19:54 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Tue, 15 Mar 2011, Oleg Nesterov wrote:
> Confused. I sent the test-case. OK, may be you meant the code in -mm,
> but I meant the current code.
>
This entire discussion, and your involvement in it, originated from these
two patches being merged into -mm:
oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
oom-skip-zombies-when-iterating-tasklist.patch
So naturally I'm going to challenge your testcases with the latest -mm.
If you wanted to suggest pushing these to 2.6.38 earlier, I don't think
anyone would have disputed that -- I certainly wouldn't have since the
first fixes a quite obvious panic that we've faced on a lot of our
machines. It's not that big of a deal, though, since Andrew has targeted
them for -stable and they're on schedule to be pushed to 2.6.38.x
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
2011-03-15 19:54 ` David Rientjes
@ 2011-03-15 21:16 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/15, David Rientjes wrote:
>
> On Tue, 15 Mar 2011, Oleg Nesterov wrote:
>
> > Confused. I sent the test-case. OK, may be you meant the code in -mm,
> > but I meant the current code.
> >
>
> This entire discussion, and your involvement in it, originated from these
> two patches being merged into -mm:
>
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
Yes. This motivated me to look at the current code, and I was unpleasantly
surprised.
> So naturally I'm going to challenge your testcases with the latest -mm.
Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot
about the second patch.
> If you wanted to suggest pushing these to 2.6.38 earlier,
Yes. It was too late for 2.6.38. I thought we have more time before
release.
Oleg.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/3 for 2.6.38] oom: oom_kill_process: don't set TIF_MEMDIE if !p->mm
@ 2011-03-15 21:16 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 21:16 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/15, David Rientjes wrote:
>
> On Tue, 15 Mar 2011, Oleg Nesterov wrote:
>
> > Confused. I sent the test-case. OK, may be you meant the code in -mm,
> > but I meant the current code.
> >
>
> This entire discussion, and your involvement in it, originated from these
> two patches being merged into -mm:
>
> oom-prevent-unnecessary-oom-kills-or-kernel-panics.patch
> oom-skip-zombies-when-iterating-tasklist.patch
Yes. This motivated me to look at the current code, and I was unpleasantly
surprised.
> So naturally I'm going to challenge your testcases with the latest -mm.
Sure. And once again, I didn't expect the 2nd problem was fixed, I forgot
about the second patch.
> If you wanted to suggest pushing these to 2.6.38 earlier,
Yes. It was too late for 2.6.38. I thought we have more time before
release.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
2011-03-14 19:04 ` Oleg Nesterov
@ 2011-03-14 19:05 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away it its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.
Note: this is _not_ enough. Just a minimal fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
* blocked waiting for another task which itself is waiting
* for memory. Is there a better alternative?
*/
- if (test_tsk_thread_flag(p, TIF_MEMDIE))
+ if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+ !p->exit_state && thread_group_empty(p))
return ERR_PTR(-1UL);
/*
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
@ 2011-03-14 19:05 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
select_bad_process() assumes that a TIF_MEMDIE process should go away.
But it can only go away it its parent does wait(). Change this check to
ignore the TIF_MEMDIE zombies.
Note: this is _not_ enough. Just a minimal fix.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100
@@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
* blocked waiting for another task which itself is waiting
* for memory. Is there a better alternative?
*/
- if (test_tsk_thread_flag(p, TIF_MEMDIE))
+ if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
+ !p->exit_state && thread_group_empty(p))
return ERR_PTR(-1UL);
/*
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
2011-03-14 19:05 ` Oleg Nesterov
@ 2011-03-14 20:50 ` David Rientjes
-1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away it its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
>
The equivalent of this change would be to set TIF_MEMDIE for all threads
in a thread group when choosing a process to kill; as we've already
discussed in your first series of patches, that has the risk of fully
depleting memory reserves and causing the kernel the deadlock. We want to
limit TIF_MEMDIE to an oom killed task or to current when it is responding
to a SIGKILL or already in the exit path because we know it's exiting and
without memory reserves it may never exit.
This patch is even more concerning, however, because select_bad_process()
isn't even guaranteed to select a thread from the same thread group this
time.
> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
> * blocked waiting for another task which itself is waiting
> * for memory. Is there a better alternative?
> */
> - if (test_tsk_thread_flag(p, TIF_MEMDIE))
> + if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> + !p->exit_state && thread_group_empty(p))
> return ERR_PTR(-1UL);
>
> /*
>
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 2/3 for 2.6.38] oom: select_bad_process: ignore TIF_MEMDIE zombies
@ 2011-03-14 20:50 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:50 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> select_bad_process() assumes that a TIF_MEMDIE process should go away.
> But it can only go away it its parent does wait(). Change this check to
> ignore the TIF_MEMDIE zombies.
>
The equivalent of this change would be to set TIF_MEMDIE for all threads
in a thread group when choosing a process to kill; as we've already
discussed in your first series of patches, that has the risk of fully
depleting memory reserves and causing the kernel the deadlock. We want to
limit TIF_MEMDIE to an oom killed task or to current when it is responding
to a SIGKILL or already in the exit path because we know it's exiting and
without memory reserves it may never exit.
This patch is even more concerning, however, because select_bad_process()
isn't even guaranteed to select a thread from the same thread group this
time.
> Note: this is _not_ enough. Just a minimal fix.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~2_tif_memdie_zombie 2011-03-14 18:51:49.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 18:52:39.000000000 +0100
> @@ -311,7 +311,8 @@ static struct task_struct *select_bad_pr
> * blocked waiting for another task which itself is waiting
> * for memory. Is there a better alternative?
> */
> - if (test_tsk_thread_flag(p, TIF_MEMDIE))
> + if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> + !p->exit_state && thread_group_empty(p))
> return ERR_PTR(-1UL);
>
> /*
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
2011-03-14 19:04 ` Oleg Nesterov
@ 2011-03-14 19:05 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.
Also, "children has a different mm" doesn't match the reality, we
should check child->mm != t->mm. This check is not exactly correct
if t->mm == NULL but this doesn't really matter, oom_kill_task()
will kill them anyway.
Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
@@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim = p;
+ struct task_struct *victim;
struct task_struct *child;
- struct task_struct *t = p;
- unsigned int victim_points = 0;
+ struct task_struct *t;
+ unsigned int victim_points;
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
+ victim_points = oom_badness(p, mem, nodemask, totalpages);
+ victim = p;
+ t = p;
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
+ if (child->mm == t->mm)
+ continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
^ permalink raw reply [flat|nested] 69+ messages in thread* [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
@ 2011-03-14 19:05 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-14 19:05 UTC (permalink / raw)
To: Hugh Dickins, Linus Torvalds
Cc: Andrew Morton, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Andrey Vagin,
David Rientjes, Frantisek Hrbata, linux-mm, linux-kernel
oom_kill_process() starts with victim_points == 0. This means that
(most likely) any child has more points and can be killed erroneously.
Also, "children has a different mm" doesn't match the reality, we
should check child->mm != t->mm. This check is not exactly correct
if t->mm == NULL but this doesn't really matter, oom_kill_task()
will kill them anyway.
Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
+++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
@@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
- struct task_struct *victim = p;
+ struct task_struct *victim;
struct task_struct *child;
- struct task_struct *t = p;
- unsigned int victim_points = 0;
+ struct task_struct *t;
+ unsigned int victim_points;
if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
+ victim_points = oom_badness(p, mem, nodemask, totalpages);
+ victim = p;
+ t = p;
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned int child_points;
+ if (child->mm == t->mm)
+ continue;
/*
* oom_badness() returns 0 if the thread is unkillable
*/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
2011-03-14 19:05 ` Oleg Nesterov
@ 2011-03-14 20:41 ` David Rientjes
-1 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
>
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
>
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
>
There're two issues you're addressing in this patch. It only kills a
child in place of its selected parent when:
- the child has a higher badness score, and
- it has a different ->mm.
In the former case, NACK, we always want to sacrifice children regardless
of their badness score (as long as it is non-zero) if it has a separate
->mm in place of its parent, otherwise webservers will be killed instead
of one of their children serving a client, sshd could be killed instead of
bash, etc. The behavior of the oom killer has always been to try to kill
a child with its own ->mm first to avoid losing a large amount of work
being done or unnecessarily killing a job scheduler, for example, when
sacrificing a child would be satisfactory. It'll kill additional tasks,
and perhaps even the parent later if it has no more children, if the oom
condition persists.
In the latter case, I agree, we should be testing if the child has a
different ->mm before sacrificing it for its parent as the comment
indicates it will. I proposed that exact change in "oom: avoid deferring
oom killer if exiting task is being traced" posted to -mm a couple days
ago.
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
> struct mem_cgroup *mem, nodemask_t *nodemask,
> const char *message)
> {
> - struct task_struct *victim = p;
> + struct task_struct *victim;
> struct task_struct *child;
> - struct task_struct *t = p;
> - unsigned int victim_points = 0;
> + struct task_struct *t;
> + unsigned int victim_points;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
> * parent. This attempts to lose the minimal amount of work done while
> * still freeing memory.
> */
> + victim_points = oom_badness(p, mem, nodemask, totalpages);
> + victim = p;
> + t = p;
> do {
> list_for_each_entry(child, &t->children, sibling) {
> unsigned int child_points;
>
> + if (child->mm == t->mm)
> + continue;
> /*
> * oom_badness() returns 0 if the thread is unkillable
> */
>
>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
@ 2011-03-14 20:41 ` David Rientjes
0 siblings, 0 replies; 69+ messages in thread
From: David Rientjes @ 2011-03-14 20:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On Mon, 14 Mar 2011, Oleg Nesterov wrote:
> oom_kill_process() starts with victim_points == 0. This means that
> (most likely) any child has more points and can be killed erroneously.
>
> Also, "children has a different mm" doesn't match the reality, we
> should check child->mm != t->mm. This check is not exactly correct
> if t->mm == NULL but this doesn't really matter, oom_kill_task()
> will kill them anyway.
>
> Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> too.
>
There're two issues you're addressing in this patch. It only kills a
child in place of its selected parent when:
- the child has a higher badness score, and
- it has a different ->mm.
In the former case, NACK, we always want to sacrifice children regardless
of their badness score (as long as it is non-zero) if it has a separate
->mm in place of its parent, otherwise webservers will be killed instead
of one of their children serving a client, sshd could be killed instead of
bash, etc. The behavior of the oom killer has always been to try to kill
a child with its own ->mm first to avoid losing a large amount of work
being done or unnecessarily killing a job scheduler, for example, when
sacrificing a child would be satisfactory. It'll kill additional tasks,
and perhaps even the parent later if it has no more children, if the oom
condition persists.
In the latter case, I agree, we should be testing if the child has a
different ->mm before sacrificing it for its parent as the comment
indicates it will. I proposed that exact change in "oom: avoid deferring
oom killer if exiting task is being traced" posted to -mm a couple days
ago.
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- 38/mm/oom_kill.c~3_fix_kill_chld 2011-03-14 18:52:39.000000000 +0100
> +++ 38/mm/oom_kill.c 2011-03-14 19:36:01.000000000 +0100
> @@ -459,10 +459,10 @@ static int oom_kill_process(struct task_
> struct mem_cgroup *mem, nodemask_t *nodemask,
> const char *message)
> {
> - struct task_struct *victim = p;
> + struct task_struct *victim;
> struct task_struct *child;
> - struct task_struct *t = p;
> - unsigned int victim_points = 0;
> + struct task_struct *t;
> + unsigned int victim_points;
>
> if (printk_ratelimit())
> dump_header(p, gfp_mask, order, mem, nodemask);
> @@ -488,10 +488,15 @@ static int oom_kill_process(struct task_
> * parent. This attempts to lose the minimal amount of work done while
> * still freeing memory.
> */
> + victim_points = oom_badness(p, mem, nodemask, totalpages);
> + victim = p;
> + t = p;
> do {
> list_for_each_entry(child, &t->children, sibling) {
> unsigned int child_points;
>
> + if (child->mm == t->mm)
> + continue;
> /*
> * oom_badness() returns 0 if the thread is unkillable
> */
>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
2011-03-14 20:41 ` David Rientjes
@ 2011-03-15 19:21 ` Oleg Nesterov
-1 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() starts with victim_points == 0. This means that
> > (most likely) any child has more points and can be killed erroneously.
> >
> > Also, "children has a different mm" doesn't match the reality, we
> > should check child->mm != t->mm. This check is not exactly correct
> > if t->mm == NULL but this doesn't really matter, oom_kill_task()
> > will kill them anyway.
> >
> > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> > too.
> >
>
> There're two issues you're addressing in this patch. It only kills a
> child in place of its selected parent when:
>
> - the child has a higher badness score, and
>
> - it has a different ->mm.
>
> In the former case, NACK, we always want to sacrifice children regardless
> of their badness score (as long as it is non-zero) if it has a separate
> ->mm in place of its parent,
Ah. So this was intentional?
OK. I was hypnotized by the security implications, and this looked so
"obviously wrong" to me.
But, of course I can't judge when it comes to oom's heuristic, and you
certainly know better.
So, thanks for correcting me.
Just a question... what about oom_kill_allocating_task? Probably
Documentation/sysctl/vm.txt should be updated.
Oleg.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/3 for 2.6.38] oom: oom_kill_process: fix the child_points logic
@ 2011-03-15 19:21 ` Oleg Nesterov
0 siblings, 0 replies; 69+ messages in thread
From: Oleg Nesterov @ 2011-03-15 19:21 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, Linus Torvalds, Andrew Morton, KOSAKI Motohiro,
KAMEZAWA Hiroyuki, Andrey Vagin, Frantisek Hrbata, linux-mm,
linux-kernel
On 03/14, David Rientjes wrote:
>
> On Mon, 14 Mar 2011, Oleg Nesterov wrote:
>
> > oom_kill_process() starts with victim_points == 0. This means that
> > (most likely) any child has more points and can be killed erroneously.
> >
> > Also, "children has a different mm" doesn't match the reality, we
> > should check child->mm != t->mm. This check is not exactly correct
> > if t->mm == NULL but this doesn't really matter, oom_kill_task()
> > will kill them anyway.
> >
> > Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong
> > too.
> >
>
> There're two issues you're addressing in this patch. It only kills a
> child in place of its selected parent when:
>
> - the child has a higher badness score, and
>
> - it has a different ->mm.
>
> In the former case, NACK, we always want to sacrifice children regardless
> of their badness score (as long as it is non-zero) if it has a separate
> ->mm in place of its parent,
Ah. So this was intentional?
OK. I was hypnotized by the security implications, and this looked so
"obviously wrong" to me.
But, of course I can't judge when it comes to oom's heuristic, and you
certainly know better.
So, thanks for correcting me.
Just a question... what about oom_kill_allocating_task? Probably
Documentation/sysctl/vm.txt should be updated.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes
2011-03-12 13:43 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Oleg Nesterov
` (3 preceding siblings ...)
2011-03-12 19:40 ` [PATCH 0/3] oom: TIF_MEMDIE/PF_EXITING fixes Hugh Dickins
@ 2011-03-13 11:36 ` KOSAKI Motohiro
4 siblings, 0 replies; 69+ messages in thread
From: KOSAKI Motohiro @ 2011-03-13 11:36 UTC (permalink / raw)
To: Oleg Nesterov
Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Hugh Dickins,
linux-mm, Andrey Vagin, David Rientjes
> > Note also the note about "p == current" check. it should be fixed too.
>
> I am resending the fixes above plus the new one.
>
> David, Kosaki, what do you think?
Oleg, could you please give me some testing time? Now TIF_MEMDIE has
really unclear and nasty meanings. mainly 1) to prevent multiple oom-killer
(see select_bad_process) and 2) to allow to use zone's last resort reserved
memory (see gfp_to_alloc_flags). The latter has really problematic even if
apply or not apply your patch.
If no apply your patch, multi thread application have a lot of risk
to fail to exit when oom killed. because sub threads can't use reserved
memory and may get stucked exiting path.
if apply your patch, multi thread application also have a lot of risk
to fail to exit when oom killed. because if all thread used a little
reserved memory but it is not enough successful exit. It become deadlock.
The optimal way is, take a bonus one thread and successfull thread exiting
pass a bonus to sibling thread. But, who want to put performance overhead
into thread exiting path for only really really rare oom events? this is
the problem. therefore, I can't put ack until I've finished some test.
Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 69+ messages in thread