* Re: Memory overcommit
@ 2009-10-27 20:44 ` Hugh Dickins
0 siblings, 0 replies; 128+ messages in thread
From: Hugh Dickins @ 2009-10-27 20:44 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, linux-mm, linux-kernel, kosaki.motohiro,
minchan.kim, akpm, rientjes, aarcange
On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> Sigh, gnome-session has twice the value of mmap(1G).
> Of course, gnome-session only uses 6M bytes of anon.
> I wonder if this is because gnome-session has many children, but I need
> to dig more. Does anyone have an idea?
When preparing KSM unmerge to handle OOM, I looked at how the precedent
was handled by running a little program which mmaps an anonymous region
of the same size as physical memory, then tries to mlock it. The
program was such an obvious candidate to be killed, I was shocked
by the poor decisions the OOM killer made. Usually I ran it with
mem=512M, with gnome and firefox active. Often the OOM killer killed
it right the first time, but went wrong when I tried it a second time
(I think that's because of what's already swapped out the first time).
I built up a patchset of fixes, but once I came to split them up for
submission, not one of them seemed entirely satisfactory; and Andrea's
fix to the KSM/mlock deadlock forced me to abandon even the first of
the patches (we've since then fixed the way munlocking behaves, so
in theory could revisit that; but Andrea disliked what I was trying
to do there in KSM for other reasons, so I've not touched it since).
I had to get on with KSM, so I set it all aside: none of the issues
was a recent regression.
I did briefly wonder about the reliance on total_vm which you're now
looking into, but didn't touch that at all. Let me describe those
issues which I did try but fail to fix - I've no more time to deal
with them now than then, but ought at least to mention them to you.
1. select_bad_process() tries to avoid killing another process while
there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
processes. However, p->mm is set to NULL well before p reaches
exit_mmap() to actually free the memory, and there may be significant
delays in between (I think exit_robust_list() gave me a hang at one
stage). So in practice, even when the OOM killer selects the right
process to kill, there can be lots of collateral damage from it not
waiting long enough for that process to give up its memory.
I tried to deal with that by moving the TIF_MEMDIE test up before
the p->mm test, but adding in a check on p->exit_state:
	if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
	    !p->exit_state)
		return ERR_PTR(-1UL);
But this is then liable to hang the system if there's some reason
why the selected process cannot proceed to free its memory (e.g.
the current KSM unmerge case). It needs to wait "a while", but
give up if no progress is made, instead of hanging: originally
I thought that setting PF_MEMALLOC more widely in page_alloc.c,
and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
would deal with that; but we cannot be sure that waiting for memory
is the only reason for a holdup there (in the KSM unmerge case it's
waiting for an mmap_sem, and there may well be other such cases).
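The ordering problem in point 1 can be sketched with a toy model. This is not kernel code: the Task fields only borrow the names used in the mail, WAIT stands in for ERR_PTR(-1UL), and the scan is reduced to the two tests under discussion.

```python
# Toy model of the scan order in select_bad_process(); not kernel code.
from dataclasses import dataclass

WAIT = "wait"  # stands in for ERR_PTR(-1UL): a kill is already in progress


@dataclass
class Task:
    name: str
    has_mm: bool      # False once p->mm has been set to NULL
    memdie: bool      # TIF_MEMDIE set by an earlier OOM kill
    exit_state: int   # non-zero once the task is a zombie or dead
    points: int


def select_original(tasks):
    """Skip !p->mm first, so a dying victim without an mm goes unnoticed."""
    best = None
    for t in tasks:
        if not t.has_mm:
            continue
        if t.memdie:
            return WAIT
        if best is None or t.points > best.points:
            best = t
    return best


def select_patched(tasks):
    """TIF_MEMDIE test moved ahead of the mm test, qualified by exit_state."""
    best = None
    for t in tasks:
        if t.memdie and not t.exit_state:
            return WAIT
        if not t.has_mm:
            continue
        if best is None or t.points > best.points:
            best = t
    return best


tasks = [
    Task("hog", has_mm=False, memdie=True, exit_state=0, points=900),
    Task("firefox", has_mm=True, memdie=False, exit_state=0, points=300),
]
print(select_original(tasks).name)  # firefox: a second victim is chosen
print(select_patched(tasks))        # wait: hold off until the hog has exited
```

The patched order avoids the collateral kill, but as the mail says, it waits forever if the hog never makes progress.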
2. I started out running my mlock test program as root (later
switched to use "ulimit -l unlimited" first). But badness() reckons
CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
and CAP_SYS_RAWIO another reason to quarter your points: so running
as root makes you sixteen times less likely to be killed. Quartering
is anyway debatable, but sixteenthing seems utterly excessive to me.
I moved the CAP_SYS_RAWIO test in with the others, so it does no
more than quartering; but is quartering appropriate anyway? I did
wonder if I was right to be "subverting" the fine-grained CAPs in
this way, but have since seen unrelated mail from one who knows
better, implying they're something of a fantasy, that su and sudo
are indeed what's used in the real world. Maybe this patch was okay.
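The arithmetic in point 2 can be shown with a small model. It is only a sketch: the function name and the use of capability names as strings are made up for illustration, and only the two quartering rules described above are modelled.

```python
# Toy model of the capability discounts described above; not kernel code.
def discounted_points(points, caps, merged=False):
    """Apply the quartering rules to a raw badness score.

    caps is a set of capability names held by the task. With merged=False
    the two tests are separate (the behaviour complained about); with
    merged=True CAP_SYS_RAWIO is folded into the first test.
    """
    admin_like = {"CAP_SYS_ADMIN", "CAP_SYS_RESOURCE"}
    if merged:
        if caps & (admin_like | {"CAP_SYS_RAWIO"}):
            points //= 4
        return points
    if caps & admin_like:
        points //= 4
    if "CAP_SYS_RAWIO" in caps:
        points //= 4
    return points


root_caps = {"CAP_SYS_ADMIN", "CAP_SYS_RESOURCE", "CAP_SYS_RAWIO"}
print(discounted_points(160000, set()))                   # 160000
print(discounted_points(160000, root_caps))               # 10000
print(discounted_points(160000, root_caps, merged=True))  # 40000
```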
3. badness() has a comment above it which says:
* 5) we try to kill the process the user expects us to kill, this
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
adds plenty of surprise there, by trying to factor children into the
calculation. Intended to deal with forkbombs, but any reasonable
process whose purpose is to fork children (e.g. gnome-session)
becomes very vulnerable. And whereas badness() itself goes on to
refine the total_vm points by various adjustments peculiar to the
process in question, those refinements have been ignored when
adding the child's total_vm/2. (Andrea does remark that he'd
rather have rewritten badness() from scratch.)
I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
part of the calculation up to select_bad_process(), making a
solo_badness() function which makes all those adjustments to
total_vm, then badness() itself a simple function adding half
the children's solo_badness()es to the process' own solo_badness().
But probably lots more needs doing - Andrea's rewrite?
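A rough sketch of the split described in point 3 follows. The names solo_badness() and badness() come from the mail; everything else is an assumption for illustration, in particular the single "privileged" flag, which stands in for all of the per-task adjustments.

```python
# Toy comparison of the existing child accounting with the proposed split.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    total_vm: int                 # in pages
    privileged: bool = False      # stands in for the per-task adjustments
    children: list = field(default_factory=list)


def old_badness(task):
    """Children add raw total_vm/2; only the parent's adjustments apply."""
    points = task.total_vm
    points += sum(child.total_vm // 2 for child in task.children)
    if task.privileged:
        points //= 4
    return points


def solo_badness(task):
    """Score of one task alone, with its own adjustments applied."""
    points = task.total_vm
    if task.privileged:
        points //= 4
    return points


def badness(task):
    """Own solo score plus half of the children's solo scores."""
    return solo_badness(task) + sum(solo_badness(c) for c in task.children) // 2


daemons = [Task(f"daemon{i}", 8000, privileged=True) for i in range(4)]
session = Task("session", 1000, children=daemons)
print(old_badness(session))  # 17000
print(badness(session))      # 5000
```

In this made-up case the small parent still inherits points from its children, but no longer inherits memory that the children's own adjustments would have discounted.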
4. In some cases those children are sharing exactly the same mm,
yet its total_vm is being added again and again to the points:
I had a nasty inner loop searching back to see if we'd already
counted this mm (but then, what if the different tasks sharing
the mm deserved different adjustments to the total_vm?).
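The double counting in point 4 can be illustrated as below. This sketch uses a set of already-seen mm identifiers rather than the inner loop searching back that the mail mentions, and it ignores the open question of per-task adjustments; the (mm_id, total_vm) pairs are invented.

```python
# Toy model: count each distinct mm once when adding children's total_vm/2.
def children_points_naive(children):
    """Add total_vm/2 for every child, even when children share one mm."""
    return sum(total_vm // 2 for _mm_id, total_vm in children)


def children_points_dedup(children):
    """Add total_vm/2 once per distinct mm."""
    seen = set()
    points = 0
    for mm_id, total_vm in children:
        if mm_id in seen:
            continue
        seen.add(mm_id)
        points += total_vm // 2
    return points


# Three threads-as-children sharing one mm, plus one ordinary child.
children = [("mmA", 1000), ("mmA", 1000), ("mmA", 1000), ("mmB", 400)]
print(children_points_naive(children))  # 1700
print(children_points_dedup(children))  # 700
```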
I hope these notes help someone towards a better solution
(and be prepared to discover more on the way). I agree with
Vedran that the present behaviour is pretty unimpressive, and
I'm puzzled as to how people can have been tinkering with
oom_kill.c down the years without seeing any of this.
Hugh
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-27 20:44 ` Hugh Dickins
@ 2009-10-27 21:04 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-27 21:04 UTC (permalink / raw)
To: Hugh Dickins
Cc: KAMEZAWA Hiroyuki, vedran.furac, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 27 Oct 2009, Hugh Dickins wrote:
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>
The heuristics that the oom killer uses in selecting a task seem to get
debated quite often.
What hasn't been mentioned is that total_vm does do a good job of
identifying tasks that are using far more memory than expected. That
seems to be the initial target: killing a rogue task that is hogging much
more memory than it should, probably because of a memory leak.
The latest approach seems to be focused more on killing the task that will
free the most resident memory. That is certainly understandable as a way to
avoid killing additional tasks later and to avoid subsequent page allocations
in the short term, but it doesn't help to kill the memory leaker.
There are advantages to either approach, but it depends on the contextual
goal of the oom killer when it's called: kill a rogue task that is
allocating more memory than expected, or kill a task that will free the
most memory.
> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> 	if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> 	    !p->exit_state)
> 		return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting for memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
I've proposed an oom killer timeout in the past which adds a jiffies count
to struct task_struct and will defer killing other tasks until the
predefined time limit (we use 10*HZ) has been exceeded. The problem is
that even if you kill another task, the expired task is highly unlikely
ever to exit at that point, and it is still holding a substantial amount
of memory: it had access to memory reserves and has still failed to exit.
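The timeout test amounts to a single comparison. The sketch below is only a model: HZ is an assumed tick rate, the function name is invented, and jiffies wraparound is ignored.

```python
# Toy model of an oom killer timeout; not the proposed patch itself.
HZ = 1000               # assumed tick rate; it is configurable in real kernels
OOM_TIMEOUT = 10 * HZ   # the 10*HZ limit mentioned above


def may_kill_another(victim_killed_at, now):
    """True once the earlier victim has used up its whole grace period."""
    return now - victim_killed_at > OOM_TIMEOUT


print(may_kill_another(5000, 5000 + 10 * HZ))      # False: still waiting
print(may_kill_another(5000, 5000 + 10 * HZ + 1))  # True: limit exceeded
```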
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
I think someone (Nick?) proposed a patch at one time that removed most of
the heuristics from select_bad_process() other than total_vm of the task
and its children, mems_allowed intersection, and oom_adj.
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
oom_kill_process() may not kill the task selected by select_bad_process();
it will first attempt to kill one of these children with a different mm.
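That preference for a child can be modelled in a few lines. This is a simplified sketch, not the kernel function: tasks are (name, mm_id) pairs with invented mm identifiers, and a kill attempt is assumed always to succeed.

```python
# Toy model of the child preference described above; not kernel code.
def pick_victim(selected, children):
    """Return the task that would actually be killed.

    The first child whose mm differs from the selected task's mm is
    killed in its place; otherwise the selected task itself is killed.
    """
    for child in children:
        if child[1] != selected[1]:
            return child
    return selected


parent = ("launcher", "mm1")
print(pick_victim(parent, [("thread", "mm1"), ("worker", "mm2")])[0])  # worker
print(pick_victim(parent, [("thread", "mm1")])[0])                     # launcher
```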
* Re: Memory overcommit
2009-10-27 21:04 ` David Rientjes
@ 2009-10-28 0:08 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-28 0:08 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> There are advantages to either approach, but it depends on the contextual
> goal of the oom killer when it's called: kill a rogue task that is
> allocating more memory than expected,
But it is wrong at counting allocated memory!
Come on, it kills /usr/lib/icedove/run-mozilla.sh: the parent, a shell
script, instead of the children which allocated the memory. Look, "test"
allocates some (0.1GB) memory, and you have:
% cat test.sh
#!/bin/sh
./test&
./test&
./test&
./test
% perl check_badness.pl|sort -n|g test
26511 7884 test
26511 7885 test
26511 7886 test
26511 7887 test
53994 7883 test.sh
// great, so test.sh "is" the bad ass, ok, emulate OOMK:
% kill -9 7883
// did we kill "a rogue task"
% perl check_badness.pl|sort -n|g test
26511 7884 test
26511 7885 test
26511 7886 test
26511 7887 test
// nooo, they are still alive and eating our memory!
QED by newbie. ;)
> or kill a task that will free the most memory.
.
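The scores listed in this message fit the children rule discussed earlier in the thread. The check below is only approximate: it assumes each child adds about half of its own reported score to the parent, whereas the kernel works from total_vm and applies further adjustments.

```python
# Back-of-envelope check of the scores listed in the message above.
child_scores = [26511, 26511, 26511, 26511]   # the four "test" processes
parent_score = 53994                          # test.sh

children_share = sum(score // 2 for score in child_scores)
own_share = parent_score - children_share
print(children_share)  # 53020
print(own_share)       # 974
```

Under that assumption nearly all of the score of test.sh comes from its children, and only about a thousand points from the shell itself.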
* Re: Memory overcommit
2009-10-28 0:08 ` Vedran Furač
@ 2009-10-28 0:25 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-28 0:25 UTC (permalink / raw)
To: vedran.furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, Vedran Furač wrote:
> But it is wrong at counting allocated memory!
> Come on, it kills /usr/lib/icedove/run-mozilla.sh. Parent, a shell
> script, instead of its child(s) which allocated memory. Look, "test"
> allocates some (0.1GB) memory, and you have:
>
> % cat test.sh
>
> #!/bin/sh
> ./test&
> ./test&
> ./test&
> ./test
>
> % perl check_badness.pl|sort -n|g test
>
> 26511 7884 test
> 26511 7885 test
> 26511 7886 test
> 26511 7887 test
> 53994 7883 test.sh
>
> // great, so test.sh "is" the bad ass, ok, emulate OOMK:
>
> % kill -9 7883
>
> // did we kill "a rogue task"
>
> % perl check_badness.pl|sort -n|g test
>
> 26511 7884 test
> 26511 7885 test
> 26511 7886 test
> 26511 7887 test
>
> // nooo, they are still alive and eating our memory!
>
This is wrong; it doesn't "emulate oom" since oom_kill_process() always
kills a child of the selected process instead if they do not share the
same memory. The chosen task in that case is untouched.
* Re: Memory overcommit
2009-10-28 0:25 ` David Rientjes
@ 2009-10-28 0:39 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-28 0:39 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> kills a child of the selected process instead if they do not share the
> same memory. The chosen task in that case is untouched.
OK, I stand corrected then. Thanks! But, while testing this I lost X
once again and "test" survived for some time (check the timestamps):
http://pastebin.com/d5c9d026e
- It started by killing gkrellm(!!!)
- Then I lost X (kdeinit4 I guess)
- Then 103 seconds after the killing started, it killed "test" - the
real culprit.
I mean... how?!
* Re: Memory overcommit
2009-10-28 0:39 ` Vedran Furač
@ 2009-10-28 4:08 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-28 4:08 UTC (permalink / raw)
To: vedran.furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, Vedran Furac wrote:
> > This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> > kills a child of the selected process instead if they do not share the
> > same memory. The chosen task in that case is untouched.
>
> OK, I stand corrected then. Thanks! But, while testing this I lost X
> once again and "test" survived for some time (check the timestamps):
>
> http://pastebin.com/d5c9d026e
>
> - It started by killing gkrellm(!!!)
> - Then I lost X (kdeinit4 I guess)
> - Then 103 seconds after the killing started, it killed "test" - the
> real culprit.
>
> I mean... how?!
>
Here are the five oom kills that occurred in your log, and notice that the
first four times it kills a child and not the actual task as I explained:
[97137.724971] Out of memory: kill process 21485 (VBoxSVC) score 1564940 or a child
[97137.725017] Killed process 21503 (VirtualBox)
[97137.864622] Out of memory: kill process 11141 (kdeinit4) score 1196178 or a child
[97137.864656] Killed process 11142 (klauncher)
[97137.888146] Out of memory: kill process 11141 (kdeinit4) score 1184308 or a child
[97137.888180] Killed process 11151 (ksmserver)
[97137.972875] Out of memory: kill process 11141 (kdeinit4) score 1146255 or a child
[97137.972888] Killed process 11224 (audacious2)
Those are practically happening simultaneously with very little memory
being available between each oom kill. Only later is "test" killed:
[97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
[97240.206832] Killed process 5005 (test)
Notice how the badness score is less than 1/4th of the others. So while
you may find it to be hogging a lot of memory, there were others that
consumed much more.
You can get a more detailed understanding of this by doing
echo 1 > /proc/sys/vm/oom_dump_tasks
before trying your testcase; it will show various information like the
total_vm and oom_adj value for each task at the time of oom (and the
actual badness score is exported per-task via /proc/<pid>/oom_score in
real-time). This will also include the rss and show what the end result
would be in using that value as part of the heuristic on this particular
workload compared to the current implementation.
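For a quick look at the exported scores, something like the following lists them in the same "score pid name" order as the output quoted earlier. It is a sketch under assumptions: it reads the task name from /proc/<pid>/comm, which newer kernels provide, and the proc_root parameter exists only so the function can be pointed at a test directory.

```python
# List per-process OOM scores from procfs, lowest first.
import os


def read_oom_scores(proc_root="/proc"):
    """Return sorted (score, pid, comm) tuples for every readable process."""
    rows = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return rows  # no procfs here
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited, or file missing or unreadable
        rows.append((score, int(entry), comm))
    return sorted(rows)


if __name__ == "__main__":
    for score, pid, comm in read_oom_scores():
        print(score, pid, comm)
```

The last lines of the output are the tasks the oom killer currently rates worst.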
* Re: Memory overcommit
2009-10-28 4:08 ` David Rientjes
@ 2009-10-28 4:55 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 4:55 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 27 Oct 2009 21:08:56 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 28 Oct 2009, Vedran Furac wrote:
>
> > > This is wrong; it doesn't "emulate oom" since oom_kill_process() always
> > > kills a child of the selected process instead if they do not share the
> > > same memory. The chosen task in that case is untouched.
> >
> > OK, I stand corrected then. Thanks! But, while testing this I lost X
> > once again and "test" survived for some time (check the timestamps):
> >
> > http://pastebin.com/d5c9d026e
> >
> > - It started by killing gkrellm(!!!)
> > - Then I lost X (kdeinit4 I guess)
> > - Then 103 seconds after the killing started, it killed "test" - the
> > real culprit.
> >
> > I mean... how?!
> >
>
> Here are the five oom kills that occurred in your log, and notice that the
> first four times it kills a child and not the actual task as I explained:
>
> [97137.724971] Out of memory: kill process 21485 (VBoxSVC) score 1564940 or a child
> [97137.725017] Killed process 21503 (VirtualBox)
> [97137.864622] Out of memory: kill process 11141 (kdeinit4) score 1196178 or a child
> [97137.864656] Killed process 11142 (klauncher)
> [97137.888146] Out of memory: kill process 11141 (kdeinit4) score 1184308 or a child
> [97137.888180] Killed process 11151 (ksmserver)
> [97137.972875] Out of memory: kill process 11141 (kdeinit4) score 1146255 or a child
> [97137.972888] Killed process 11224 (audacious2)
>
> Those are practically happening simultaneously with very little memory
> being available between each oom kill. Only later is "test" killed:
>
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
>
> Notice how the badness score is less than 1/4th of the others. So while
> you may find it to be hogging a lot of memory, there were others that
> consumed much more.
This is not related to the child-parent problem.
Look at these numbers more closely:
==
[97137.709272] Active_anon:671487 active_file:82 inactive_anon:132316
[97137.709273] inactive_file:82 unevictable:50 dirty:0 writeback:0 unstable:0
[97137.709273] free:6122 slab:17179 mapped:30661 pagetables:8052 bounce:0
==
active_file + inactive_file is very low. Almost all pages are anon.
But "mapped" (NR_FILE_MAPPED) is a little high. This implies that the remaining
file cache is mapped by many processes, OR that some megabytes of shmem are in use.
The number of pagetable pages is 8052, which means
8052 x (4096/8) x 4k bytes = ~16 Gbytes of mapped area.
Total available memory is close to active/inactive + slab:
(671487+82+132316+82+50+6122+17179+8052) = 835370 pages x 4k = ~3.2 Gbytes?
(this system is swapless)
Then, considering the pmap output KOSAKI showed,
I guess the killed tasks had a big total_vm but not much real rss,
so killing them did not help the oom situation.
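The arithmetic above can be reproduced with a short sketch. It assumes
4 KiB pages and 8-byte page-table entries; the counter values are copied
from the dump, while the helper names are made up for illustration. The
mapped-area figure is only a rough upper bound, since it treats every
page-table page as a fully populated last-level table.

```python
PAGE_SIZE = 4096   # bytes per page (assumption: 4 KiB pages)
PTE_SIZE = 8       # bytes per page-table entry (assumption: 64-bit entries)
GIB = 1024 ** 3

# Counters copied from the [97137.709272] dump, all in pages.
counters = {
    "active_anon": 671487,
    "active_file": 82,
    "inactive_anon": 132316,
    "inactive_file": 82,
    "unevictable": 50,
    "free": 6122,
    "slab": 17179,
    "pagetables": 8052,
}


def mapped_bytes(pagetable_pages):
    """Rough upper bound on the address space these page-table pages can map."""
    ptes_per_page = PAGE_SIZE // PTE_SIZE   # 512 entries per page-table page
    return pagetable_pages * ptes_per_page * PAGE_SIZE


def total_bytes(pages_by_counter):
    """Memory accounted for by the listed counters, in bytes."""
    return sum(pages_by_counter.values()) * PAGE_SIZE


print(f"mapped area:      {mapped_bytes(counters['pagetables']) / GIB:.1f} GiB")
print(f"accounted memory: {total_bytes(counters) / GIB:.1f} GiB")
```

The gap between the two results is the point: far more address space is
mapped than there is memory to back it.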
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 4:55 ` KAMEZAWA Hiroyuki
@ 2009-10-28 5:13 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-28 5:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
> This is not related to the child-parent problem.
>
> Look at these numbers more closely:
> ==
> [97137.709272] Active_anon:671487 active_file:82 inactive_anon:132316
> [97137.709273] inactive_file:82 unevictable:50 dirty:0 writeback:0 unstable:0
> [97137.709273] free:6122 slab:17179 mapped:30661 pagetables:8052 bounce:0
> ==
>
> active_file + inactive_file is very low. Almost all pages are anon.
> But "mapped" (NR_FILE_MAPPED) is a little high. This implies that the remaining
> file cache is mapped by many processes, OR that some megabytes of shmem are in use.
>
> The number of pagetable pages is 8052, which means
> 8052 x (4096/8) x 4k bytes = ~16 Gbytes of mapped area.
>
> Total available memory is close to active/inactive + slab:
> (671487+82+132316+82+50+6122+17179+8052) = 835370 pages x 4k = ~3.2 Gbytes?
> (this system is swapless)
>
Yep:
[97137.724965] 917504 pages RAM
[97137.724967] 69721 pages reserved
(917504 - 69721) * 4K = ~3.23G
> Then, considering the pmap output KOSAKI showed,
> I guess the killed tasks had a big total_vm but not much real rss,
> so killing them did not help the oom situation.
>
echo 1 > /proc/sys/vm/oom_dump_tasks can confirm that.
The bigger issue is making the distinction between killing a rogue task
that is using much more memory than expected (the supposed current
behavior, influenced from userspace by /proc/pid/oom_adj), and killing the
task with the highest rss. The latter is definitely desired if we are
allocating tons of memory but reduces the ability of the user to influence
the badness score.
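
The total_vm versus rss distinction can also be checked per task from
userspace. The sketch below is only an illustration: it parses the VmSize
and VmRSS fields of /proc/<pid>/status text, and the function name and the
sample values are hypothetical (the sample is not taken from Vedran's log).

```python
def vm_size_and_rss(status_text):
    """Return (VmSize, VmRSS) in kB parsed from /proc/<pid>/status text.

    Fields that are absent (for example on kernel threads) come back as 0.
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(value.split()[0])   # value looks like "  1234 kB"
    return fields.get("VmSize", 0), fields.get("VmRSS", 0)


# Illustrative sample: a large mapping with very little resident memory.
sample = "Name:\texample\nVmSize:\t 1048576 kB\nVmRSS:\t    6144 kB\n"
size_kb, rss_kb = vm_size_and_rss(sample)
print(f"VmSize={size_kb} kB VmRSS={rss_kb} kB")
```

On a live system the text would come from reading /proc/<pid>/status for
the task of interest.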
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 5:13 ` David Rientjes
@ 2009-10-28 6:05 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 6:05 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 27 Oct 2009 22:13:44 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> Yep:
>
> [97137.724965] 917504 pages RAM
> [97137.724967] 69721 pages reserved
>
> (917504 - 69721) * 4K = ~3.23G
>
> > Then, considering the pmap output KOSAKI showed,
> > I guess the killed tasks had a big total_vm but not much real rss,
> > so killing them did not help the oom situation.
> >
>
> echo 1 > /proc/sys/vm/oom_dump_tasks can confirm that.
>
yes.
> The bigger issue is making the distinction between killing a rogue task
> that is using much more memory than expected (the supposed current
> behavior, influenced from userspace by /proc/pid/oom_adj), and killing the
> task with the highest rss.
All kernel engineers know that the kernel can never tell whether usage is
"more than expected" or not.
So the oom_adj workaround is used now (by some special users).
The OOM killer itself is a workaround, too.
"No kill" is the best outcome, but we know there tend to be memory leakers on bad
systems, and not every system in this world is perfect.
From the kernel's view, there is no difference between a rogue task and the
highest-rss task. As a heuristic, "time" is used now, but it's not very trustworthy.
> The latter is definitely desired if we are
> allocating tons of memory but reduces the ability of the user to influence
> the badness score.
>
Yes, some more trustworthy values other than vmsize/rss/time would be appreciated.
I wonder whether recent memory consumption speed could be another key value.
Anyway, the current behavior of "killing X" is a bad thing.
We need some fixes.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 6:05 ` KAMEZAWA Hiroyuki
@ 2009-10-28 6:17 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-28 6:17 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
> All kernel engineers know that the kernel can never tell whether usage is
> "more than expected" or not.
> So the oom_adj workaround is used now (by some special users).
> The OOM killer itself is a workaround, too.
> "No kill" is the best outcome, but we know there tend to be memory leakers on bad
> systems, and not every system in this world is perfect.
>
Right, and historically that has been addressed by considering total_vm
and adjusting it with oom_adj so that we can identify memory leaking tasks
through user-defined criteria.
> Yes, some more trustworthy values other than vmsize/rss/time would be appreciated.
> I wonder whether recent memory consumption speed could be another key value.
>
Sounds very logical.
> Anyway, the current behavior of "killing X" is a bad thing.
> We need some fixes.
>
You can easily protect X with OOM_DISABLE, as you know. I don't think we
need any X-specific heuristics added to the kernel; it looks like the
special cases have already polluted badness() enough.
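
As a minimal sketch of what protecting a task from userspace looks like on
the kernels discussed here, the snippet below writes a value to
/proc/<pid>/oom_adj. It assumes OOM_DISABLE is -17 and that the adjustable
range is -16..15; the function name and the proc_root parameter are
hypothetical, the latter existing only so the code can be tried against a
scratch directory.

```python
import os

OOM_DISABLE = -17       # assumed value that exempts a task from the oom killer
OOM_ADJUST_MIN = -16    # assumed lower bound of the adjustable range
OOM_ADJUST_MAX = 15     # assumed upper bound of the adjustable range


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an oom_adj value for pid and return the path that was written.

    Lowering the value of a real task normally needs elevated privileges.
    """
    if value != OOM_DISABLE and not OOM_ADJUST_MIN <= value <= OOM_ADJUST_MAX:
        raise ValueError(f"oom_adj out of range: {value}")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write(f"{value}\n")
    return path
```

A session script could call set_oom_adj(x_pid, OOM_DISABLE) once the X
server has started, where x_pid is whatever pid the script has recorded.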
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 6:17 ` David Rientjes
@ 2009-10-28 6:20 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 6:20 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 27 Oct 2009 23:17:41 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > All kernel engineers know that the kernel can never tell whether usage is
> > "more than expected" or not.
> > So the oom_adj workaround is used now (by some special users).
> > The OOM killer itself is a workaround, too.
> > "No kill" is the best outcome, but we know there tend to be memory leakers on bad
> > systems, and not every system in this world is perfect.
> >
>
> Right, and historically that has been addressed by considering total_vm
> and adjusting it with oom_adj so that we can identify memory leaking tasks
> through user-defined criteria.
>
> > Yes, some more trustworthy values other than vmsize/rss/time would be appreciated.
> > I wonder whether recent memory consumption speed could be another key value.
> >
>
> Sounds very logical.
>
> > Anyway, the current behavior of "killing X" is a bad thing.
> > We need some fixes.
> >
>
> You can easily protect X with OOM_DISABLE, as you know. I don't think we
> need any X-specific heuristics added to the kernel; it looks like the
> special cases have already polluted badness() enough.
>
It's _not_ specific to X.
Almost any application that uses many dynamic libraries can be affected by this
reliance on total_vm. And, as I explained to Vedran, a multi-threaded program such
as a Java application can easily increase total_vm without using much anon_rss.
That is the reason I hate overcommit_memory: the size of the VM doesn't tell us anything.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 6:20 ` KAMEZAWA Hiroyuki
@ 2009-10-29 8:38 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-29 8:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, KAMEZAWA Hiroyuki wrote:
> It's _not_ specific to X.
>
> Almost any application that uses many dynamic libraries can be affected by this
> reliance on total_vm. And, as I explained to Vedran, a multi-threaded program such
> as a Java application can easily increase total_vm without using much anon_rss.
> That is the reason I hate overcommit_memory: the size of the VM doesn't tell us anything.
>
Right, because Vedran's latest oom log shows that Xorg is preferred over
any thread other than the memory-hogging test program more with your patch
than without. I pointed out a clear distinction in the killing order using
both total_vm and rss in that log, and in my opinion killing Xorg as
opposed to krunner would be undesirable.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 8:38 ` David Rientjes
@ 2009-10-29 11:11 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-29 11:11 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> Right, because Vedran's latest oom log shows that Xorg is preferred over
> any thread other than the memory-hogging test program more with your patch
> than without. I pointed out a clear distinction in the killing order using
> both total_vm and rss in that log, and in my opinion killing Xorg as
> opposed to krunner would be undesirable.
But then you should rename OOM killer to TRIPK:
Totally Random Innocent Process Killer
If you have an OOM situation and Xorg is first, that means it's leaking
memory badly and the system is probably already frozen/FUBAR. Killing
krunner in that situation wouldn't do any good. From a user's perspective,
nothing changes: the system is still FUBAR and (s)he would probably reboot,
cursing Linux in the process.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 11:11 ` Vedran Furač
@ 2009-10-29 19:53 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-29 19:53 UTC (permalink / raw)
To: vedran.furac
Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Thu, 29 Oct 2009, Vedran Furac wrote:
> But then you should rename OOM killer to TRIPK:
> Totally Random Innocent Process Killer
>
The randomness here is the order of the child list when the oom killer
selects a task, based on the badness score, and then tries to kill a child
with a different mm before the parent.
The problem you identified in http://pastebin.com/f3f9674a0, however, is a
forkbomb issue where the badness score should never have been so high for
kdeinit4 compared to "test". That's directly proportional to adding the
scores of all disjoint child total_vm values into the badness score for
the parent and then killing the children instead.
That's the problem, not using total_vm as a baseline. Replacing that with
rss is not going to solve the issue and reducing the user's ability to
specify a rough oom priority from userspace is simply not an option.
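
To make the child-summing effect concrete, here is a deliberately
simplified model of the scoring described above. It is a sketch under
stated assumptions, not the kernel's badness(): it keeps only the total_vm
baseline, the half-of-each-disjoint-child term and the oom_adj shift, and
leaves out the CPU time, run time, niceness and capability adjustments. The
Task type, mm_id field and all numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    mm_id: int                      # stand-in for the task's mm
    total_vm: int                   # virtual size in pages
    oom_adj: int = 0                # userspace adjustment
    children: List["Task"] = field(default_factory=list)


def badness(task: Task) -> int:
    """Simplified score: own total_vm plus half of each disjoint child's."""
    points = task.total_vm
    for child in task.children:
        if child.mm_id != task.mm_id:           # only children with their own mm
            points += child.total_vm // 2 + 1
    if task.oom_adj > 0:
        points = max(points, 1) << task.oom_adj
    elif task.oom_adj < 0:
        points >>= -task.oom_adj
    return points


# A small parent with many children versus one large task (made-up sizes).
children = [Task(f"child{i}", mm_id=100 + i, total_vm=60000) for i in range(30)]
parent = Task("parent", mm_id=1, total_vm=20000, children=children)
hog = Task("hog", mm_id=2, total_vm=650000)
flagged_hog = Task("hog", mm_id=2, total_vm=650000, oom_adj=3)

for t in (parent, hog, flagged_hog):
    print(t.name, t.oom_adj, badness(t))
```

In this model the parent outscores the large task purely through its
children, while a positive oom_adj on the large task is enough to put it
back on top.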
> If you have an OOM situation and Xorg is first, that means it's leaking
> memory badly and the system is probably already frozen/FUBAR. Killing
> krunner in that situation wouldn't do any good. From a user's perspective,
> nothing changes: the system is still FUBAR and (s)he would probably reboot,
> cursing Linux in the process.
>
It depends on what you're running; we need to have the option
of protecting very large tasks on production servers. Imagine that "test"
here is actually a critical application that we need to protect, that it's
not solely mlocked anonymous memory, but that we'd still want to kill it if
it leaks memory beyond your approximate 2.5GB. How do you do that when using
rss as the baseline?
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 19:53 ` David Rientjes
@ 2009-10-29 23:48 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-29 23:48 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Thu, 29 Oct 2009 12:53:42 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > If you have an OOM situation and Xorg is first, that means it's leaking
> > memory badly and the system is probably already frozen/FUBAR. Killing
> > krunner in that situation wouldn't do any good. From a user's perspective,
> > nothing changes: the system is still FUBAR and (s)he would probably reboot,
> > cursing Linux in the process.
> >
>
> It depends on what you're running; we need to have the option
> of protecting very large tasks on production servers. Imagine that "test"
> here is actually a critical application that we need to protect, that it's
> not solely mlocked anonymous memory, but that we'd still want to kill it if
> it leaks memory beyond your approximate 2.5GB. How do you do that when using
> rss as the baseline?
As I wrote repeatedly,
- The OOM killer itself is a bad thing, a sign of a bad situation.
- The kernel can't know whether a program is bad or not; it can only guess.
- So there is no "correct" OOM killer other than a fork-bomb killer.
- The user has a knob, oom_adj. This is very strong.
So there is only a "reasonable" or "easy-to-understand" OOM kill.
"The current biggest memory eater is killed" sounds reasonable and easy to
understand. And if total_vm works well, overcommit_guess should catch it.
Please improve overcommit_guess if you want to stay with total_vm.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 23:48 ` KAMEZAWA Hiroyuki
@ 2009-10-30 9:10 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-30 9:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:
> As I wrote repeatedly,
>
> - OOM-Killer itselfs is bad thing, bad situation.
Not necessarily: the memory controller and cpusets use it quite often to
enforce their policies, and that is standard runtime behavior. We'd like to
imagine that our cpuset will never be too small to run all the attached
jobs, but that happens and we can easily recover from it by killing a
task.
> - The kernel can't know the program is bad or not. just guess it.
Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
can tell the kernel what we'd like the oom killer behavior to be if
the situation arises.
> - Then, there is no "correct" OOM-Killer other than fork-bomb killer.
Well of course there is, you're seeing this in a WAY too simplistic
manner. If we are oom, we want to be able to influence how the oom killer
behaves and respond to that situation. You are proposing that we change
the baseline for how the oom killer selects tasks which we use CONSTANTLY
as part of our normal production environment. I'd appreciate it if you'd
take it a little more seriously.
> - User has a knob as oom_adj. This is very strong.
>
Agreed.
> Then, there is only "reasonable" or "easy-to-understand" OOM-Kill.
> "Current biggest memory eater is killed" sounds reasonable, easy to
> understand. And if total_vm works well, overcommit_guess should catch it.
> Please improve overcommit_guess if you want to stay on total_vm.
>
I don't necessarily want to stay on total_vm, but I also don't want to
move to rss as a baseline, as you would probably agree.
We disagree about a very fundamental principle: you are coming from a
perspective of always wanting to kill the biggest resident memory eater
even for a single order-0 allocation that fails and I'm coming from a
perspective of wanting to ensure that our machines know how the oom killer
will react when it is used. Moving to rss reduces the ability of the user
to specify an expected oom priority other than polarizing it by either
disabling it completely with an oom_adj value of -17 or choosing the
definite next victim with +15. That's my objection to it: the user cannot
possibly be expected to predict what proportion of each application's
memory will be resident at the time of oom.
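The predictability concern can be illustrated with a toy model. This is a minimal sketch, assuming a score that is simply a baseline shifted by oom_adj; the task names, sizes (in MB) and helper functions are invented for illustration and are not kernel code. It only shows why a victim chosen from rss depends on what happens to be resident at the time of oom; it does not say which baseline is better.

```python
# Toy model: how the choice of baseline affects which task is picked.
# All names and numbers are illustrative assumptions, not kernel symbols.

def score(baseline_mb: int, oom_adj: int) -> int:
    """Shift the baseline left for positive oom_adj, right for negative."""
    if oom_adj >= 0:
        return baseline_mb << oom_adj
    return baseline_mb >> -oom_adj


def pick_victim(tasks: list, baseline: str) -> str:
    """Return the name of the task with the highest adjusted score."""
    return max(tasks, key=lambda t: score(t[baseline], t["oom_adj"]))["name"]


def snapshot(db_rss_mb: int) -> list:
    """Two tasks whose total_vm is fixed; only the resident size of db varies."""
    return [
        {"name": "db", "total_vm": 8192, "rss": db_rss_mb, "oom_adj": -2},
        {"name": "batch", "total_vm": 3072, "rss": 1000, "oom_adj": 0},
    ]


busy = snapshot(db_rss_mb=6000)   # db has most of its mapping resident
idle = snapshot(db_rss_mb=2000)   # db has been mostly swapped out

# With total_vm the outcome is the same in both snapshots.
by_total_vm = [pick_victim(s, "total_vm") for s in (busy, idle)]

# With rss the same oom_adj setting gives a different victim per snapshot.
by_rss = [pick_victim(s, "rss") for s in (busy, idle)]
```

In this model the administrator's oom_adj of -2 protects "db" in both snapshots under total_vm, but only in the second one under rss.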
I understand you want to totally rewrite the oom killer for whatever
reason, but I think you need to spend a lot more time understanding the
needs that the Linux community has for its behavior instead of insisting
on your point of view.
* Re: Memory overcommit
2009-10-30 9:10 ` David Rientjes
@ 2009-10-30 9:36 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-30 9:36 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Fri, 30 Oct 2009 02:10:37 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > - The kernel can't know the program is bad or not. just guess it.
>
> Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> can tell the kernel what we'd like the oom killer behavior should be if
> the situation arises.
>
My point is that the server cannot distinguish a memory leak from intentional
memory usage. Nothing more than that.
> > - Then, there is no "correct" OOM-Killer other than fork-bomb killer.
>
> Well of course there is, you're seeing this is a WAY too simplistic
> manner. If we are oom, we want to be able to influence how the oom killer
> behaves and respond to that situation. You are proposing that we change
> the baseline for how the oom killer selects tasks which we use CONSTANTLY
> as part of our normal production environment. I'd appreciate it if you'd
> take it a little more seriously.
>
Yes, I'm serious.
This summer, at lunch with a daily Linux user, I was told,
"You enterprise guys don't consider desktop or laptop problems at all."
Yes, I use only servers. My customers use servers, too. My first priority
is always server users.
But this time, I wrote a reply to Vedran and am trying to fix the desktop problem.
Even if the current logic works well for servers, the "KDE/GNOME is killed" problem
seems to be serious. And this may be a problem for embedded people, I guess.
> > - User has a knob as oom_adj. This is very strong.
> >
>
> Agreed.
>
This and memcg are very useful. But everyone says "bad workaround" ;(
Maybe only servers can use these functions.
> > Then, there is only "reasonable" or "easy-to-understand" OOM-Kill.
> > "Current biggest memory eater is killed" sounds reasonable, easy to
> > understand. And if total_vm works well, overcommit_guess should catch it.
> > Please improve overcommit_guess if you want to stay on total_vm.
> >
>
> I don't necessarily want to stay on total_vm, but I also don't want to
> move to rss as a baseline, as you would probably agree.
>
I'll rewrite it all. I won't rely only on rss. There are several situations,
and we need more information than we have now. I'll have to implement
ways to gather that information before changing badness.
> We disagree about a very fundamental principle: you are coming from a
> perspective of always wanting to kill the biggest resident memory eater
> even for a single order-0 allocation that fails and I'm coming from a
> perspective of wanting to ensure that our machines know how the oom killer
> will react when it is used.
yes.
> Moving to rss reduces the ability of the user to specify an expected oom
> priority other than polarizing it by either
> disabling it completely with an oom_adj value of -17 or choosing the
> definite next victim with +15. That's my objection to it: the user cannot
> possibly be expected to predict what proportion of each application's
> memory will be resident at the time of oom.
>
I can say the same thing about total_vm. total_vm doesn't carry any
useful information for an oom situation, and tweaking based on that
parameter will make things worse.
For oom_adj tuning, we may need a technique other than "shift".
If I had written oom_adj, I would have written it as
/proc/<pid>/guarantee_nooom_size
# echo 3G > /proc/<pid>/guarantee_nooom_size
Then, 3G bytes of this process's memory usage would not be counted toward badness.
I'm not sure I can add a new interface or replace oom_adj now.
But to do this, the current children's score problem etc. should be fixed.
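The difference between a "shift" knob and a size guarantee can be sketched with a small model. This is only an illustration under stated assumptions: the function names are invented, sizes are in bytes rather than the kernel's internal units, treating OOM_DISABLE as a zero score is a simplification, and guarantee_nooom_size is the interface proposed above, not an existing one.

```python
# Simplified model of two ways a per-process knob could change an OOM
# badness score. All names here are illustrative, not kernel symbols.

OOM_DISABLE = -17  # oom_adj value that exempts a task from OOM killing


def badness_with_shift(points: int, oom_adj: int) -> int:
    """Scale the score by a power of two, as an oom_adj-style bit shift."""
    if oom_adj == OOM_DISABLE:
        return 0  # simplification: model "never kill" as a zero score
    if oom_adj > 0:
        # A zero score is bumped to 1 so the shift still has an effect.
        return max(points, 1) << oom_adj
    return points >> -oom_adj


def badness_with_guarantee(usage_bytes: int, guarantee_bytes: int) -> int:
    """Charge only the memory used beyond a guaranteed size (hypothetical)."""
    return max(usage_bytes - guarantee_bytes, 0)


GIB = 1 << 30

# A 4 GiB task with oom_adj = -2 is scored as if it were 1 GiB.
shifted = badness_with_shift(4 * GIB, -2)

# The same task with a 3 GiB guarantee is charged for the 1 GiB excess,
# and a task that stays under its guarantee is charged nothing.
over = badness_with_guarantee(4 * GIB, 3 * GIB)
under = badness_with_guarantee(2 * GIB, 3 * GIB)
```

The shift scales whatever the baseline happens to be, so its effect depends on the task's size; the guarantee subtracts an absolute amount that the user states in advance.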
> I understand you want to totally rewrite the oom killer for whatever
> reason, but I think you need to spend a lot more time understanding the
> needs that the Linux community has for its behavior instead of insisting
> on your point of view.
>
Yes, I'll take more time. I don't think all of the changes can be done quickly.
To be honest, this is part of the work to implement a "custom oom handler" cgroup.
Before going further, I'd like to fix the current problem.
Thanks,
-Kame
* Re: Memory overcommit
2009-10-30 9:36 ` KAMEZAWA Hiroyuki
(?)
@ 2009-10-30 10:49 ` Thomas Fjellstrom
-1 siblings, 0 replies; 128+ messages in thread
From: Thomas Fjellstrom @ 2009-10-30 10:49 UTC (permalink / raw)
To: linux-kernel
On Fri October 30 2009, KAMEZAWA Hiroyuki wrote:
> On Fri, 30 Oct 2009 02:10:37 -0700 (PDT)
>
> David Rientjes <rientjes@google.com> wrote:
> > > - The kernel can't know the program is bad or not. just guess it.
> >
> > Totally irrelevant, given your fourth point about /proc/pid/oom_adj.
> > We can tell the kernel what we'd like the oom killer behavior should be
> > if the situation arises.
>
> My point is that the server cannot distinguish memory leak from
> intentional memory usage. No other than that.
>
> > > - Then, there is no "correct" OOM-Killer other than fork-bomb
> > > killer.
> >
> > Well of course there is, you're seeing this is a WAY too simplistic
> > manner. If we are oom, we want to be able to influence how the oom
> > killer behaves and respond to that situation. You are proposing that
> > we change the baseline for how the oom killer selects tasks which we
> > use CONSTANTLY as part of our normal production environment. I'd
> > appreciate it if you'd take it a little more seriously.
>
> Yes, I'm serious.
>
> In this summer, at lunch with a daily linux user, I was said
> "you, enterprise guys, don't consider desktop or laptop problem at all."
> yes, I use only servers. My customer uses server, too. My first priority
> is always on server users.
> But, for this time, I wrote reply to Vedran and try to fix desktop
> problem. Even if current logic works well for servers, "KDE/GNOME is
> killed" problem seems to be serious. And this may be a problem for
> EMBEDED people, I guess.
What's worse is that a friend of mine gets stuck with a useless machine for a
couple of hours or more when the OOM killer tries to do its thing. It swap
storms for hours. Not a good thing, IMO.
[snip]
--
Thomas Fjellstrom
tfjellstrom@shaw.ca
* Re: Memory overcommit
2009-10-30 9:36 ` KAMEZAWA Hiroyuki
@ 2009-11-03 20:49 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-11-03 20:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > > - The kernel can't know the program is bad or not. just guess it.
> >
> > Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> > can tell the kernel what we'd like the oom killer behavior should be if
> > the situation arises.
> >
>
> My point is that the server cannot distinguish memory leak from intentional
> memory usage. No other than that.
>
That's a different point. Today, we can influence the badness score of
any user thread to prioritize oom killing from userspace and that can be
done regardless of whether there's a memory leaker, a fork bomber, etc.
The priority based oom killing is important to production scenarios and
cannot be replaced by a heuristic that works every time if it cannot be
influenced by userspace.
A spike in memory consumption when a process is initially forked would be
defined as a memory leaker in your quiet_time model.
> In this summer, at lunch with a daily linux user, I was said
> "you, enterprise guys, don't consider desktop or laptop problem at all."
> yes, I use only servers. My customer uses server, too. My first priority
> is always on server users.
> But, for this time, I wrote reply to Vedran and try to fix desktop problem.
> Even if current logic works well for servers, "KDE/GNOME is killed" problem
> seems to be serious. And this may be a problem for EMBEDED people, I guess.
>
You argued before that the problem wasn't specific to X (after I said you
could protect it very trivially with /proc/pid/oom_adj set to
OOM_DISABLE), but that's now your reasoning for rewriting the oom killer
heuristics?
> I can say the same thing to total_vm size. total_vm size doesn't include any
> good information for oom situation. And tweaking based on that not-useful
> parameter will make things worse.
>
Tweaking on the heuristic will probably make it more convoluted and
overall worse, I agree. But it's a more stable baseline than rss from
which we can set oom killing priorities from userspace.
* Re: Memory overcommit
2009-11-03 20:49 ` David Rientjes
@ 2009-11-04 0:50 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-04 0:50 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 3 Nov 2009 12:49:52 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> On Fri, 30 Oct 2009, KAMEZAWA Hiroyuki wrote:
>
> > > > - The kernel can't know the program is bad or not. just guess it.
> > >
> > > Totally irrelevant, given your fourth point about /proc/pid/oom_adj. We
> > > can tell the kernel what we'd like the oom killer behavior should be if
> > > the situation arises.
> > >
> >
> > My point is that the server cannot distinguish memory leak from intentional
> > memory usage. No other than that.
> >
>
> That's a different point. Today, we can influence the badness score of
> any user thread to prioritize oom killing from userspace and that can be
> done regardless of whether there's a memory leaker, a fork bomber, etc.
> The priority based oom killing is important to production scenarios and
> cannot be replaced by a heuristic that works everytime if it cannot be
> influenced by userspace.
>
I haven't removed oom_adj...
> A spike in memory consumption when a process is initially forked would be
> defined as a memory leaker in your quiet_time model.
>
I'll rewrite or drop quiet_time.
> > In this summer, at lunch with a daily linux user, I was said
> > "you, enterprise guys, don't consider desktop or laptop problem at all."
> > yes, I use only servers. My customer uses server, too. My first priority
> > is always on server users.
> > But, for this time, I wrote reply to Vedran and try to fix desktop problem.
> > Even if current logic works well for servers, "KDE/GNOME is killed" problem
> > seems to be serious. And this may be a problem for EMBEDED people, I guess.
> >
>
> You argued before that the problem wasn't specific to X (after I said you
> could protect it very trivially with /proc/pid/oom_adj set to
> OOM_DISABLE), but that's now your reasoning for rewriting the oom killer
> heuristics?
>
It's one of the reasons. My customers always suffer from the "OOM-RANDOM-KILLER".
I mentioned the "lunch" to say that I'm not working _only_ for servers.
OK?
> > I can say the same thing to total_vm size. total_vm size doesn't include any
> > good information for oom situation. And tweaking based on that not-useful
> > parameter will make things worse.
> >
>
> Tweaking on the heuristic will probably make it more convoluted and
> overall worse, I agree. But it's a more stable baseline than rss from
> which we can set oom killing priorities from userspace.
- "rss < total_vm_size" always.
- The oom_adj calculation is quite strong.
- total_vm of processes that map hugetlb is very big... but killing them
is no help for a usual oom.
I recommend that you add a "stable baseline" knob for user space, as I wrote.
My patch 6 adds a stable baseline bonus of 50% of vm size if run_time is
large enough.
If users can estimate how their processes use memory, that will be a good thing.
I'll add something other than oom_adj (I'm not saying I'll drop oom_adj).
Thanks,
-Kame
* Re: Memory overcommit
2009-11-04 0:50 ` KAMEZAWA Hiroyuki
@ 2009-11-04 1:58 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-11-04 1:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
> > That's a different point. Today, we can influence the badness score of
> > any user thread to prioritize oom killing from userspace and that can be
> > done regardless of whether there's a memory leaker, a fork bomber, etc.
> > The priority based oom killing is important to production scenarios and
> > cannot be replaced by a heuristic that works everytime if it cannot be
> > influenced by userspace.
> >
> I don't removed oom_adj...
>
Right, but we must ensure that we have the same ability to influence a
priority based oom killing scheme from userspace as we currently do with a
relatively static total_vm. total_vm may not be the optimal baseline, but
it does allow users to tune oom_adj specifically to identify tasks that
are using more memory than expected, and it is static enough not to depend
on rss, for example, which is really hard to predict at the time of oom.
That's actually my main goal in this discussion: to avoid losing any
ability of userspace to influence the priority of tasks being oom killed
(if you haven't noticed :).
> > Tweaking on the heuristic will probably make it more convoluted and
> > overall worse, I agree. But it's a more stable baseline than rss from
> > which we can set oom killing priorities from userspace.
>
> - "rss < total_vm_size" always.
But rss is much more dynamic than total_vm, that's my point.
> - oom_adj culculation is quite strong.
> - total_vm of processes which maps hugetlb is very big ....but killing them
> is no help for usual oom.
>
> I recommend you to add "stable baseline" knob for user space, as I wrote.
> My patch 6 adds stable baseline bonus as 50% of vm size if run_time is enough
> large.
>
There's no clear relationship between VM size and runtime. The forkbomb
heuristic itself could easily return a badness of ULONG_MAX if one is
detected using runtime and number of children, as I earlier proposed, but
that doesn't seem helpful to factor into the scoring.
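A forkbomb check of that kind could be sketched as a separate short-circuit in front of the size-based score. This is a hypothetical model only: the thresholds (1000 children, 1 second of runtime) and every name below are assumptions made for illustration, and the thread does not specify how the proposed heuristic would actually be written.

```python
# Hypothetical sketch: detect a forkbomb from runtime and child count,
# and keep that check separate from the size-based score.
from dataclasses import dataclass, field

ULONG_MAX = 2**64 - 1  # value of ULONG_MAX where unsigned long is 64 bits


@dataclass
class Task:
    name: str
    total_vm: int        # size-based baseline, arbitrary units
    runtime_s: float     # how long the task has been running
    children: list = field(default_factory=list)


def looks_like_forkbomb(task: Task,
                        max_young_children: int = 1000,
                        young_s: float = 1.0) -> bool:
    """Assumed rule: too many children that have barely started running."""
    young = sum(1 for child in task.children if child.runtime_s < young_s)
    return young > max_young_children


def badness(task: Task) -> int:
    """Return the maximum score for a detected forkbomb, else the baseline."""
    if looks_like_forkbomb(task):
        return ULONG_MAX
    return task.total_vm


bomb = Task("bomb", 100, 5.0,
            [Task("child", 100, 0.1) for _ in range(1500)])
server = Task("server", 50000, 86400.0,
              [Task("worker", 1000, 3600.0) for _ in range(50)])

bomb_score = badness(bomb)
server_score = badness(server)
```

Runtime is used here only to decide whether a task is a forkbomb; it never changes the size-based score of an ordinary task.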
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-11-04 1:58 ` David Rientjes
@ 2009-11-04 2:17 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-04 2:17 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 3 Nov 2009 17:58:04 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > > That's a different point. Today, we can influence the badness score of
> > > any user thread to prioritize oom killing from userspace and that can be
> > > done regardless of whether there's a memory leaker, a fork bomber, etc.
> > > The priority based oom killing is important to production scenarios and
> > > > cannot be replaced by a heuristic that works every time if it cannot be
> > > influenced by userspace.
> > >
> > > I didn't remove oom_adj...
> >
>
> Right, but we must ensure that we have the same ability to influence a
> priority based oom killing scheme from userspace as we currently do with a
> relatively static total_vm. total_vm may not be the optimal baseline, but
> it does allow users to tune oom_adj specifically to identify tasks that
> are using more memory than expected, and it is static enough to not depend
> on something like rss, which is really hard to predict at the time of oom.
>
> That's actually my main goal in this discussion: to avoid losing any
> ability of userspace to influence the priority of tasks being oom killed
> (if you haven't noticed :).
>
> > > Tweaking on the heuristic will probably make it more convoluted and
> > > overall worse, I agree. But it's a more stable baseline than rss from
> > > which we can set oom killing priorities from userspace.
> >
> > - "rss < total_vm_size" always.
>
> But rss is much more dynamic than total_vm, that's my point.
>
My point and your point are different.
1. My concern is the "baseline for heuristics".
2. Your concern is the "baseline for a knob such as oom_adj".
OK? For victim selection by the kernel, a dynamic value is much more useful.
The current "random kill" and "kill multiple processes" behavior is too bad.
Considering what the oom-killer is for, I think "1" is more important.
But I know what you want, so I am offering a new knob which is not affected
by RSS, as I wrote in my previous mail.
Off-topic:
As memcg keeps getting better, I think using the OOM killer for resource
control should end. Maybe Fake-NUMA+cpuset works well for Google's systems,
but please consider using memcg.
> > - The oom_adj calculation is quite strong.
> > - total_vm of processes which map hugetlb is very big ....but killing them
> > is no help for usual oom.
> >
> > I recommend adding a "stable baseline" knob for user space, as I wrote.
> > My patch 6 adds a stable baseline bonus of 50% of vm size if run_time is
> > large enough.
> >
>
> There's no clear relationship between VM size and runtime. The forkbomb
> heuristic itself could easily return a badness of ULONG_MAX if one is
> detected using runtime and number of children, as I earlier proposed, but
> that doesn't seem helpful to factor into the scoring.
>
Old processes are important, younger ones are not. But as I wrote, I'll drop
most of patch "6". So, please forget about this part.
I'm now more interested in a fork-bomb killer than in crazy badness calculation.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-11-04 2:17 ` KAMEZAWA Hiroyuki
@ 2009-11-04 3:10 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-11-04 3:10 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
> My point and your point are different.
>
> 1. My concern is the "baseline for heuristics".
> 2. Your concern is the "baseline for a knob such as oom_adj".
>
> OK? For victim selection by the kernel, a dynamic value is much more useful.
> The current "random kill" and "kill multiple processes" behavior is too bad.
> Considering what the oom-killer is for, I think "1" is more important.
>
> But I know what you want, so I am offering a new knob which is not affected
> by RSS, as I wrote in my previous mail.
>
> Off-topic:
> As memcg keeps getting better, I think using the OOM killer for resource
> control should end. Maybe Fake-NUMA+cpuset works well for Google's systems,
> but please consider using memcg.
>
I understand what you're trying to do, and I agree with it for most
desktop systems. However, I think that admins should have a very strong
influence in what tasks the oom killer kills. It doesn't really matter if
it's via oom_adj or not, and it's debatable whether an adjustment on a
static heuristic score is in our best interest in the first place. But we
must have an alternative so that our control over oom killing isn't lost.
I'd also like to open another topic for discussion if you're proposing
such sweeping changes: at what point do we allow ~__GFP_NOFAIL allocations
to fail even if order < PAGE_ALLOC_COSTLY_ORDER and defer killing
anything? We both agreed that it's not always in the best interest to
kill a task so that an allocation can succeed, so we need to define some
criteria to simply fail the allocation instead.
> Old processes are important, younger ones are not. But as I wrote, I'll drop
> most of patch "6". So, please forget about this part.
>
> I'm now more interested in a fork-bomb killer than in crazy badness calculation.
>
Ok, great. Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-11-04 3:10 ` David Rientjes
@ 2009-11-04 3:19 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-11-04 3:19 UTC (permalink / raw)
To: David Rientjes
Cc: vedran.furac, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 3 Nov 2009 19:10:34 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 4 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > My point and your point are different.
> >
> > 1. My concern is the "baseline for heuristics".
> > 2. Your concern is the "baseline for a knob such as oom_adj".
> >
> > OK? For victim selection by the kernel, a dynamic value is much more useful.
> > The current "random kill" and "kill multiple processes" behavior is too bad.
> > Considering what the oom-killer is for, I think "1" is more important.
> >
> > But I know what you want, so I am offering a new knob which is not affected
> > by RSS, as I wrote in my previous mail.
> >
> > Off-topic:
> > As memcg keeps getting better, I think using the OOM killer for resource
> > control should end. Maybe Fake-NUMA+cpuset works well for Google's systems,
> > but please consider using memcg.
> >
>
> I understand what you're trying to do, and I agree with it for most
> desktop systems. However, I think that admins should have a very strong
> influence in what tasks the oom killer kills. It doesn't really matter if
> it's via oom_adj or not, and it's debatable whether an adjustment on a
> static heuristic score is in our best interest in the first place. But we
> must have an alternative so that our control over oom killing isn't lost.
>
I won't rush, so let's discuss and rework the patches more later.
I'll prepare a new version next week. This week, I'll post
swap accounting and improve the fork-bomb detector.
> I'd also like to open another topic for discussion if you're proposing
> such sweeping changes: at what point do we allow ~__GFP_NOFAIL allocations
> to fail even if order < PAGE_ALLOC_COSTLY_ORDER and defer killing
> anything? We both agreed that it's not always in the best interest to
> kill a task so that an allocation can succeed, so we need to define some
> criteria to simply fail the allocation instead.
>
Yes, I think allocations (order > 0) should fail more often before we finally
invoke the OOM killer. That tends to give a softer landing than an oom kill.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 19:53 ` David Rientjes
@ 2009-10-30 13:59 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-30 13:59 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> On Thu, 29 Oct 2009, Vedran Furac wrote:
>
>> But then you should rename OOM killer to TRIPK:
>> Totally Random Innocent Process Killer
>>
>
> The randomness here is the order of the child list when the oom killer
> selects a task, based on the badness score, and then tries to kill a child
> with a different mm before the parent.
>
> The problem you identified in http://pastebin.com/f3f9674a0, however, is a
> forkbomb issue where the badness score should never have been so high for
> kdeinit4 compared to "test". That's directly proportional to adding the
> scores of all disjoint child total_vm values into the badness score for
> the parent and then killing the children instead.
Could you explain to me why ntpd invoked the oom killer? Its parent is init. Or
syslog-ng?
> That's the problem, not using total_vm as a baseline. Replacing that with
> rss is not going to solve the issue and reducing the user's ability to
> specify a rough oom priority from userspace is simply not an option.
OK then, if you have a solution, I would be glad to test your patch. I
won't care much if you don't change total_vm as a baseline. Just make
random killing history.
Regards,
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 13:59 ` Vedran Furač
@ 2009-10-30 19:24 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-30 19:24 UTC (permalink / raw)
To: vedran.furac
Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Fri, 30 Oct 2009, Vedran Furac wrote:
> > The problem you identified in http://pastebin.com/f3f9674a0, however, is a
> > forkbomb issue where the badness score should never have been so high for
> > kdeinit4 compared to "test". That's directly proportional to adding the
> > scores of all disjoint child total_vm values into the badness score for
> > the parent and then killing the children instead.
>
> Could you explain to me why ntpd invoked the oom killer? Its parent is init. Or
> syslog-ng?
>
Because it attempted an order-0 GFP_USER allocation and direct reclaim
could not free any pages.
The task that invoked the oom killer is simply the unlucky task that tried
an allocation that couldn't be satisfied through direct reclaim. It's
usually unrelated to the task chosen for kill unless
/proc/sys/vm/oom_kill_allocating_task is enabled (which SGI requested to
avoid excessively long tasklist scans).
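The distinction between the task that triggers the oom killer and the task
that gets chosen can be sketched in a few lines of C. This is a simplified
userspace model, not kernel code: struct task, pick_victim(), and the scores
in demo_tasks are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    const char *comm;        /* command name */
    unsigned long badness;   /* score from the heuristic */
};

/* Scores below are invented for the example. */
static const struct task demo_tasks[] = {
    { "ntpd",    120 },      /* index 0: the task whose allocation failed */
    { "gkrellm", 900 },
    { "test",    5000 },
};

/*
 * Return the index of the task to kill. 'allocating' is the index of the
 * task whose allocation failed; 'kill_allocating_task' models the sysctl.
 */
size_t pick_victim(const struct task *tasks, size_t n,
                   size_t allocating, int kill_allocating_task)
{
    size_t worst = 0;
    size_t i;

    assert(n > 0 && allocating < n);

    if (kill_allocating_task)
        return allocating;           /* skip the tasklist scan entirely */

    for (i = 1; i < n; i++)
        if (tasks[i].badness > tasks[worst].badness)
            worst = i;
    return worst;
}
```

With the sysctl off, the scan picks "test" even though "ntpd" was the caller;
with it on, "ntpd" itself is returned without looking at any other task.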
> > That's the problem, not using total_vm as a baseline. Replacing that with
> > rss is not going to solve the issue and reducing the user's ability to
> > specify a rough oom priority from userspace is simply not an option.
>
> OK then, if you have a solution, I would be glad to test your patch. I
> won't care much if you don't change total_vm as a baseline. Just make
> random killing history.
>
The only randomness is in selecting a task that has a different mm from
the parent in the order of its child list. Yes, that can be addressed by
doing a smarter iteration through the children before killing one of them.
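A small C model of that child selection, to show where the ordering
dependence comes from. It is a simplification under stated assumptions:
mm_id stands in for the mm pointer, the names are hypothetical, and the
real code's handling of a failed kill is left out.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int pid;
    int mm_id;   /* stand-in for the task's mm pointer */
};

/*
 * Return the pid to kill once 'parent' has been selected: the first child
 * in list order whose mm differs from the parent's, else the parent itself.
 */
int pick_kill_target(const struct task *parent,
                     const struct task *children, size_t n)
{
    size_t i;

    assert(parent != NULL);

    for (i = 0; i < n; i++)
        if (children[i].mm_id != parent->mm_id)
            return children[i].pid;
    return parent->pid;
}

/* Invented example data. */
static const struct task demo_parent = { 100, 1 };
static const struct task demo_children[] = {
    { 101, 1 },   /* shares the parent's mm, skipped */
    { 102, 2 },   /* first child with its own mm: chosen */
    { 103, 3 },   /* never considered, purely because of list order */
};
```

Nothing about pid 102 makes it a better victim than pid 103; it simply comes
first in the list, which is the part a smarter iteration would change.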
Keep in mind that a heuristic as simple as this:
- kill the task that was started most recently by the same uid, or
- kill the task that was started most recently on the system if a root
task calls the oom killer,
would have yielded perfect results for your testcase but isn't necessarily
something that we'd ever want to see.
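For reference, the simple heuristic described above could look like the
following C sketch. The struct layout, the function name, and the demo data
are assumptions made for the example; it only shows how little such a rule
looks at.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int pid;
    unsigned int uid;
    unsigned long start_time;   /* larger means started more recently */
};

/*
 * Return the pid of the most recently started task owned by caller_uid,
 * or of the most recently started task on the system if the caller is
 * root (uid 0). Returns -1 if no task qualifies.
 */
int pick_most_recent(const struct task *tasks, size_t n,
                     unsigned int caller_uid)
{
    int victim = -1;
    unsigned long newest = 0;
    size_t i;

    assert(tasks != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (caller_uid != 0 && tasks[i].uid != caller_uid)
            continue;   /* non-root caller: only consider its own tasks */
        if (victim == -1 || tasks[i].start_time > newest) {
            newest = tasks[i].start_time;
            victim = tasks[i].pid;
        }
    }
    return victim;
}

/* Invented example data. */
static const struct task demo_tasks[] = {
    { 10, 0,    5 },     /* root daemon, started long ago */
    { 20, 1000, 50 },    /* user's session */
    { 30, 1000, 900 },   /* user's newest task */
    { 40, 1001, 950 },   /* another user's newest task */
};
```

Memory usage never enters the decision, which is why it happens to fit this
testcase while being a poor general policy.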
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 19:24 ` David Rientjes
@ 2009-11-02 19:58 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-11-02 19:58 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, Hugh Dickins, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> On Fri, 30 Oct 2009, Vedran Furac wrote:
>
>>> The problem you identified in http://pastebin.com/f3f9674a0, however, is a
>>> forkbomb issue where the badness score should never have been so high for
>>> kdeinit4 compared to "test". That's directly proportional to adding the
>>> scores of all disjoint child total_vm values into the badness score for
>>> the parent and then killing the children instead.
>> Could you explain to me why ntpd invoked the oom killer? Its parent is init. Or
>> syslog-ng?
>>
>
> Because it attempted an order-0 GFP_USER allocation and direct reclaim
> could not free any pages.
>
> The task that invoked the oom killer is simply the unlucky task that tried
> an allocation that couldn't be satisfied through direct reclaim. It's
> usually unrelated to the task chosen for kill unless
> /proc/sys/vm/oom_kill_allocating_task is enabled (which SGI requested to
> avoid excessively long tasklist scans).
Oh, well, I didn't know that. Maybe rephrasing that part of the
output would help prevent future misinterpretation.
>> OK then, if you have a solution, I would be glad to test your patch. I
>> won't care much if you don't change total_vm as a baseline. Just make
>> random killing history.
>
> The only randomness is in selecting a task that has a different mm from
> the parent in the order of its child list. Yes, that can be addressed by
> doing a smarter iteration through the children before killing one of them.
>
> Keep in mind that a heuristic as simple as this:
>
> - kill the task that was started most recently by the same uid, or
>
> - kill the task that was started most recently on the system if a root
> task calls the oom killer,
>
> would have yielded perfect results for your testcase but isn't necessarily
> something that we'd ever want to see.
Of course, I want an algorithm that works well in all possible situations.
Regards,
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 4:08 ` David Rientjes
@ 2009-10-28 13:28 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-28 13:28 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> On Wed, 28 Oct 2009, Vedran Furac wrote:
>
>>> This is wrong; it doesn't "emulate oom" since oom_kill_process() always
>>> kills a child of the selected process instead if they do not share the
>>> same memory. The chosen task in that case is untouched.
>> OK, I stand corrected then. Thanks! But, while testing this I lost X
>> once again and "test" survived for some time (check the timestamps):
>>
>> http://pastebin.com/d5c9d026e
>>
>> - It started by killing gkrellm(!!!)
>> - Then I lost X (kdeinit4 I guess)
>> - Then 103 seconds after the killing started, it killed "test" - the
>> real culprit.
>>
>> I mean... how?!
>>
>
> Here are the five oom kills that occurred in your log, and notice that the
> first four times it kills a child and not the actual task as I explained:
Yes, but four times wrong.
> Those are practically happening simultaneously with very little memory
> being available between each oom kill. Only later is "test" killed:
>
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
>
> Notice how the badness score is less than 1/4th of the others. So while
> you may find it to be hogging a lot of memory, there were others that
> consumed much more.
^^^^^^^^^^^^^^^^^^^^^
This is just wrong. I have 3.5GB of RAM, free says that 2GB are empty
(ignoring cache). Culprit then allocates all free memory (2GB). That
means it is using *more* than all other processes *together*. There
cannot be any other "that consumed much more".
> You can get a more detailed understanding of this by doing
>
> echo 1 > /proc/sys/vm/oom_dump_tasks
>
> before trying your testcase; it will show various information like the
> total_vm
Looking at total_vm (VIRT in top/vsize in ps?) is completely wrong. If I
sum up those numbers for every process running I would get:
% ps -eo pid,vsize,command | awk '{ SUM += $2 } END { print SUM/1024/1024 }'
14.7935
14GB. And I only have 3GB. I usually use exmap to get realistic numbers:
http://www.berthels.co.uk/exmap/doc.html
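To make the gap between virtual and resident size concrete, here is a minimal sketch that sums both columns of `ps -eo pid,vsize,rss,comm`-style output (ps reports both in KiB). The sample rows are made up for illustration and do not come from this system; the function name is hypothetical.

```python
def sum_columns_gib(ps_output: str):
    """Sum the VSZ and RSS columns (KiB) and return both totals in GiB."""
    vsz_kib = 0
    rss_kib = 0
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        vsz_kib += int(fields[1])
        rss_kib += int(fields[2])
    return vsz_kib / 1024 / 1024, rss_kib / 1024 / 1024


# Made-up sample in the layout printed by `ps -eo pid,vsize,rss,comm`.
sample = """\
  PID     VSZ     RSS COMMAND
    1    4096    1024 init
  200  800000   30000 launcher
  300 2600000 2500000 hog
"""
vsz, rss = sum_columns_gib(sample)
print(f"virtual: {vsz:.2f} GiB, resident: {rss:.2f} GiB")
```

The virtual total can exceed physical RAM many times over, as in the 14GB figure above, while the resident total cannot.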
> and oom_adj value for each task at the time of oom (and the
> actual badness score is exported per-task via /proc/pid/oom_score in
> real-time). This will also include the rss and show what the end result
> would be in using that value as part of the heuristic on this particular
> workload compared to the current implementation.
Thanks, I'll try that... but I guess that using rss would yield better
results.
Regards,
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 13:28 ` Vedran Furač
@ 2009-10-28 20:10 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-28 20:10 UTC (permalink / raw)
To: Vedran Furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, Vedran Furac wrote:
> > Those are practically happening simultaneously with very little memory
> > being available between each oom kill. Only later is "test" killed:
> >
> > [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> > [97240.206832] Killed process 5005 (test)
> >
> > Notice how the badness score is less than 1/4th of the others. So while
> > you may find it to be hogging a lot of memory, there were others that
> > consumed much more.
> ^^^^^^^^^^^^^^^^^^^^^
>
> This is just wrong. I have 3.5GB of RAM, free says that 2GB are empty
> (ignoring cache). Culprit then allocates all free memory (2GB). That
> means it is using *more* than all other processes *together*. There
> cannot be any other "that consumed much more".
>
Just post the oom killer results after using echo 1 >
/proc/sys/vm/oom_dump_tasks as requested and it will clarify why those
tasks were chosen to kill. It will also show the result of using rss
instead of total_vm and allow us to see how such a change would have
changed the killing order for your workload.
> Thanks, I'll try that... but I guess that using rss would yield better
> results.
>
We would know if you posted the data.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 20:10 ` David Rientjes
@ 2009-10-29 3:05 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-29 3:05 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> We would know if you posted the data.
I need to find some free time to destroy a session on a computer which I
use for work. You could easily test it yourself also as this doesn't
happen only to me.
Anyways, here it is... this time it started with ntpd:
http://pastebin.com/f3f9674a0
Regards,
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 3:05 ` Vedran Furač
@ 2009-10-29 8:35 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-29 8:35 UTC (permalink / raw)
To: vedran.furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Thu, 29 Oct 2009, Vedran Furac wrote:
> > We would know if you posted the data.
>
> I need to find some free time to destroy a session on a computer which I
> use for work. You could easily test it yourself also as this doesn't
> happen only to me.
>
> Anyways, here it is... this time it started with ntpd:
>
> http://pastebin.com/f3f9674a0
>
That oom log shows 12 ooms but no tasks actually appear to be getting
killed (there're no "Killed process 1234 (task)" found). Do you have any
idea why?
Anyway, as I posted in response to KAMEZAWA-san's patch, the change to
get_mm_rss(mm) prefers Xorg more than the current implementation.
From your log at the link above:
total_vm
669624 test
195695 krunner
187342 krusader
168881 plasma-desktop
130562 ktorrent
127081 knotify4
125881 icedove-bin
123036 akregator
rss
668738 test
42191 Xorg
30761 firefox-bin
13331 icedove-bin
10234 ktorrent
9263 akregator
8864 plasma-desktop
7532 krunner
Can you explain why Xorg is preferred as a baseline to kill rather than
krunner in your example?
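The difference between the two orderings can be reproduced with a small sketch. The figures are copied from the two lists above, restricted to the tasks that appear in both; the units are assumed to be pages, and the helper name is hypothetical.

```python
# (total_vm, rss) per task, copied from the two lists above. Only tasks that
# appear in both lists are included. Units are assumed to be pages.
tasks = {
    "test": (669624, 668738),
    "krunner": (195695, 7532),
    "plasma-desktop": (168881, 8864),
    "ktorrent": (130562, 10234),
    "icedove-bin": (125881, 13331),
    "akregator": (123036, 9263),
}


def ranked(index: int):
    """Task names sorted by the chosen column, largest first."""
    return sorted(tasks, key=lambda name: tasks[name][index], reverse=True)


by_total_vm = ranked(0)
by_rss = ranked(1)
for vm_name, rss_name in zip(by_total_vm, by_rss):
    print(f"{vm_name:16} {rss_name}")
```

Both orderings put "test" first; below it, krunner drops from second place by total_vm to last place by rss among these six tasks.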
Thanks.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 8:35 ` David Rientjes
@ 2009-10-29 11:01 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-29 11:01 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> On Thu, 29 Oct 2009, Vedran Furac wrote:
>
>>> We would know if you posted the data.
>> I need to find some free time to destroy a session on a computer which I
>> use for work. You could easily test it yourself also as this doesn't
>> happen only to me.
>>
>> Anyways, here it is... this time it started with ntpd:
>>
>> http://pastebin.com/f3f9674a0
>>
>
> That oom log shows 12 ooms but no tasks actually appear to be getting
> killed (there're no "Killed process 1234 (task)" found). Do you have any
> idea why?
That's /var/log/messages. I posted it rather than dmesg because the whole
log didn't fit in the dmesg buffer. Here is what I have (compare timestamps):
% dmesg|grep -i kill
[ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231
or a child
[ 1493.064467] Killed process 6409 (konqueror)
[ 1493.261149] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1493.261166] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1493.276528] Out of memory: kill process 6304 (kdeinit4) score 1161265
or a child
[ 1493.276538] Killed process 6411 (krusader)
[ 1499.221160] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1499.221178] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1499.236431] Out of memory: kill process 6304 (kdeinit4) score 1067593
or a child
[ 1499.236441] Killed process 6412 (irexec)
[ 1499.370192] firefox-bin invoked oom-killer: gfp_mask=0x201da,
order=0, oomkilladj=0
[ 1499.370209] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1499.385417] Out of memory: kill process 6304 (kdeinit4) score 1066861
or a child
[ 1499.385427] Killed process 6420 (xchm)
[ 1499.458304] kio_file invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1499.458333] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1499.458367] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
[ 1499.473573] Out of memory: kill process 6304 (kdeinit4) score 1043690
or a child
[ 1499.473582] Killed process 6425 (kio_file)
[ 1500.250746] korgac invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.250765] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.266186] Out of memory: kill process 6304 (kdeinit4) score 1020350
or a child
[ 1500.266196] Killed process 6464 (icedove)
[ 1500.349355] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.349371] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.364689] Out of memory: kill process 6304 (kdeinit4) score 1019864
or a child
[ 1500.364699] Killed process 6477 (kio_http)
[ 1500.452151] kded4 invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.452167] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.452196] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
[ 1500.467307] Out of memory: kill process 6304 (kdeinit4) score 993142
or a child
[ 1500.467316] Killed process 6478 (kio_http)
[ 1500.780222] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1500.780239] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1500.796280] Out of memory: kill process 6304 (kdeinit4) score 966331
or a child
[ 1500.796290] Killed process 6484 (kio_http)
[ 1501.065374] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1501.065390] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1501.080579] Out of memory: kill process 6304 (kdeinit4) score 939434
or a child
[ 1501.080587] Killed process 6486 (kio_http)
[ 1501.381188] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
oomkilladj=0
[ 1501.381204] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691
or a child
[ 1501.396346] Killed process 6487 (firefox-bin)
[ 1502.661294] icedove-bin invoked oom-killer: gfp_mask=0x201da,
order=0, oomkilladj=0
[ 1502.661311] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
[ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a
child
[ 1502.676575] Killed process 7580 (test)
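To read a log like this more easily, each "Out of memory" line can be paired with the "Killed process" line that follows it. The sketch below assumes the two message formats shown above; the function name is hypothetical, and the sample is three entries copied from the log with the wrapped lines rejoined.

```python
import re

SELECT = re.compile(r"Out of memory: kill process (\d+) \(([^)]+)\) score (\d+)")
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def pair_kills(log: str):
    """Return (selected_name, score, killed_name) for each oom kill in the log."""
    pairs = []
    selected = None
    for line in log.splitlines():
        match = SELECT.search(line)
        if match:
            selected = (match.group(2), int(match.group(3)))
            continue
        match = KILLED.search(line)
        if match and selected:
            pairs.append((selected[0], selected[1], match.group(2)))
            selected = None
    return pairs


log = """\
[ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231 or a child
[ 1493.064467] Killed process 6409 (konqueror)
[ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691 or a child
[ 1501.396346] Killed process 6487 (firefox-bin)
[ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a child
[ 1502.676575] Killed process 7580 (test)
"""
for selected_name, score, killed_name in pair_kills(log):
    print(f"{selected_name} (score {score}) -> killed {killed_name}")
```

Seen this way, kdeinit4 is selected repeatedly while one of its children is killed each time, and "test" is selected only once kdeinit4's score has dropped far enough.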
> Can you explain why Xorg is preferred as a baseline to kill rather than
> krunner in your example?
Krunner is a small app for running other apps and doing similar things. It
shouldn't use a lot of memory. OTOH, Xorg has to hold all the pixmaps
and so on. That was the expected result: first Xorg, then firefox and
thunderbird.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 11:01 ` Vedran Furač
@ 2009-10-29 19:42 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-29 19:42 UTC (permalink / raw)
To: vedran.furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Thu, 29 Oct 2009, Vedran Furac wrote:
> [ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231
> or a child
> [ 1493.064467] Killed process 6409 (konqueror)
> [ 1493.261149] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1493.261166] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1493.276528] Out of memory: kill process 6304 (kdeinit4) score 1161265
> or a child
> [ 1493.276538] Killed process 6411 (krusader)
> [ 1499.221160] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1499.221178] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.236431] Out of memory: kill process 6304 (kdeinit4) score 1067593
> or a child
> [ 1499.236441] Killed process 6412 (irexec)
> [ 1499.370192] firefox-bin invoked oom-killer: gfp_mask=0x201da,
> order=0, oomkilladj=0
> [ 1499.370209] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.385417] Out of memory: kill process 6304 (kdeinit4) score 1066861
> or a child
> [ 1499.385427] Killed process 6420 (xchm)
> [ 1499.458304] kio_file invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1499.458333] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.458367] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1499.473573] Out of memory: kill process 6304 (kdeinit4) score 1043690
> or a child
> [ 1499.473582] Killed process 6425 (kio_file)
> [ 1500.250746] korgac invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.250765] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.266186] Out of memory: kill process 6304 (kdeinit4) score 1020350
> or a child
> [ 1500.266196] Killed process 6464 (icedove)
> [ 1500.349355] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.349371] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.364689] Out of memory: kill process 6304 (kdeinit4) score 1019864
> or a child
> [ 1500.364699] Killed process 6477 (kio_http)
> [ 1500.452151] kded4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.452167] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.452196] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1500.467307] Out of memory: kill process 6304 (kdeinit4) score 993142
> or a child
> [ 1500.467316] Killed process 6478 (kio_http)
> [ 1500.780222] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.780239] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.796280] Out of memory: kill process 6304 (kdeinit4) score 966331
> or a child
> [ 1500.796290] Killed process 6484 (kio_http)
> [ 1501.065374] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1501.065390] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.080579] Out of memory: kill process 6304 (kdeinit4) score 939434
> or a child
> [ 1501.080587] Killed process 6486 (kio_http)
> [ 1501.381188] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1501.381204] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691
> or a child
> [ 1501.396346] Killed process 6487 (firefox-bin)
> [ 1502.661294] icedove-bin invoked oom-killer: gfp_mask=0x201da,
> order=0, oomkilladj=0
> [ 1502.661311] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a
> child
> [ 1502.676575] Killed process 7580 (test)
>
Ok, so this is the forkbomb problem caused by adding half of each child's
total_vm to the badness score of the parent. We should address this
completely separately by fixing that specific part of the heuristic,
not by changing what we consider to be a baseline.
The rationale is quite simple: we'll still experience the same problem
with rss as we did with total_vm in the forkbomb scenario above on certain
workloads (maybe not yours, but others). The oom killer always kills a
child first if it has a different mm than the selected parent, so the
amount of memory freed as a result is entirely dependent on the
order of the child list. The killed child may free very little, yet it is
killed because its siblings had large total_vm values.
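The effect described above can be modelled in a few lines. This is a deliberately simplified sketch based only on the description in this message (total_vm as the baseline, plus half of each child's total_vm when the child has a different mm). It is not the kernel's badness() function, which applies further adjustments, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    mm: int        # stand-in identifier for the task's address space
    total_vm: int  # pages
    children: List["Task"] = field(default_factory=list)


def badness(task: Task) -> int:
    """total_vm plus half of each child's total_vm (children with another mm)."""
    points = task.total_vm
    for child in task.children:
        if child.mm != task.mm:
            points += child.total_vm // 2
    return points


def victim(selected: Task) -> Task:
    """The first child with a different mm is killed, else the task itself."""
    for child in selected.children:
        if child.mm != selected.mm:
            return child
    return selected


small = Task("small", mm=2, total_vm=2000)
big_a = Task("big_a", mm=3, total_vm=300000)
big_b = Task("big_b", mm=4, total_vm=300000)
launcher = Task("launcher", mm=1, total_vm=10000, children=[small, big_a, big_b])
hog = Task("hog", mm=5, total_vm=250000)

worst = max([launcher, hog], key=badness)
print(worst.name, badness(worst), "->", victim(worst).name)
```

In this model the launcher outscores the hog purely through its children, and the task actually killed is whichever child comes first in the list, here the smallest one.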
So instead of focusing on rss, we simply need to find a better heuristic
for the forkbomb issue which I've already proposed a very trivial solution
for. Then, afterwards, we can debate about how the scoring heuristic can
be changed to select better tasks (and perhaps remove a lot of the clutter
that's there currently!).
> > Can you explain why Xorg is preferred as a baseline to kill rather than
> > krunner in your example?
>
> Krunner is a small app for running other apps and do similar things. It
> shouldn't use a lot of memory. OTOH, Xorg has to hold all the pixmaps
> and so on. That was expected result. First Xorg, then firefox and
> thunderbird.
>
You're making all these claims and assertions based _solely_ on the theory
that killing the application with the most resident RAM is always the
optimal solution. That's just not true, especially if we're just
allocating small numbers of order-0 memory.
Much better is to allow the user to decide at what point, regardless of
swap usage, their application is using much more memory than expected or
required. They can do that right now pretty well with /proc/pid/oom_adj
without this outlandish claim that they should be expected to know the rss
of their applications at the time of oom to effectively tune oom_adj.
What would you suggest? A script that sits in a loop checking each task's
current rss from /proc/pid/stat or their current oom priority through
/proc/pid/oom_score and adjusting oom_adj preemptively just in case the
oom killer is invoked in the next second?
And that "small app" has 30MB of rss which could be freed, if killed, and
utilized for subsequent page allocations.
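The 30MB figure follows from the rss value reported for krunner earlier in the thread (7532), assuming the value is in pages and the page size is 4096 bytes, as is usual on x86-64. The helper name is hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed, the usual x86-64 page size


def pages_to_mib(pages: int) -> float:
    """Convert a page count to MiB under the assumed page size."""
    return pages * PAGE_SIZE / (1024 * 1024)


print(f"krunner rss: {pages_to_mib(7532):.1f} MiB")  # prints "krunner rss: 29.4 MiB"
```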
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
@ 2009-10-29 19:42 ` David Rientjes
0 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-29 19:42 UTC (permalink / raw)
To: vedran.furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Thu, 29 Oct 2009, Vedran Furac wrote:
> [ 1493.064458] Out of memory: kill process 6304 (kdeinit4) score 1190231
> or a child
> [ 1493.064467] Killed process 6409 (konqueror)
> [ 1493.261149] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1493.261166] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1493.276528] Out of memory: kill process 6304 (kdeinit4) score 1161265
> or a child
> [ 1493.276538] Killed process 6411 (krusader)
> [ 1499.221160] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1499.221178] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.236431] Out of memory: kill process 6304 (kdeinit4) score 1067593
> or a child
> [ 1499.236441] Killed process 6412 (irexec)
> [ 1499.370192] firefox-bin invoked oom-killer: gfp_mask=0x201da,
> order=0, oomkilladj=0
> [ 1499.370209] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.385417] Out of memory: kill process 6304 (kdeinit4) score 1066861
> or a child
> [ 1499.385427] Killed process 6420 (xchm)
> [ 1499.458304] kio_file invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1499.458333] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1499.458367] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1499.473573] Out of memory: kill process 6304 (kdeinit4) score 1043690
> or a child
> [ 1499.473582] Killed process 6425 (kio_file)
> [ 1500.250746] korgac invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.250765] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.266186] Out of memory: kill process 6304 (kdeinit4) score 1020350
> or a child
> [ 1500.266196] Killed process 6464 (icedove)
> [ 1500.349355] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.349371] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.364689] Out of memory: kill process 6304 (kdeinit4) score 1019864
> or a child
> [ 1500.364699] Killed process 6477 (kio_http)
> [ 1500.452151] kded4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.452167] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.452196] [<ffffffff81120900>] ? d_kill+0x5c/0x7c
> [ 1500.467307] Out of memory: kill process 6304 (kdeinit4) score 993142
> or a child
> [ 1500.467316] Killed process 6478 (kio_http)
> [ 1500.780222] akregator invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1500.780239] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1500.796280] Out of memory: kill process 6304 (kdeinit4) score 966331
> or a child
> [ 1500.796290] Killed process 6484 (kio_http)
> [ 1501.065374] syslog-ng invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1501.065390] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.080579] Out of memory: kill process 6304 (kdeinit4) score 939434
> or a child
> [ 1501.080587] Killed process 6486 (kio_http)
> [ 1501.381188] knotify4 invoked oom-killer: gfp_mask=0x201da, order=0,
> oomkilladj=0
> [ 1501.381204] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1501.396338] Out of memory: kill process 6304 (kdeinit4) score 912691
> or a child
> [ 1501.396346] Killed process 6487 (firefox-bin)
> [ 1502.661294] icedove-bin invoked oom-killer: gfp_mask=0x201da,
> order=0, oomkilladj=0
> [ 1502.661311] [<ffffffff810d6dd7>] ? oom_kill_process+0x9a/0x264
> [ 1502.676563] Out of memory: kill process 7580 (test) score 708945 or a
> child
> [ 1502.676575] Killed process 7580 (test)
>
Ok, so this is the forkbomb problem caused by adding half of each child's
total_vm into the badness score of the parent. We should address this
completely separately by fixing that specific part of the heuristic,
not by changing what we consider to be a baseline.
The rationale is quite simple: we'll still experience the same problem
with rss as we did with total_vm in the forkbomb scenario above on certain
workloads (maybe not yours, but others). The oom killer always kills a
child first if it has a different mm than the selected parent, so the
amount of memory freed as a result is entirely dependent on the order
of the child list. The killed child may free very little memory, yet it
is chosen because its siblings had large total_vm values.
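The effect described above can be illustrated with a toy model. This is
only a sketch under stated assumptions: the Task class, the function
names and all the numbers are hypothetical, and the scoring keeps only
the total_vm and "half of each child" terms, ignoring everything else the
real badness() heuristic looks at (run time, nice, oom_adj and so on).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    name: str
    mm_id: int       # stand-in for the task's mm pointer (hypothetical)
    total_vm: int    # size of the virtual address space, in pages
    children: List["Task"] = field(default_factory=list)


def toy_badness(task: Task) -> int:
    """Simplified score: own total_vm plus half of each child's total_vm."""
    points = task.total_vm
    for child in task.children:
        if child.mm_id != task.mm_id:      # children sharing the mm add nothing
            points += child.total_vm // 2 + 1
    return points


def first_victim(selected: Task) -> Task:
    """A child with a different mm is killed before the selected parent."""
    for child in selected.children:
        if child.mm_id != selected.mm_id:
            return child
    return selected


# A small launcher with 20 medium-sized children versus one memory hog.
launcher = Task("launcher", 1, 10_000,
                [Task(f"child{i}", 100 + i, 60_000) for i in range(20)])
hog = Task("hog", 2, 500_000)

worst = max([launcher, hog], key=toy_badness)
print(worst.name, toy_badness(launcher), toy_badness(hog))
print("killed first:", first_victim(worst).name)
```

In this made-up setup the launcher outscores the hog purely through its
children, and the first task killed is whichever child happens to be
first in the list.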
So instead of focusing on rss, we simply need to find a better heuristic
for the forkbomb issue which I've already proposed a very trivial solution
for. Then, afterwards, we can debate about how the scoring heuristic can
be changed to select better tasks (and perhaps remove a lot of the clutter
that's there currently!).
> > Can you explain why Xorg is preferred as a baseline to kill rather than
> > krunner in your example?
>
> Krunner is a small app for running other apps and do similar things. It
> shouldn't use a lot of memory. OTOH, Xorg has to hold all the pixmaps
> and so on. That was expected result. Fist Xorg, then firefox and
> thunderbird.
>
You're making all these claims and assertions based _solely_ on the theory
that killing the application with the most resident RAM is always the
optimal solution. That's just not true, especially if we're just
allocating small numbers of order-0 memory.
Much better is to allow the user to decide at what point, regardless of
swap usage, their application is using much more memory than expected or
required. They can do that right now pretty well with /proc/pid/oom_adj
without this outlandish claim that they should be expected to know the rss
of their applications at the time of oom to effectively tune oom_adj.
What would you suggest? A script that sits in a loop checking each task's
current rss from /proc/pid/stat or their current oom priority through
/proc/pid/oom_score and adjusting oom_adj preemptively just in case the
oom killer is invoked in the next second?
And that "small app" has 30MB of rss which could be freed, if killed, and
utilized for subsequent page allocations.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-29 19:42 ` David Rientjes
@ 2009-10-30 13:53 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-30 13:53 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> Ok, so this is the forkbomb problem by adding half of each child's
> total_vm into the badness score of the parent. We should address this
> completely seperately by addressing that specific part of the heuristic,
> not changing what we consider to be a baseline.
> thunderbird.
>
> You're making all these claims and assertions based _solely_ on the theory
> that killing the application with the most resident RAM is always the
> optimal solution. That's just not true, especially if we're just
> allocating small numbers of order-0 memory.
Well, you are the kernel hacker, not me. You know how Linux mm works much
better than I do. I just reported what I think is a big problem, one that
needs to be solved ASAP (2.6.33). I'm afraid that we'll just talk a lot
and nothing will be done, with the solution/fix postponed indefinitely.
Not sure if you are interested, but I tested this on Windows XP too, and
nothing bad happens there; the system continues to function properly.
For 2-3 years I had memory overcommit turned off. I didn't get any OOM,
but sometimes Java didn't work, and it seems that because of some kernel
weirdness (or misunderstanding on my part) I couldn't use all the
available memory:
# echo 2 > /proc/sys/vm/overcommit_memory
# echo 95 > /proc/sys/vm/overcommit_ratio
% ./test /* malloc in loop as before */
malloc: Cannot allocate memory /* Great, no OOM, but: */
% free -m
             total       used       free     shared    buffers     cached
Mem:          3458       3429         29          0        102       1119
-/+ buffers/cache:       2207       1251
There's plenty of memory available. Shouldn't cache be automatically
dropped (this question was in my original mail, hence the subject)?
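For readers unfamiliar with the "-/+ buffers/cache" row, here is a
minimal sketch of how it relates to the "Mem:" row, using the figures
from the output above. The function name is made up for illustration,
and the arithmetic is done on the already-rounded megabyte values, which
presumably explains why the result is off by one from what free printed.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used, free) with buffers and page cache counted as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


used_by_apps, free_for_apps = buffers_cache_row(
    used=3429, free=29, buffers=102, cached=1119)
print(used_by_apps, free_for_apps)   # 2208 1250 (free showed 2207 1251)
```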
All this frustrated not only me, but a great number of users on our
local Croatian Linux usenet newsgroup, with some of them pointing to
that as the reason they use Solaris. And so on...
> Much better is to allow the user to decide at what point, regardless of
> swap usage, their application is using much more memory than expected or
> required. They can do that right now pretty well with /proc/pid/oom_adj
> without this outlandish claim that they should be expected to know the rss
> of their applications at the time of oom to effectively tune oom_adj.
Believe me, very few developers use oom_adj for their applications,
and probably almost no end users do. What should they do: go to a
console and set oom_adj every time they start an application? You
cannot expect them to do that.
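For reference, this is roughly what setting oom_adj by hand involves. A
minimal sketch, assuming the /proc/pid/oom_adj interface discussed in
this thread with its -17..15 range (-17 exempts the task); the helper
names and the proc_root parameter are made up for illustration, and
later kernels may prefer a different file.

```python
import os

OOM_ADJ_MIN, OOM_ADJ_MAX = -17, 15   # -17 exempts the task from the oom killer


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an oom_adj value for pid and return the path written to."""
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(f"oom_adj must be in [{OOM_ADJ_MIN}, {OOM_ADJ_MAX}]")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:       # lowering the value may need privileges
        f.write(f"{value}\n")
    return path


def read_oom_score(pid, proc_root="/proc"):
    """Read the score the kernel currently reports for pid."""
    with open(os.path.join(proc_root, str(pid), "oom_score")) as f:
        return int(f.read().strip())
```

A launcher could call set_oom_adj(os.getpid(), 10) just before exec'ing
a program that the user would rather lose first, which is the per-start
chore being objected to here.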
> What would you suggest? A script that sits in a loop checking each task's
> current rss from /proc/pid/stat or their current oom priority though
> /proc/pid/oom_score and adjusting oom_adj preemptively just in case the
> oom killer is invoked in the next second?
:)
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 13:53 ` Vedran Furač
@ 2009-10-30 14:08 ` Thomas Fjellstrom
-1 siblings, 0 replies; 128+ messages in thread
From: Thomas Fjellstrom @ 2009-10-30 14:08 UTC (permalink / raw)
To: linux-kernel, vedran.furac
Cc: David Rientjes, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Fri October 30 2009, Vedran Furač wrote:
> David Rientjes wrote:
> > Ok, so this is the forkbomb problem by adding half of each child's
> > total_vm into the badness score of the parent. We should address this
> > completely seperately by addressing that specific part of the
> > heuristic, not changing what we consider to be a baseline.
> > thunderbird.
> >
> > You're making all these claims and assertions based _solely_ on the
> > theory that killing the application with the most resident RAM is
> > always the optimal solution. That's just not true, especially if we're
> > just allocating small numbers of order-0 memory.
>
> Well, you are kernel hacker, not me. You know how linux mm works much
> more than I do. I just reported a, what I think is a big problem, which
> needs to be solved ASAP (2.6.33). I'm afraid that we'll just talk much
> and nothing will be done with solution/fix postponed indefinitely. Not
> sure if you are interested, but I tested this on windowsxp also, and
> nothing bad happens there, system continues to function properly.
>
> For 2-3 years I had memory overcommit turn off. I didn't get any OOM,
> but sometimes Java didn't work and it seems that because of some kernel
> weirdness (or misunderstanding on my part) I couldn't use all the
> available memory:
>
> # echo 2 > /proc/sys/vm/overcommit_memory
>
> # echo 95 > /proc/sys/vm/overcommit_ratio
> % ./test /* malloc in loop as before */
> malloc: Cannot allocate memory /* Great, no OOM, but: */
>
> % free -m
>              total       used       free     shared    buffers     cached
> Mem:          3458       3429         29          0        102       1119
> -/+ buffers/cache:       2207       1251
>
> There's plenty of memory available. Shouldn't cache be automatically
> dropped (this question was in my original mail, hence the subject)?
>
> All this frustrated not only me, but a great number of users on our
> local Croatian linux usenet newsgroup with some of them pointing that as
> the reason they use solaris. And so on...
I think this is the MOST serious issue related to the oom killer. For some
reason it refuses to drop pages before trying to kill, when it should drop
cache first, THEN kill if needed.
> > Much better is to allow the user to decide at what point, regardless of
> > swap usage, their application is using much more memory than expected
> > or required. They can do that right now pretty well with
> > /proc/pid/oom_adj without this outlandish claim that they should be
> > expected to know the rss of their applications at the time of oom to
> > effectively tune oom_adj.
>
> Believe me, barely a few developers use oom_adj for their applications,
> and probably almost none of the end users. What should they do, every
> time they start an application, go to console and set the oom_adj. You
> cannot expect them to do that.
>
> > What would you suggest? A script that sits in a loop checking each
> > task's current rss from /proc/pid/stat or their current oom priority
> > though /proc/pid/oom_score and adjusting oom_adj preemptively just in
> > case the oom killer is invoked in the next second?
> >
> :)
>
--
Thomas Fjellstrom
tfjellstrom@shaw.ca
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 14:08 ` Thomas Fjellstrom
@ 2009-10-30 15:13 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-30 15:13 UTC (permalink / raw)
To: tfjellstrom
Cc: linux-kernel, David Rientjes, Hugh Dickins, KAMEZAWA Hiroyuki,
linux-mm, KOSAKI Motohiro, minchan.kim, Andrew Morton,
Andrea Arcangeli
Thomas Fjellstrom wrote:
>> malloc: Cannot allocate memory /* Great, no OOM, but: */
>>
>> % free -m
>>              total       used       free     shared    buffers     cached
>> Mem:          3458       3429         29          0        102       1119
>> -/+ buffers/cache:       2207       1251
>>
>> There's plenty of memory available. Shouldn't cache be
>> automatically dropped (this question was in my original mail, hence
>> the subject)?
>>
>
> I think this is the MOST serious issue related to the oom killer. For
> some reason it refuses to drop pages before trying to kill. When it
> should drop cache, THEN kill if needed.
This isn't about OOM, but about the situation when you turn off
overcommit. I was jumping to conclusions here. You can drop caches
manually with:
# echo 1 > /proc/sys/vm/drop_caches
but you still get: "malloc: Cannot allocate memory" even if almost
nothing is cached:
             total       used       free     shared    buffers     cached
Mem:          3458       2210       1248          0          3         90
-/+ buffers/cache:       2116       1342
As for the kernel not dropping pages before killing, I don't know
anything about it. It happens so fast and I never tried to measure it.
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 13:53 ` Vedran Furač
@ 2009-10-30 14:12 ` Andrea Arcangeli
-1 siblings, 0 replies; 128+ messages in thread
From: Andrea Arcangeli @ 2009-10-30 14:12 UTC (permalink / raw)
To: Vedran Furač
Cc: David Rientjes, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm,
linux-kernel, KOSAKI Motohiro, minchan.kim, Andrew Morton
On Fri, Oct 30, 2009 at 02:53:33PM +0100, Vedran Furač wrote:
> % free -m
>              total       used       free     shared    buffers     cached
> Mem:          3458       3429         29          0        102       1119
> -/+ buffers/cache:       2207       1251
>
> There's plenty of memory available. Shouldn't cache be automatically
> dropped (this question was in my original mail, hence the subject)?
This is not about cache: the cache amount is physical, while this is
about the virtual amount that can only go in ram or swap (at any later
time, the current time is irrelevant) vs "ram + swap". In short, add more
swap if you don't like overcommit, and check grep Commit /proc/meminfo in
case this is an accounting bug...
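The grep suggested above can also be done programmatically. A minimal
sketch, assuming the usual "Name:   value kB" layout of /proc/meminfo;
the function names are made up for illustration.

```python
def commit_fields(meminfo_text):
    """Return {'CommitLimit': kB, 'Committed_AS': kB} from meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.startswith("Commit") and rest.split():
            fields[name] = int(rest.split()[0])   # first token is the kB value
    return fields


def read_commit_fields(path="/proc/meminfo"):
    """Same as commit_fields(), reading the live file (Linux only)."""
    with open(path) as f:
        return commit_fields(f.read())
```

If Committed_AS is close to CommitLimit while overcommit_memory is 2, new
allocations are refused no matter how much page cache could be dropped.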
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 14:12 ` Andrea Arcangeli
@ 2009-10-30 14:41 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-10-30 14:41 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: David Rientjes, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm,
linux-kernel, KOSAKI Motohiro, minchan.kim, Andrew Morton
Andrea Arcangeli wrote:
> On Fri, Oct 30, 2009 at 02:53:33PM +0100, Vedran Furač wrote:
>> % free -m
>>              total       used       free     shared    buffers     cached
>> Mem:          3458       3429         29          0        102       1119
>> -/+ buffers/cache:       2207       1251
>>
>> There's plenty of memory available. Shouldn't cache be automatically
>> dropped (this question was in my original mail, hence the subject)?
>
> This is not about cache, cache amount is physical, this about
> virtual amount that can only go in ram or swap (at any later time,
> current time is irrelevant) vs "ram + swap".
Oh... so this is because apps "reserve" (Committed_AS?) more than they
currently need.
> In short add more swap if
> you don't like overcommit and check grep Commit /proc/meminfo in case
> this is accounting bug...
At the time of "malloc: Cannot allocate memory":
CommitLimit: 3364440 kB
Committed_AS: 3240200 kB
So probably everything is ok (and free is misleading). Overcommit is
unfortunately necessary if I want to be able to use all my memory.
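To put the two numbers above in perspective, here is a small sketch of
the remaining headroom. The function name is made up, and it treats the
headroom simply as CommitLimit minus Committed_AS; the kernel may hold
back further small reserves that this ignores.

```python
def commit_headroom_kb(commit_limit_kb, committed_as_kb):
    """Address space (kB) that can still be committed in strict mode."""
    return commit_limit_kb - committed_as_kb


headroom = commit_headroom_kb(3364440, 3240200)
print(headroom, "kB =", round(headroom / 1024, 1), "MiB")   # 124240 kB = 121.3 MiB
```

So only about 121 MiB of new commitments were possible at that moment,
even though free reported more than 1200 MB as available after
discounting buffers and cache.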
Btw. http://www.redhat.com/advice/tips/meminfo.html says Committed_AS is
a (gu)estimate. Hope it is a good (not too high) guesstimate. :)
Regards,
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 14:41 ` Vedran Furač
@ 2009-10-30 15:15 ` Andrea Arcangeli
-1 siblings, 0 replies; 128+ messages in thread
From: Andrea Arcangeli @ 2009-10-30 15:15 UTC (permalink / raw)
To: Vedran Furač
Cc: David Rientjes, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm,
linux-kernel, KOSAKI Motohiro, minchan.kim, Andrew Morton
On Fri, Oct 30, 2009 at 03:41:12PM +0100, Vedran Furač wrote:
> Oh... so this is because apps "reserve" (Committed_AS?) more then they
> currently need.
They don't actually reserve, they end up "reserving" if overcommit is
set to 2 (OVERCOMMIT_NEVER)... Apps aren't reserving, more likely they
simply avoid a flood of mmap calls when a single one is enough to map a
huge MAP_PRIVATE region, like shared libs that you may only execute
partially (this is why total_vm is usually much bigger than the real ram
mapped by pagetables, which is what rss represents). But those shared
libs are 99% pageable and they don't need to stay in swap or ram, so
overcommit-as greatly overestimates the actual needs, even if shared lib
loading weren't 64bit optimized (i.e. large and a single one).
> A the time of "malloc: Cannot allocate memory":
>
> CommitLimit: 3364440 kB
> Committed_AS: 3240200 kB
>
> So probably everything is ok (and free is misleading). Overcommit is
> unfortunately necessary if I want to be able to use all my memory.
Add more swap.
> Btw. http://www.redhat.com/advice/tips/meminfo.html says Committed_AS is
> a (gu)estimate. Hope it is a good (not to high) guesstimate. :)
It is a guess in the sense that, to guarantee no ENOMEM, it has to take
into account the worst possible case, that is all shared lib MAP_PRIVATE
mappings being cowed, which is very far from reality. Other than that
the overcommit-as should exactly match all mmapped possibly writeable
space that can only fit in ram+swap, so from that point of view it's
not a guessed number (modulo the smp read out of order). The only
guess is how much slab, cache and other stuff is freeable, which
keeps OVERCOMMIT_NEVER from being truly perfect.
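The "add more swap" advice follows from how the limit is formed. A
minimal sketch, assuming the documented formula for overcommit_memory=2
(RAM minus huge pages, scaled by overcommit_ratio, plus swap); the
function name is made up, and the RAM figure below is a hypothetical
value chosen so that the result matches the CommitLimit quoted above,
since the thread never states the poster's exact RAM or swap size.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate CommitLimit in kB for overcommit_memory=2."""
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


# Hypothetical machine: about 3458 MB of RAM, ratio 95, no swap.
print(commit_limit_kb(3541516, 0, 95))              # 3364440

# Same machine with 2 GiB of swap: the limit grows by the full swap size.
print(commit_limit_kb(3541516, 2 * 1024 * 1024, 95))   # 5461592
```

Swap is added in full, while RAM is scaled by the ratio, so adding swap
is the direct way to raise the limit without enabling overcommit.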
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 15:15 ` Andrea Arcangeli
@ 2009-10-30 16:24 ` Hugh Dickins
-1 siblings, 0 replies; 128+ messages in thread
From: Hugh Dickins @ 2009-10-30 16:24 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Vedran Furač, David Rientjes, KAMEZAWA Hiroyuki, linux-mm,
linux-kernel, KOSAKI Motohiro, minchan.kim, Andrew Morton
On Fri, 30 Oct 2009, Andrea Arcangeli wrote:
>
> It is a guess in the sense to guarantee no ENOMEM it has to take into
> account the worst possible case, that is all shared lib MAP_PRIVATE
> mappings are cowed, which is very far from reality.
A MAP_PRIVATE area is only counted into Committed_AS when it is or
has in the past been PROT_WRITE. I think it's up to the ELF header
of the shared library whether a section is PROT_WRITE or not; but it
looks like many are not, so Committed_AS should be (a little) nearer
reality than you fear.
Though we do account for Committed_AS, even while allowing overcommit,
we do not at present account for Committed_AS per mm. Seeing David
and KAMEZAWA-san debating over total_vm versus rss versus anon_rss,
I wonder whether such a "commit" count might be a better measure for
OOM choices (but shmem is as usual awkward: though accounted just once
in Committed_AS, it would probably have to be accounted to every mm
that maps it). Just an idea to throw into the mix.
Hugh
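The accounting rule described above can be modelled in a few lines of userspace C. This is only a toy sketch: the TOY_* flags and struct toy_vma are stand-ins invented here, and the test mirrors my understanding of the kernel's check (private, writable, not MAP_NORESERVE) rather than quoting it.

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel's VM_* flags; the values are arbitrary. */
enum { TOY_WRITE = 1 << 0, TOY_SHARED = 1 << 1, TOY_NORESERVE = 1 << 2 };

struct toy_vma {
    unsigned long pages;
    unsigned flags;
    bool accounted;     /* sticky: once charged, it stays charged */
};

/* Only a private, writable mapping without NORESERVE is charged. */
bool toy_accountable(unsigned flags)
{
    return (flags & (TOY_NORESERVE | TOY_SHARED | TOY_WRITE)) == TOY_WRITE;
}

/* Apply new flags, as mmap() or mprotect() would; returns the number of
 * pages newly charged to the (toy) Committed_AS counter. */
unsigned long toy_set_flags(struct toy_vma *vma, unsigned flags)
{
    unsigned long charged = 0;

    vma->flags = flags;
    if (!vma->accounted && toy_accountable(flags)) {
        vma->accounted = true;
        charged = vma->pages;
    }
    return charged;
}
```

A read-only text segment (flags 0) charges nothing. Making it writable once charges its full size, and dropping write permission afterwards does not give the charge back, which is the "is or has in the past been PROT_WRITE" behaviour.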
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 15:15 ` Andrea Arcangeli
@ 2009-11-02 19:56 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-11-02 19:56 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: David Rientjes, Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm,
linux-kernel, KOSAKI Motohiro, minchan.kim, Andrew Morton
Andrea Arcangeli wrote:
> On Fri, Oct 30, 2009 at 03:41:12PM +0100, Vedran Furač wrote:
>> Oh... so this is because apps "reserve" (Committed_AS?) more than they
>> currently need.
>
> They don't actually reserve, they end up "reserving" if overcommit is
> set to 2 (OVERCOMMIT_NEVER)... Apps aren't reserving, more likely they
> simply avoid a flood of mmap when a single one is enough to map an
> huge MAP_PRIVATE region like shared libs that you may only execute
> partially (this is why total_vm is usually much bigger than real ram
> mapped by pagetables represented in rss). But those shared libs are
> 99% pageable and they don't need to stay in swap or ram, so
> overcommit-as greatly overestimates the actual needs even if shared lib
> loading wouldn't be 64bit optimized (i.e. large and a single one).
Thanks for info!
>> At the time of "malloc: Cannot allocate memory":
>>
>> CommitLimit: 3364440 kB
>> Committed_AS: 3240200 kB
>>
>> So probably everything is ok (and free is misleading). Overcommit is
>> unfortunately necessary if I want to be able to use all my memory.
>
> Add more swap.
I don't use swap. With current prices of RAM, swap is history, at least
for desktops. I hate when e.g. firefox gets swapped out if I don't use
it for a while. Removing swap decreased desktop latencies drastically.
And I don't care much if I'll lose 100MB of potential free memory that
could be used for disk cache...
Regards.
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 13:53 ` Vedran Furač
@ 2009-10-30 19:44 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-30 19:44 UTC (permalink / raw)
To: vedran.furac
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
On Fri, 30 Oct 2009, Vedran Furac wrote:
> Well, you are a kernel hacker, not me. You know how linux mm works much
> better than I do. I just reported what I think is a big problem, which
> needs to be solved ASAP (2.6.33).
The oom killer heuristics have not been changed recently, why is this
suddenly a problem that needs to be immediately addressed? The heuristics
you've been referring to have been used for at least three years.
> I'm afraid that we'll just talk much
> and nothing will be done with solution/fix postponed indefinitely. Not
> sure if you are interested, but I tested this on windowsxp also, and
> nothing bad happens there, system continues to function properly.
>
I'm totally sympathetic to testcases such as your own where the oom killer
seems to react in an undesirable way. I agree that it could do a much
better job at targeting "test" and killing it without negatively impacting
other tasks.
However, I don't think we can simply change the baseline (like the rss
change which has been added to -mm (??)) and consider it a major
improvement when it severely impacts how system administrators are able to
tune the badness heuristic from userspace via /proc/pid/oom_adj. I'm sure
you'd agree that user input is important in this matter and so that we
should maximize that ability rather than make it more difficult. That's
my main criticism of the suggestions thus far (and, sorry, but I have to
look out for production server interests here: you can't take away our
ability to influence oom badness scoring just because other simple
heuristics may be more understandable).
> > Much better is to allow the user to decide at what point, regardless of
> > swap usage, their application is using much more memory than expected or
> > required. They can do that right now pretty well with /proc/pid/oom_adj
> > without this outlandish claim that they should be expected to know the rss
> > of their applications at the time of oom to effectively tune oom_adj.
>
> Believe me, barely a few developers use oom_adj for their applications,
> and probably almost none of the end users. What should they do, every
> time they start an application, go to console and set the oom_adj. You
> cannot expect them to do that.
>
oom_adj is an extremely important part of our infrastructure and although
the majority of Linux users may not use it (I know of a number of opensource
programs that tune their own, however), we can't let go of our ability to
specify an oom killing priority.
There are no simple solutions to this problem: the model proposed thus
far, which has basically been to acknowledge that oom killer is a bad
thing to encounter (but within that, some rationale was found that we can
react however we want??) and should be extremely easy to understand (just
kill the memory hogger with the most resident RAM) is a non-starter.
What would be better, and what I think we'll end up with, is a root
selectable heuristic so that production servers and desktop machines can
use different heuristics to make oom kill selections. We already have
/proc/sys/vm/oom_kill_allocating_task which I added 1-2 years ago to
address concerns specifically of SGI and their enormously long tasklist
scans. This would be a variation on that idea and would include different
simplistic behaviors (such as always killing the most memory hogging task,
killing the most recently started task by the same uid, etc), and leave
the default heuristic much the same as currently.
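To illustrate why changing the baseline affects oom_adj tuning, here is a rough sketch in C of how oom_adj is applied to the badness score in kernels of this era, as far as I recall from mm/oom_kill.c: as a bit shift on whatever baseline the heuristic produced. The function name is invented, and returning 0 for OOM_DISABLE is a simplification (the real code skips such tasks during selection).

```c
/* OOM_DISABLE value for /proc/<pid>/oom_adj (range -17..15). */
#define TOY_OOM_DISABLE (-17)

/* Hypothetical model: scale a baseline badness score by oom_adj. */
unsigned long toy_apply_oom_adj(unsigned long points, int oom_adj)
{
    if (oom_adj == TOY_OOM_DISABLE)
        return 0;               /* simplification: task is never chosen */
    if (oom_adj > 0) {
        if (!points)
            points = 1;         /* so that the shift has an effect */
        return points << oom_adj;
    }
    if (oom_adj < 0)
        return points >> -oom_adj;
    return points;
}
```

Because the adjustment is multiplicative, an oom_adj value chosen against a total_vm baseline means something different once the baseline becomes rss: toy_apply_oom_adj(1000, 2) is 4000 whatever the 1000 was measuring.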
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-30 19:44 ` David Rientjes
@ 2009-11-02 19:56 ` Vedran Furač
-1 siblings, 0 replies; 128+ messages in thread
From: Vedran Furač @ 2009-11-02 19:56 UTC (permalink / raw)
To: David Rientjes
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, linux-mm, linux-kernel,
KOSAKI Motohiro, minchan.kim, Andrew Morton, Andrea Arcangeli
David Rientjes wrote:
> On Fri, 30 Oct 2009, Vedran Furac wrote:
>
>> Well, you are kernel hacker, not me. You know how linux mm works much
>> more than I do. I just reported a, what I think is a big problem, which
>> needs to be solved ASAP (2.6.33).
>
> The oom killer heuristics have not been changed recently, why is this
> suddenly a problem that needs to be immediately addressed? The heuristics
> you've been referring to have been used for at least three years.
It isn't "suddenly a problem", just a problem: a big, long-standing
problem. If it is three years old, then it should have been addressed
asap three years ago (and we would not need to talk about it now,
hopefully).
> However, I don't think we can simply change the baseline (like the rss
> change which has been added to -mm (??)) and consider it a major
> improvement when it severely impacts how system administrators are able to
> tune the badness heuristic from userspace via /proc/pid/oom_adj. I'm sure
> you'd agree that user input is important in this matter and so that we
> should maximize that ability rather than make it more difficult. That's
> my main criticism of the suggestions thus far (and, sorry, but I have to
> look out for production server interests here: you can't take away our
> ability to influence oom badness scoring just because other simple
> heuristics may be more understandable).
>
> What would be better, and what I think we'll end up with, is a root
> selectable heuristic so that production servers and desktop machines can
> use different heuristics to make oom kill selections. We already have
> /proc/sys/vm/oom_kill_allocating_task which I added 1-2 years ago to
> address concerns specifically of SGI and their enormously long tasklist
> scans. This would be variation on that idea and would include different
> simplistic behaviors (such as always killing the most memory hogging task,
> killing the most recently started task by the same uid, etc), and leave
> the default heuristic much the same as currently.
OK, agreed. Did you take a look at the set of patches Kame sent today?
Regards,
Vedran
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-27 20:44 ` Hugh Dickins
@ 2009-10-28 0:43 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 0:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: vedran.furac, linux-mm, linux-kernel, kosaki.motohiro,
minchan.kim, akpm, rientjes, aarcange
On Tue, 27 Oct 2009 20:44:16 +0000 (GMT)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Sigh, gnome-session has twice value of mmap(1G).
> > Of course, gnome-session only uses 6M bytes of anon.
> > I wonder this is because gnome-session has many children..but need to
> > dig more. Does anyone has idea ?
>
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>
> I built up a patchset of fixes, but once I came to split them up for
> submission, not one of them seemed entirely satisfactory; and Andrea's
> fix to the KSM/mlock deadlock forced me to abandon even the first of
> the patches (we've since then fixed the way munlocking behaves, so
> in theory could revisit that; but Andrea disliked what I was trying
> to do there in KSM for other reasons, so I've not touched it since).
> I had to get on with KSM, so I set it all aside: none of the issues
> was a recent regression.
>
> I did briefly wonder about the reliance on total_vm which you're now
> looking into, but didn't touch that at all. Let me describe those
> issues which I did try but fail to fix - I've no more time to deal
> with them now than then, but ought at least to mention them to you.
>
Okay, thank you for detailed information.
> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
Hmm.
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
OK, so a simple fix won't help here.
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
I can't agree with that part of the heuristics, either.
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
ok.
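The difference between the existing capability discount and the merged test described in point 2 can be sketched as follows. This is a toy model in C: the booleans stand in for the capability checks, and both function names are invented here.

```c
#include <stdbool.h>

/* Existing behaviour as described: two independent quarterings, so a
 * task holding all three capabilities ends up at one sixteenth. */
unsigned long toy_cap_discount_current(unsigned long points,
                                       bool admin_or_resource, bool rawio)
{
    if (admin_or_resource)
        points /= 4;
    if (rawio)
        points /= 4;
    return points;
}

/* Variant with the CAP_SYS_RAWIO test merged into the first one, so the
 * discount is never more than a single quartering. */
unsigned long toy_cap_discount_merged(unsigned long points,
                                      bool admin_or_resource, bool rawio)
{
    if (admin_or_resource || rawio)
        points /= 4;
    return points;
}
```

For a root task with a baseline of 1600 points, the first function yields 100 and the second 400.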
> 3. badness() has a comment above it which says:
> * 5) we try to kill the process the user expects us to kill, this
> * algorithm has been meticulously tuned to meet the principle
> * of least surprise ... (be careful when you change it)
> But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
> refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
> adds plenty of surprise there, by trying to factor children into the
> calculation. Intended to deal with forkbombs, but any reasonable
> process whose purpose is to fork children (e.g. gnome-session)
> becomes very vulnerable. And whereas badness() itself goes on to
> refine the total_vm points by various adjustments peculiar to the
> process in question, those refinements have been ignored when
> adding the child's total_vm/2. (Andrea does remark that he'd
> rather have rewritten badness() from scratch.)
>
> I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
> part of the calculation up to select_bad_process(), making a
> solo_badness() function which makes all those adjustments to
> total_vm, then badness() itself a simple function adding half
> the children's solo_badness()es to the process' own solo_badness().
> But probably lots more needs doing - Andrea's rewrite?
>
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
>
> I hope these notes help someone towards a better solution
> (and be prepared to discover more on the way). I agree with
> Vedran that the present behaviour is pretty unimpressive, and
> I'm puzzled as to how people can have been tinkering with
> oom_kill.c down the years without seeing any of this.
>
Sorry, I usually don't use X on servers, and almost all of my recent OOM
testing was done under memcg ;(
Thank you for your investigation. Maybe I'll need several steps.
Thanks,
-Kame
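Points 3 and 4 above can be illustrated with a small userspace model in C. The toy_mm and toy_task structs are invented stand-ins for mm_struct and task_struct, the skip of children sharing the parent's mm is my recollection of the existing loop, and the second function only sketches the "search back for an mm we already counted" idea, not any submitted patch.

```c
#include <stddef.h>

struct toy_mm   { unsigned long total_vm; };
struct toy_task { const struct toy_mm *mm; };

/* Current scheme as described: every child with its own mm adds half of
 * that mm's total_vm, even when several children share one mm. */
unsigned long toy_children_points_naive(const struct toy_mm *parent_mm,
                                        const struct toy_task *children,
                                        size_t n)
{
    unsigned long points = 0;

    for (size_t i = 0; i < n; i++) {
        const struct toy_mm *mm = children[i].mm;

        if (mm && mm != parent_mm)
            points += mm->total_vm / 2;
    }
    return points;
}

/* Same, but each distinct mm is counted once: the inner loop searches
 * back through earlier children for the same mm. */
unsigned long toy_children_points_dedup(const struct toy_mm *parent_mm,
                                        const struct toy_task *children,
                                        size_t n)
{
    unsigned long points = 0;

    for (size_t i = 0; i < n; i++) {
        const struct toy_mm *mm = children[i].mm;
        int seen = 0;

        if (!mm || mm == parent_mm)
            continue;
        for (size_t j = 0; j < i; j++) {
            if (children[j].mm == mm) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            points += mm->total_vm / 2;
    }
    return points;
}
```

The inner loop is quadratic in the number of children, which is presumably part of why it was called nasty; it also leaves open the question raised in point 4 of which task's adjustments should apply to a shared mm.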
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
@ 2009-10-28 0:43 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 0:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: vedran.furac, linux-mm, linux-kernel, kosaki.motohiro,
minchan.kim, akpm, rientjes, aarcange
On Tue, 27 Oct 2009 20:44:16 +0000 (GMT)
Hugh Dickins <hugh.dickins@tiscali.co.uk> wrote:
> On Tue, 27 Oct 2009, KAMEZAWA Hiroyuki wrote:
> > Sigh, gnome-session has twice value of mmap(1G).
> > Of course, gnome-session only uses 6M bytes of anon.
> > I wonder this is because gnome-session has many children..but need to
> > dig more. Does anyone has idea ?
>
> When preparing KSM unmerge to handle OOM, I looked at how the precedent
> was handled by running a little program which mmaps an anonymous region
> of the same size as physical memory, then tries to mlock it. The
> program was such an obvious candidate to be killed, I was shocked
> by the poor decisions the OOM killer made. Usually I ran it with
> mem=512M, with gnome and firefox active. Often the OOM killer killed
> it right the first time, but went wrong when I tried it a second time
> (I think that's because of what's already swapped out the first time).
>
> I built up a patchset of fixes, but once I came to split them up for
> submission, not one of them seemed entirely satisfactory; and Andrea's
> fix to the KSM/mlock deadlock forced me to abandon even the first of
> the patches (we've since then fixed the way munlocking behaves, so
> in theory could revisit that; but Andrea disliked what I was trying
> to do there in KSM for other reasons, so I've not touched it since).
> I had to get on with KSM, so I set it all aside: none of the issues
> was a recent regression.
>
> I did briefly wonder about the reliance on total_vm which you're now
> looking into, but didn't touch that at all. Let me describe those
> issues which I did try but fail to fix - I've no more time to deal
> with them now than then, but ought at least to mention them to you.
>
Okay, thank you for detailed information.
> 1. select_bad_process() tries to avoid killing another process while
> there's still a TIF_MEMDIE, but its loop starts by skipping !p->mm
> processes. However, p->mm is set to NULL well before p reaches
> exit_mmap() to actually free the memory, and there may be significant
> delays in between (I think exit_robust_list() gave me a hang at one
> stage). So in practice, even when the OOM killer selects the right
> process to kill, there can be lots of collateral damage from it not
> waiting long enough for that process to give up its memory.
>
Hmm.
> I tried to deal with that by moving the TIF_MEMDIE test up before
> the p->mm test, but adding in a check on p->exit_state:
> if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
> !p->exit_state)
> return ERR_PTR(-1UL);
> But this is then liable to hang the system if there's some reason
> why the selected process cannot proceed to free its memory (e.g.
> the current KSM unmerge case). It needs to wait "a while", but
> give up if no progress is made, instead of hanging: originally
> I thought that setting PF_MEMALLOC more widely in page_alloc.c,
> and giving up on the TIF_MEMDIE if it was waiting in PF_MEMALLOC,
> would deal with that; but we cannot be sure that waiting of memory
> is the only reason for a holdup there (in the KSM unmerge case it's
> waiting for an mmap_sem, and there may well be other such cases).
>
ok, then, easy handling can't be a help.
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
I can't agree that part of heuristics, either.
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
>
ok.
> 3. badness() has a comment above it which says:
> * 5) we try to kill the process the user expects us to kill, this
> * algorithm has been meticulously tuned to meet the principle
> * of least surprise ... (be careful when you change it)
> But Andrea's 2.6.11 86a4c6d9e2e43796bb362debd3f73c0e3b198efa (later
> refined by Kurt's 2.6.16 9827b781f20828e5ceb911b879f268f78fe90815)
> adds plenty of surprise there, by trying to factor children into the
> calculation. Intended to deal with forkbombs, but any reasonable
> process whose purpose is to fork children (e.g. gnome-session)
> becomes very vulnerable. And whereas badness() itself goes on to
> refine the total_vm points by various adjustments peculiar to the
> process in question, those refinements have been ignored when
> adding the child's total_vm/2. (Andrea does remark that he'd
> rather have rewritten badness() from scratch.)
>
> I tried to fix this by moving the PF_OOM_ORIGIN (was PF_SWAPOFF)
> part of the calculation up to select_bad_process(), making a
> solo_badness() function which makes all those adjustments to
> total_vm, then badness() itself a simple function adding half
> the children's solo_badness()es to the process' own solo_badness().
> But probably lots more needs doing - Andrea's rewrite?
>
> 4. In some cases those children are sharing exactly the same mm,
> yet its total_vm is being added again and again to the points:
> I had a nasty inner loop searching back to see if we'd already
> counted this mm (but then, what if the different tasks sharing
> the mm deserved different adjustments to the total_vm?).
>
>
> I hope these notes help someone towards a better solution
> (and be prepared to discover more on the way). I agree with
> Vedran that the present behaviour is pretty unimpressive, and
> I'm puzzled as to how people can have been tinkering with
> oom_kill.c down the years without seeing any of this.
>
Sorry, I usually don't run X on servers, and almost all of my recent OOM
tests were done under memcg ;(
Thank you for your investigation. Maybe I'll need several steps.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-27 20:44 ` Hugh Dickins
@ 2009-10-28 2:47 ` KOSAKI Motohiro
-1 siblings, 0 replies; 128+ messages in thread
From: KOSAKI Motohiro @ 2009-10-28 2:47 UTC (permalink / raw)
To: Hugh Dickins
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, vedran.furac, linux-mm,
linux-kernel, minchan.kim, akpm, rientjes, aarcange
> 2. I started out running my mlock test program as root (later
> switched to use "ulimit -l unlimited" first). But badness() reckons
> CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> and CAP_SYS_RAWIO another reason to quarter your points: so running
> as root makes you sixteen times less likely to be killed. Quartering
> is anyway debatable, but sixteenthing seems utterly excessive to me.
>
> I moved the CAP_SYS_RAWIO test in with the others, so it does no
> more than quartering; but is quartering appropriate anyway? I did
> wonder if I was right to be "subverting" the fine-grained CAPs in
> this way, but have since seen unrelated mail from one who knows
> better, implying they're something of a fantasy, that su and sudo
> are indeed what's used in the real world. Maybe this patch was okay.
I agree quartering is debatable.
At least, killing quartering is worthwhile for any user, and it can be pushed into -stable.
From 27331555366c908a93c2cdd780b77e421869c5af Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Wed, 28 Oct 2009 11:28:39 +0900
Subject: [PATCH] oom: Mitigate super-user's bonus of oom-score
Currently, the oom badness calculation applies the following bonuses:
- Super-user processes have their oom-score quartered
- CAP_SYS_RAWIO processes (e.g. databases) also have their oom-score quartered
The problem is that super-users have CAP_SYS_RAWIO too, so they get a
sixteenthing bonus. That is obviously excessive and meaningless.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/oom_kill.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ea2147d..40d323d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
/*
* Superuser processes are usually more important, so we make it
* less likely that we kill those.
- */
- if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
- has_capability_noaudit(p, CAP_SYS_RESOURCE))
- points /= 4;
-
- /*
- * We don't want to kill a process with direct hardware access.
+ *
+ * Plus, We don't want to kill a process with direct hardware access.
* Not only could that mess up the hardware, but usually users
* tend to only have this flag set on applications they think
* of as important.
*/
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO))
points /= 4;
/*
--
1.6.2.5
^ permalink raw reply related [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 2:47 ` KOSAKI Motohiro
@ 2009-10-28 3:17 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 128+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-28 3:17 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Hugh Dickins, vedran.furac, linux-mm, linux-kernel, minchan.kim,
akpm, rientjes, aarcange
On Wed, 28 Oct 2009 11:47:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 2. I started out running my mlock test program as root (later
> > switched to use "ulimit -l unlimited" first). But badness() reckons
> > CAP_SYS_ADMIN or CAP_SYS_RESOURCE is a reason to quarter your points;
> > and CAP_SYS_RAWIO another reason to quarter your points: so running
> > as root makes you sixteen times less likely to be killed. Quartering
> > is anyway debatable, but sixteenthing seems utterly excessive to me.
> >
> > I moved the CAP_SYS_RAWIO test in with the others, so it does no
> > more than quartering; but is quartering appropriate anyway? I did
> > wonder if I was right to be "subverting" the fine-grained CAPs in
> > this way, but have since seen unrelated mail from one who knows
> > better, implying they're something of a fantasy, that su and sudo
> > are indeed what's used in the real world. Maybe this patch was okay.
>
> I agree quartering is debatable.
> At least, killing quartering is worthwhile for any user, and it can be pushed into -stable.
>
> From 27331555366c908a93c2cdd780b77e421869c5af Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Date: Wed, 28 Oct 2009 11:28:39 +0900
> Subject: [PATCH] oom: Mitigate super-user's bonus of oom-score
>
> Currently, the oom badness calculation applies the following bonuses:
> - Super-user processes have their oom-score quartered
> - CAP_SYS_RAWIO processes (e.g. databases) also have their oom-score quartered
>
> The problem is that super-users have CAP_SYS_RAWIO too, so they get a
> sixteenthing bonus. That is obviously excessive and meaningless.
>
> This patch fixes it.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
I'll pick this up into my series.
Thanks,
-Kame
> ---
> mm/oom_kill.c | 13 +++++--------
> 1 files changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ea2147d..40d323d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> /*
> * Superuser processes are usually more important, so we make it
> * less likely that we kill those.
> - */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> - has_capability_noaudit(p, CAP_SYS_RESOURCE))
> - points /= 4;
> -
> - /*
> - * We don't want to kill a process with direct hardware access.
> + *
> + * Plus, We don't want to kill a process with direct hardware access.
> * Not only could that mess up the hardware, but usually users
> * tend to only have this flag set on applications they think
> * of as important.
> */
> - if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points /= 4;
>
> /*
> --
> 1.6.2.5
>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 2:47 ` KOSAKI Motohiro
@ 2009-10-28 4:12 ` David Rientjes
-1 siblings, 0 replies; 128+ messages in thread
From: David Rientjes @ 2009-10-28 4:12 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Hugh Dickins, KAMEZAWA Hiroyuki, vedran.furac, linux-mm,
linux-kernel, minchan.kim, Andrew Morton, Andrea Arcangeli
On Wed, 28 Oct 2009, KOSAKI Motohiro wrote:
> I agree quartering is debatable.
> At least, killing quartering is worthwhile for any user, and it can be pushed into -stable.
>
Not sure where the -stable reference came from, I don't think this is a
candidate.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ea2147d..40d323d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -152,18 +152,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> /*
> * Superuser processes are usually more important, so we make it
> * less likely that we kill those.
> - */
> - if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> - has_capability_noaudit(p, CAP_SYS_RESOURCE))
> - points /= 4;
> -
> - /*
> - * We don't want to kill a process with direct hardware access.
> + *
> + * Plus, We don't want to kill a process with direct hardware access.
> * Not only could that mess up the hardware, but usually users
> * tend to only have this flag set on applications they think
> * of as important.
> */
> - if (has_capability_noaudit(p, CAP_SYS_RAWIO))
> + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> + has_capability_noaudit(p, CAP_SYS_RAWIO))
> points /= 4;
>
> /*
Acked-by: David Rientjes <rientjes@google.com>
^ permalink raw reply [flat|nested] 128+ messages in thread
* Re: Memory overcommit
2009-10-28 4:12 ` David Rientjes
@ 2009-10-28 8:10 ` Hugh Dickins
-1 siblings, 0 replies; 128+ messages in thread
From: Hugh Dickins @ 2009-10-28 8:10 UTC (permalink / raw)
To: David Rientjes
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, vedran.furac, linux-mm,
linux-kernel, minchan.kim, Andrew Morton, Andrea Arcangeli
On Tue, 27 Oct 2009, David Rientjes wrote:
>
> Not sure where the -stable reference came from, I don't think this is a
> candidate.
I agree with David, this is only one little piece of a messy puzzle,
there's no good reason to rush this into -stable.
> > + if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
> > + has_capability_noaudit(p, CAP_SYS_RESOURCE) ||
> > + has_capability_noaudit(p, CAP_SYS_RAWIO))
>
> Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
(as far as it goes: the whole thing of quartering badness here
because "we don't want to kill" and "important" is questionable;
but definitely much more open to argument both ways than sixteenthing).
^ permalink raw reply [flat|nested] 128+ messages in thread