additional oom-killer tuneable worth submitting?

public inbox for linux-kernel@vger.kernel.org

* additional oom-killer tuneable worth submitting?
@ 2006-12-07 18:30 Chris Friesen
  2006-12-07 18:50 ` Jesper Juhl
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Chris Friesen @ 2006-12-07 18:30 UTC (permalink / raw)
  To: linux-kernel

The kernel currently has a way to adjust the oom-killer score via 
/proc/<pid>/oomadj.

However, to adjust this effectively requires knowledge of the scores of 
all the other processes on the system.

I'd like to float an idea (which we've implemented and been using for 
some time) where the semantics are slightly different:

We add a new "oom_thresh" member to the task struct.
We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.

The "oom-thresh" value maps to the max expected memory consumption for 
that process.  As long as a process uses less memory than the specified 
threshold, then it is immune to the oom-killer.

On an embedded platform this allows the designer to engineer the system 
and protect critical apps based on their expected memory consumption. 
If one of those apps goes crazy and starts chewing additional memory 
then it becomes vulnerable to the oom killer while the other apps remain 
protected.

If a patch for the above feature was submitted, would there be any 
chance of getting it included?  Maybe controlled by a config option?

Chris

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 18:30 additional oom-killer tuneable worth submitting? Chris Friesen
@ 2006-12-07 18:50 ` Jesper Juhl
  2006-12-07 21:25   ` Chris Friesen
  2006-12-07 19:21 ` Peter Zijlstra
  2006-12-07 23:22 ` Alan
  2 siblings, 1 reply; 15+ messages in thread
From: Jesper Juhl @ 2006-12-07 18:50 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel

A few questions below.

On 07/12/06, Chris Friesen <cfriesen@nortel.com> wrote:
>
> The kernel currently has a way to adjust the oom-killer score via
> /proc/<pid>/oomadj.
>
> However, to adjust this effectively requires knowledge of the scores of
> all the other processes on the system.
>
> I'd like to float an idea (which we've implemented and been using for
> some time) where the semantics are slightly different:
>
> We add a new "oom_thresh" member to the task struct.
> We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.
>

How do "oomthresh" and "oomadj" affect each other?


> The "oom-thresh" value maps to the max expected memory consumption for
> that process.  As long as a process uses less memory than the specified
> threshold, then it is immune to the oom-killer.
>

Default "oomthresh" value for a new process is 0 (zero) I assume -
right?  If not, then I'd suggest that it should be.

What happens when a process fork()s? Does the child inherit the
parent's "oomthresh" value?

Would it make sense to make "oomthresh" apply to process groups
instead of processes?


> On an embedded platform this allows the designer to engineer the system
> and protect critical apps based on their expected memory consumption.
> If one of those apps goes crazy and starts chewing additional memory
> then it becomes vulnerable to the oom killer while the other apps remain
> protected.
>

What happens in the case where the OOM killer really, really needs to
kill one or more processes since there is not a single drop of memory
available, but all processes are below their configured thresholds?


> If a patch for the above feature was submitted, would there be any
> chance of getting it included?  Maybe controlled by a config option?

Impossible to know without posting the patch for review :)


-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 18:30 additional oom-killer tuneable worth submitting? Chris Friesen
  2006-12-07 18:50 ` Jesper Juhl
@ 2006-12-07 19:21 ` Peter Zijlstra
  2006-12-07 21:26   ` Chris Friesen
  2006-12-07 23:22 ` Alan
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2006-12-07 19:21 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel

On Thu, 2006-12-07 at 12:30 -0600, Chris Friesen wrote:
> The kernel currently has a way to adjust the oom-killer score via 
> /proc/<pid>/oomadj.
> 
> However, to adjust this effectively requires knowledge of the scores of 
> all the other processes on the system.
> 
> I'd like to float an idea (which we've implemented and been using for 
> some time) where the semantics are slightly different:
> 
> We add a new "oom_thresh" member to the task struct.
> We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.
> 
> The "oom-thresh" value maps to the max expected memory consumption for 
> that process.  As long as a process uses less memory than the specified 
> threshold, then it is immune to the oom-killer.

You would need to specify the measure of memory used by your process;
see the (still not resolved) RSS debate.

> On an embedded platform this allows the designer to engineer the system 
> and protect critical apps based on their expected memory consumption. 
> If one of those apps goes crazy and starts chewing additional memory 
> then it becomes vulnerable to the oom killer while the other apps remain 
> protected.
> 
> If a patch for the above feature was submitted, would there be any 
> chance of getting it included?  Maybe controlled by a config option?

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 18:50 ` Jesper Juhl
@ 2006-12-07 21:25   ` Chris Friesen
  2006-12-07 21:37     ` Jesper Juhl
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Friesen @ 2006-12-07 21:25 UTC (permalink / raw)
  To: Jesper Juhl; +Cc: linux-kernel

Jesper Juhl wrote:

> How do "oomthresh" and "oomadj" affect each other?

If memory consumption is less than "oomthresh", that process is simply 
bypassed.  (Equivalent to oomkilladj==OOM_DISABLE.)  Otherwise, continue 
processing as normal.

> Default "oomthresh" value for a new process is 0 (zero) I assume -
> right?  If not, then I'd suggest that it should be.

Correct.

> What happens when a process fork()s? Does the child inherit the
> parent's "oomthresh" value?

Currently it does not.  This is to allow for different memory access 
patterns by parent/child.  And exec() wipes it as well.

> Would it make sense to make "oomthresh" apply to process groups
> instead of processes?

Hmm...it might make sense given that the point of the group is to manage 
tasks together...but it would make accounting more tricky.  Currently 
it's just a very simple comparison of p->mm->total_vm against the 
threshold in badness().
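The semantics Chris describes (skip a task whose total_vm is below its threshold, no inheritance across fork(), reset on exec()) can be modelled in userspace as below. This is a sketch, not the patch: `struct mm_sketch`, `struct task_sketch`, `badness_sketch()` and the two hooks are hypothetical stand-ins for the kernel's mm_struct, task_struct and badness(), the threshold is assumed to be in pages like total_vm, and the classic score is reduced to total_vm alone.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for mm_struct and task_struct. */
struct mm_sketch {
    unsigned long total_vm;      /* mapped address space, in pages */
};

struct task_sketch {
    const char *comm;
    struct mm_sketch *mm;        /* NULL for kernel threads */
    unsigned long oom_thresh;    /* pages (assumed unit); 0 protects nothing */
};

/*
 * Model of badness() with the proposed threshold: a task whose total_vm is
 * below its oom_thresh scores 0 and is bypassed, much like OOM_DISABLE.
 * Otherwise the score is total_vm, where the real heuristic starts before
 * applying its other adjustments (omitted here).
 */
unsigned long badness_sketch(const struct task_sketch *p)
{
    if (p->mm == NULL)
        return 0;
    if (p->mm->total_vm < p->oom_thresh)
        return 0;
    return p->mm->total_vm;
}

/* fork(): the child, already copied from the parent, starts unprotected. */
void oom_thresh_on_fork(struct task_sketch *child)
{
    child->oom_thresh = 0;
}

/* exec(): the new program image must set its own threshold again. */
void oom_thresh_on_exec(struct task_sketch *p)
{
    p->oom_thresh = 0;
}
```

Because the comparison is on unsigned page counts, the default threshold of 0 never protects a task, which matches the default Chris confirms earlier in this message.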

> What happens in the case where the OOM killer really, really needs to
> kill one or more processes since there is not a single drop of memory
> available, but all processes are below their configured thresholds?

Then the system wasn't properly engineered.  <grin>

In this case you reboot.

Chris

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 19:21 ` Peter Zijlstra
@ 2006-12-07 21:26   ` Chris Friesen
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Friesen @ 2006-12-07 21:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel

Peter Zijlstra wrote:
> On Thu, 2006-12-07 at 12:30 -0600, Chris Friesen wrote:

>>The "oom-thresh" value maps to the max expected memory consumption for 
>>that process.  As long as a process uses less memory than the specified 
>>threshold, then it is immune to the oom-killer.
> 
> You would need to specify the measure of memory used by your process;
> see the (still not resolved) RSS debate.

Currently we simply use mm->total_vm, same as the oom killer.
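The gap between the two measures Peter and Chris mention can be seen from userspace: the VmSize field of /proc/<pid>/status is the mm's total_vm expressed in kB, while VmRSS is the resident set. A small parser, as a sketch; `status_kb()` and `self_status_kb()` are illustrative helpers, not an existing API.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value (in kB) of a field such as "VmSize" or "VmRSS" from
 * /proc/<pid>/status-formatted text, or -1 if the field is not present.
 */
long status_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Require the full key followed by ':' so "Vm" does not match "VmSize". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtol(line + klen + 1, NULL, 10);  /* skips the tab/spaces */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Convenience wrapper for the calling process; Linux-only. */
long self_status_kb(const char *key)
{
    char buf[8192];
    FILE *f = fopen("/proc/self/status", "r");
    size_t n;

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return status_kb(buf, key);
}
```

For a typical process VmSize is well above VmRSS, because total_vm counts every mapping, including shared libraries and regions that have been reserved but never touched.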

Chris

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 21:25   ` Chris Friesen
@ 2006-12-07 21:37     ` Jesper Juhl
  2006-12-07 21:57       ` Chris Friesen
  0 siblings, 1 reply; 15+ messages in thread
From: Jesper Juhl @ 2006-12-07 21:37 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel

On 07/12/06, Chris Friesen <cfriesen@nortel.com> wrote:
> Jesper Juhl wrote:
>
> > What happens in the case where the OOM killer really, really needs to
> > kill one or more processes since there is not a single drop of memory
> > available, but all processes are below their configured thresholds?
>
> Then the system wasn't properly engineered.  <grin>
>
I had a feeling you'd say that.

> In this case you reboot.
>
I realize that if this case happens the system is misconfigured as far
as oomthresh goes, but if this is a knob that we put in the mainline
kernel then I believe there should be some sort of emergency handling
code that takes this situation into account.  Perhaps throw some very
nasty looking log messages and then fall back to the classic OOM
killer behaviour..?


-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 21:37     ` Jesper Juhl
@ 2006-12-07 21:57       ` Chris Friesen
  2006-12-07 22:25         ` Jesper Juhl
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Friesen @ 2006-12-07 21:57 UTC (permalink / raw)
  To: Jesper Juhl; +Cc: linux-kernel

Jesper Juhl wrote:
>> Jesper Juhl wrote:

>> > What happens in the case where the OOM killer really, really needs to
>> > kill one or more processes since there is not a single drop of memory
>> > available, but all processes are below their configured thresholds?

> I realize that if this case happens the system is misconfigured as far
> as oomthresh goes, but if this is a knob that we put in the mainline
> kernel then I believe there should be some sort of emergency handling
> code that takes this situation into account.  Perhaps throw some very
> nasty looking log messages and then fall back to the classic OOM
> killer behaviour..?

Yeah, I can see that the reboot might be a bit drastic for mainline.  I 
think the fallback to classic behaviour might work okay.
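The fallback agreed on here might look like a two-pass selection: honour the thresholds first, and only if every task is protected, log loudly and choose as if no thresholds existed. A hedged sketch: `struct oom_candidate`, `pick_largest()` and `select_victim()` are hypothetical names, and "classic" selection is simplified to "largest total_vm" rather than the full badness() heuristic.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of a task for the OOM decision. */
struct oom_candidate {
    const char *comm;
    unsigned long total_vm;    /* pages */
    unsigned long oom_thresh;  /* pages; 0 = unprotected */
};

/* Index of the largest eligible task, or -1 if none is eligible.
 * When honour_thresh is nonzero, tasks below their threshold are skipped. */
static int pick_largest(const struct oom_candidate *t, size_t n, int honour_thresh)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (honour_thresh && t[i].total_vm < t[i].oom_thresh)
            continue;
        if (best < 0 || t[i].total_vm > t[best].total_vm)
            best = (int)i;
    }
    return best;
}

/* Honour thresholds first; if every task is protected, warn and fall back
 * to the threshold-blind choice instead of hanging or rebooting. */
int select_victim(const struct oom_candidate *t, size_t n)
{
    int victim = pick_largest(t, n, 1);

    if (victim < 0 && n > 0) {
        fprintf(stderr, "oom: all tasks below oomthresh, "
                        "falling back to classic OOM selection\n");
        victim = pick_largest(t, n, 0);
    }
    return victim;
}
```

In a kernel implementation the warning would go through printk() rather than stderr.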

Anyway, the chances of hitting that case are likely pretty slim.  The 
way we've been using this is to only set the threshold for fairly 
important long-lived daemons.  Much of the "standard" stuff (shell, cat, 
cp, mv, etc.) is left unprotected.

Chris

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 21:57       ` Chris Friesen
@ 2006-12-07 22:25         ` Jesper Juhl
  0 siblings, 0 replies; 15+ messages in thread
From: Jesper Juhl @ 2006-12-07 22:25 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel

On 07/12/06, Chris Friesen <cfriesen@nortel.com> wrote:
> Jesper Juhl wrote:
> >> Jesper Juhl wrote:
>
> >> > What happens in the case where the OOM killer really, really needs to
> >> > kill one or more processes since there is not a single drop of memory
> >> > available, but all processes are below their configured thresholds?
>
> > I realize that if this case happens the system is misconfigured as far
> > as oomthresh goes, but if this is a knob that we put in the mainline
> > kernel then I believe there should be some sort of emergency handling
> > code that takes this situation into account.  Perhaps throw some very
> > nasty looking log messages and then fall back to the classic OOM
> > killer behaviour..?
>
> Yeah, I can see that the reboot might be a bit drastic for mainline.  I
> think the fallback to classic behaviour might work okay.
>
> Anyway, the chances of hitting that case are likely pretty slim.  The
> way we've been using this is to only set the threshold for fairly
> important long-lived daemons.  Much of the "standard" stuff (shell, cat,
> cp, mv, etc.) is left unprotected.
>
Sure, that's sensible, to only protect the important stuff.
But even if the chances of hitting this are slim, we still need a way
out. For most people anything is better than a hung box.

Some examples:

 For a desktop (where people may be experimenting with the feature) -
seeing your firefox process evaporate due to the OOM killer and then
finding a message explaining what happened in dmesg is a lot less
frustrating than a hang or sudden reboot.

 For a server - if you misconfigure the new feature you may be in for
a long drive to reboot a box whereas falling back to the classic OOM
killer (+ nasty messages in dmesg) will likely save you the trip and
clue you in as to what you misconfigured.

 For an embedded box - triggering a reboot would probably be better
than either a hang or a classic OOM kill in many cases (better to have the
device reboot and come back working than to hang or start
malfunctioning due to a missing process).

So maybe what's needed is an additional knob for people to tweak - one
that selects what should happen in this rare case: 1) fall back to
classic OOM (default), 2) reboot, 3) hang.  In all cases messages
should be logged explaining what happened.
Or is that overkill?  If so I'd personally prefer just falling back to
classic OOM kill in this case.
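Jesper's three-way knob could be modelled as a small policy switch. The enum names and `oom_thresh_exhausted()` are hypothetical; the function only decides which action to take and logs it in every case, since actually rebooting or hanging is out of scope for a sketch.

```c
#include <stdio.h>

/* Hypothetical knob values for the "everyone is below oomthresh" case. */
enum oom_thresh_policy {
    OOM_THRESH_FALLBACK = 0,   /* default: classic OOM killer */
    OOM_THRESH_REBOOT   = 1,
    OOM_THRESH_HANG     = 2,
};

enum oom_action { ACTION_KILL_CLASSIC, ACTION_REBOOT, ACTION_HANG };

/* Map the configured policy to an action, logging what happened. */
enum oom_action oom_thresh_exhausted(enum oom_thresh_policy policy)
{
    switch (policy) {
    case OOM_THRESH_REBOOT:
        fprintf(stderr, "oom: all tasks below oomthresh, rebooting\n");
        return ACTION_REBOOT;
    case OOM_THRESH_HANG:
        fprintf(stderr, "oom: all tasks below oomthresh, hanging\n");
        return ACTION_HANG;
    case OOM_THRESH_FALLBACK:
    default:
        /* Unknown values are treated like the default. */
        fprintf(stderr, "oom: all tasks below oomthresh, "
                        "falling back to classic OOM kill\n");
        return ACTION_KILL_CLASSIC;
    }
}
```

Treating unknown knob values as the default keeps a mistyped setting from turning into a hang.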

With a way out for the "OOM but all processes below threshold" case,
perhaps coupled with oomthresh applying to process groups instead of
just processes, I personally start to like this feature.

Let's see some code...

-- 
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 23:22 ` Alan
@ 2006-12-07 23:21   ` Chris Friesen
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Friesen @ 2006-12-07 23:21 UTC (permalink / raw)
  To: Alan; +Cc: linux-kernel

Alan wrote:

>>The "oom-thresh" value maps to the max expected memory consumption for 
>>that process.  As long as a process uses less memory than the specified 
>>threshold, then it is immune to the oom-killer.

> You've just introduced a deadlock. What happens if nobody is over that
> predicted memory and the kernel uses more resources?

Based on the discussion with Jesper, we would fall back to regular
behaviour (or possibly hang or reboot, if we added another switch).

>>On an embedded platform this allows the designer to engineer the system 
>>and protect critical apps based on their expected memory consumption. 
>>If one of those apps goes crazy and starts chewing additional memory 
>>then it becomes vulnerable to the oom killer while the other apps remain 
>>protected.

> That is why we have no-overcommit support. Now there is an argument for
> a meaningful rlimit-as to go with it, and together I think they do what
> you really need.

No overcommit only protects the system as a whole, not any particular 
processes.  The purpose of this is to protect specific daemons from 
being killed when the system as a whole is short on memory.  Same 
rationale as for oomadj, but different knob to twiddle.

Chris

* Re: additional oom-killer tuneable worth submitting?
  2006-12-07 18:30 additional oom-killer tuneable worth submitting? Chris Friesen
  2006-12-07 18:50 ` Jesper Juhl
  2006-12-07 19:21 ` Peter Zijlstra
@ 2006-12-07 23:22 ` Alan
  2006-12-07 23:21   ` Chris Friesen
  2 siblings, 1 reply; 15+ messages in thread
From: Alan @ 2006-12-07 23:22 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel

> We add a new "oom_thresh" member to the task struct.
> We introduce a new proc entry "/proc/<pid>/oomthresh" to control it.
> 
> The "oom-thresh" value maps to the max expected memory consumption for 
> that process.  As long as a process uses less memory than the specified 
> threshold, then it is immune to the oom-killer.

You've just introduced a deadlock. What happens if nobody is over that
predicted memory and the kernel uses more resources?
> 
> On an embedded platform this allows the designer to engineer the system 
> and protect critical apps based on their expected memory consumption. 
> If one of those apps goes crazy and starts chewing additional memory 
> then it becomes vulnerable to the oom killer while the other apps remain 
> protected.

That is why we have no-overcommit support. Now there is an argument for
a meaningful rlimit-as to go with it, and together I think they do what
you really need.
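The combination suggested here can be tried from userspace. Strict overcommit is selected system-wide with `vm.overcommit_memory = 2`, and RLIMIT_AS (the limit behind the shell's `ulimit -v`) caps one process's address space. The sketch below demonstrates only the RLIMIT_AS half: it lowers the limit in a forked child, so the caller is unaffected, and reports whether a malloc() of a given size was refused. `malloc_fails_under_as_limit()` is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if malloc(request) failed in a child whose RLIMIT_AS was set
 * to 'limit' bytes, 0 if it succeeded, and -1 on any other error.
 */
int malloc_fails_under_as_limit(rlim_t limit, size_t request)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit rl = { .rlim_cur = limit, .rlim_max = limit };

        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);
        /* volatile keeps the compiler from eliding the allocation. */
        char *volatile p = malloc(request);
        _exit(p == NULL ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    status = WEXITSTATUS(status);
    return status <= 1 ? status : -1;
}
```

Note that RLIMIT_AS is still a per-process number that someone has to choose, which is the objection raised in the follow-up messages.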

Alan

* Re: additional oom-killer tuneable worth submitting?
@ 2006-12-08 13:58 Al Boldi
  2006-12-08 14:56 ` Alan
  0 siblings, 1 reply; 15+ messages in thread
From: Al Boldi @ 2006-12-08 13:58 UTC (permalink / raw)
  To: linux-kernel

Alan wrote:
> > On an embedded platform this allows the designer to engineer the system
> > and protect critical apps based on their expected memory consumption.
> > If one of those apps goes crazy and starts chewing additional memory
> > then it becomes vulnerable to the oom killer while the other apps remain
> > protected.
>
> That is why we have no-overcommit support.

Alan, I think you know that this isn't really true, due to shared-libs.

> Now there is an argument for
> a meaningful rlimit-as to go with it, and together I think they do what
> you really need.

The problem with rlimit is that it works per process.  Tuning this by hand 
may be awkward and/or wasteful.  What we need is to rlimit on a global 
basis, by calculating an upper limit dynamically, so as to avoid 
overcommit/OOM.


Thanks!

--
Al

* Re: additional oom-killer tuneable worth submitting?
  2006-12-08 13:58 Al Boldi
@ 2006-12-08 14:56 ` Alan
  2006-12-08 15:19   ` Al Boldi
  0 siblings, 1 reply; 15+ messages in thread
From: Alan @ 2006-12-08 14:56 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

On Fri, 8 Dec 2006 16:58:29 +0300
Al Boldi <a1426z@gawab.com> wrote:
> > That is why we have no-overcommit support.
> 
> Alan, I think you know that this isn't really true, due to shared-libs.

Shared libraries are correctly handled by no-overcommit and in fact they
have almost zero impact on out of memory questions because the shared
parts of the library are file backed and constant. That means they don't
actually cost swap space.

> > Now there is an argument for
> > a meaningful rlimit-as to go with it, and together I think they do what
> > you really need.
> 
> The problem with rlimit is that it works per process.  Tuning this by hand 
> may be awkward and/or wasteful.  What we need is to rlimit on a global 
> basis, by calculating an upper limit dynamically, so as to avoid 
> overcommit/OOM.

You've just described the existing no overcommit functionality, although
you've forgotten to allow for pre-reserving of stacks and some other
detail that has been found to make it work better as it has been refined.
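The global, dynamically maintained ceiling Alan refers to is the commit limit of strict overcommit. In `vm.overcommit_memory = 2` mode the kernel refuses new commitments once Committed_AS would exceed CommitLimit (both reported in /proc/meminfo), and CommitLimit is, leaving huge pages aside, swap plus RAM * vm.overcommit_ratio / 100, with a default ratio of 50. A simplified model follows; `commit_limit_kb()` and `would_refuse()` are hypothetical names, and the extra headroom the kernel keeps in reserve, as well as the refinements Alan mentions, are not modelled.

```c
/*
 * Simplified model of strict overcommit (vm.overcommit_memory = 2).
 * All quantities are in kB. Huge pages and the kernel's reserved
 * headroom are deliberately left out.
 */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int overcommit_ratio)
{
    /* CommitLimit = swap + RAM * overcommit_ratio / 100 */
    return swap_kb + ram_kb * overcommit_ratio / 100;
}

/* Nonzero if charging len_kb more would push Committed_AS past the limit. */
int would_refuse(unsigned long long committed_kb,
                 unsigned long long len_kb,
                 unsigned long long limit_kb)
{
    return committed_kb + len_kb > limit_kb;
}
```

With 1 GiB of RAM, 512 MiB of swap and the default ratio of 50, the limit is 512 MiB + 512 MiB = 1 GiB, regardless of how much of that is actually resident.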

Alan

* Re: additional oom-killer tuneable worth submitting?
  2006-12-08 14:56 ` Alan
@ 2006-12-08 15:19   ` Al Boldi
  2006-12-08 15:55     ` Alan
  0 siblings, 1 reply; 15+ messages in thread
From: Al Boldi @ 2006-12-08 15:19 UTC (permalink / raw)
  To: Alan; +Cc: linux-kernel

Alan wrote:
> Al Boldi <a1426z@gawab.com> wrote:
> > > That is why we have no-overcommit support.
> >
> > Alan, I think you know that this isn't really true, due to shared-libs.
>
> Shared libraries are correctly handled by no-overcommit and in fact they
> have almost zero impact on out of memory questions because the shared
> parts of the library are file backed and constant. That means they don't
> actually cost swap space.

What I understood from Arjan is that the problem isn't swap space, but rather
that shared-libs are implemented via a COW trick, which always overcommits, no
matter what.

> > > Now there is an argument for
> > > a meaningful rlimit-as to go with it, and together I think they do
> > > what you really need.
> >
> > The problem with rlimit is that it works per process.  Tuning this by
> > hand may be awkward and/or wasteful.  What we need is to rlimit on a
> > global basis, by calculating an upper limit dynamically, so as to avoid
> > overcommit/OOM.
>
> You've just described the existing no overcommit functionality, although
> you've forgotten to allow for pre-reserving of stacks and some other
> detail that has been found to make it work better as it has been refined.

Are you saying there is some new no-overcommit functionality in 2.6.19, or 
has this been there before?


Thanks!

--
Al

* Re: additional oom-killer tuneable worth submitting?
  2006-12-08 15:19   ` Al Boldi
@ 2006-12-08 15:55     ` Alan
  2006-12-08 16:59       ` Al Boldi
  0 siblings, 1 reply; 15+ messages in thread
From: Alan @ 2006-12-08 15:55 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

> What I understood from Arjan is that the problem isn't swapspace, but rather 
> that shared-libs are implemented via a COW trick, which always overcommits, no 
> matter what.

The zero overcommit layer accounts address space not pages.
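Accounting address space rather than pages can be made concrete with the charging rules from the kernel's overcommit-accounting documentation: file-backed mappings that are shared or read-only cost nothing (the file is the backing store), private writable mappings cost their full size per instance, anonymous shared mappings cost their size, and anonymous private read-only mappings cost nothing. A shared library's text is therefore free, while its private writable data is charged in full when mapped (and again for the child on fork()), so later copy-on-write faults have already been paid for. The function below is a hypothetical model of those rules, not a kernel API.

```c
#include <stdbool.h>

/*
 * Bytes charged against the commit limit for a new mapping of len bytes,
 * following the documented strict-overcommit rules (simplified model).
 */
unsigned long commit_charge(unsigned long len, bool file_backed,
                            bool shared, bool writable)
{
    if (file_backed)
        /* Shared or read-only file maps are backed by the file itself. */
        return (!shared && writable) ? len : 0;
    /* Anonymous (or /dev/zero) mappings. */
    if (shared)
        return len;
    return writable ? len : 0;
}
```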

> Are you saying there is some new no-overcommit functionality in 2.6.19, or 
> has this been there before?

It has been in Red Hat Enterprise Linux for a very long time, and got merged
upstream a long, long time ago too. Then it got various fixes along the way.
It's old functionality.

Alan

* Re: additional oom-killer tuneable worth submitting?
  2006-12-08 15:55     ` Alan
@ 2006-12-08 16:59       ` Al Boldi
  0 siblings, 0 replies; 15+ messages in thread
From: Al Boldi @ 2006-12-08 16:59 UTC (permalink / raw)
  To: Alan; +Cc: linux-kernel

Alan wrote:
> > What I understood from Arjan is that the problem isn't swapspace, but
> > rather that shared-libs are implemented via a COW trick, which always
> > overcommits, no matter what.
>
> The zero overcommit layer accounts address space not pages.

So OOM can still occur?

> > Are you saying there is some new no-overcommit functionality in 2.6.19,
> > or has this been there before?
>
> Red Hat Enterprise Linux for a very long time, got merged upstream a long
> long time ago too. Then got various fixes along the way. It's old
> functionality.

That's what I thought, but it's still really easy to OOM even with 
no-overcommit.

Using ulimit -v [total VMsize/runqueue] seems to inhibit this rather 
effectively, but needs to be maintained dynamically per process.
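Read literally, the `ulimit -v [total VMsize/runqueue]` rule of thumb divides a total virtual-memory budget by the number of runnable tasks and applies the result as each process's RLIMIT_AS. The sketch below is one interpretation only: what "total VMsize" should be and where the run-queue length comes from are left as inputs, and `share_kb()` and `apply_as_soft_limit_kb()` are hypothetical helpers.

```c
#include <sys/resource.h>

/* Evenly divide a virtual-memory budget (kB) across the runnable tasks. */
unsigned long long share_kb(unsigned long long total_kb, unsigned int nr_running)
{
    return nr_running ? total_kb / nr_running : total_kb;
}

/*
 * Roughly what `ulimit -S -v <kb>` does for the calling process: lower the
 * soft RLIMIT_AS, clamped so it never exceeds the hard limit.
 */
int apply_as_soft_limit_kb(unsigned long long kb)
{
    struct rlimit rl;
    rlim_t want;

    if (getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;
    want = (rlim_t)(kb * 1024ULL);
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;
    rl.rlim_cur = want;
    return setrlimit(RLIMIT_AS, &rl);
}
```

With a 4 GiB budget and four runnable tasks each share is 1 GiB; a fifth task shrinks every share to about 819 MiB, which is the per-process maintenance burden described above.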

Couldn't this be handled by the kernel?


Thanks!

--
Al

end of thread

Thread overview: 15+ messages
2006-12-07 18:30 additional oom-killer tuneable worth submitting? Chris Friesen
2006-12-07 18:50 ` Jesper Juhl
2006-12-07 21:25   ` Chris Friesen
2006-12-07 21:37     ` Jesper Juhl
2006-12-07 21:57       ` Chris Friesen
2006-12-07 22:25         ` Jesper Juhl
2006-12-07 19:21 ` Peter Zijlstra
2006-12-07 21:26   ` Chris Friesen
2006-12-07 23:22 ` Alan
2006-12-07 23:21   ` Chris Friesen
  -- strict thread matches above, loose matches on Subject: below --
2006-12-08 13:58 Al Boldi
2006-12-08 14:56 ` Alan
2006-12-08 15:19   ` Al Boldi
2006-12-08 15:55     ` Alan
2006-12-08 16:59       ` Al Boldi
