Linux killed Kenny, bastard!

public inbox for linux-kernel@vger.kernel.org

* Linux killed Kenny, bastard!
@ 2009-01-12 15:33 Evgeniy Polyakov
  2009-01-12 15:44 ` Dave Jones
                   ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 15:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Linus Torvalds

Hi.

Do you want to own a tame killer? Do you want to control the world?

Start with your computer now and own the planet next: you already have
an OOM-killer in Linux to kill for you. But to date it has been quite
berserk and usually did not kill what you would like him to murder.

Now you can add the name of the victims, which will be checked by the
oom-killer, who selects the process to kill first among the ones which
have the given string in their executable name.

By default the process to be killed is called 'Kenny', and if you like
him, change the name by calling

echo Java > /proc/sys/vm/oom_victim

Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d56fe7..26d4361 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -68,6 +68,7 @@ extern int print_fatal_signals;
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern int sysctl_panic_on_oom;
+extern char oom_victim_name[];
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_oom_dump_tasks;
 extern int max_threads;
@@ -1185,6 +1186,15 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "oom_victim",
+		.data		= oom_victim_name,
+		.maxlen		= TASK_COMM_LEN,
+		.mode		= 0644,
+		.proc_handler	= &proc_dostring,
+		.strategy	= &sysctl_string,
+	},
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a0a0190..12419f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -28,6 +28,8 @@
 #include <linux/memcontrol.h>
 #include <linux/security.h>
 
+char oom_victim_name[TASK_COMM_LEN] = "Kenny";
+
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks;
@@ -205,8 +207,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 	struct task_struct *g, *p;
 	struct task_struct *chosen = NULL;
 	struct timespec uptime;
+	char *name = oom_victim_name;
 	*ppoints = 0;
 
+again:
 	do_posix_clock_monotonic_gettime(&uptime);
 	do_each_thread(g, p) {
 		unsigned long points;
@@ -223,6 +227,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		if (mem && !task_in_mem_cgroup(p, mem))
 			continue;
 
+		if (name && !strstr(p->comm, name))
+			continue;
+
 		/*
 		 * This task already has access to memory reserves and is
 		 * being killed. Don't allow any other task access to the
@@ -263,6 +270,15 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
 		}
 	} while_each_thread(g, p);
 
+	/*
+	 * We did not find the process with requested string in its name,
+	 * so lets search for the usual victim.
+	 */
+	if (name && !chosen) {
+		name = NULL;
+		goto again;
+	}
+
 	return chosen;
 }
 


-- 
	Evgeniy Polyakov
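
The selection logic in the patch above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: `struct task`,
`select_victim()` and the `points` field are hypothetical stand-ins for
`task_struct`, `select_bad_process()` and the badness score, and the real
function skips tasks for several other reasons that are omitted here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the fields of task_struct used below. */
struct task {
	const char *comm;      /* executable name, like task_struct->comm */
	unsigned long points;  /* stand-in for the badness score */
};

/*
 * Pick the task with the highest score among those whose name contains
 * `name`. If no task matches, drop the name filter and scan again, as
 * the `goto again` in the patch does. Returns the index of the chosen
 * task or -1; *scans counts the full passes over the task list.
 */
int select_victim(const struct task *tasks, size_t n,
		  const char *name, int *scans)
{
	int chosen = -1;
	unsigned long best = 0;
	size_t i;

	*scans = 0;
again:
	(*scans)++;
	for (i = 0; i < n; i++) {
		/* Same test as the patch: skip non-matching names. */
		if (name && !strstr(tasks[i].comm, name))
			continue;
		if (tasks[i].points > best || chosen < 0) {
			chosen = (int)i;
			best = tasks[i].points;
		}
	}

	/* Nothing matched the requested name: fall back to a plain scan. */
	if (name && chosen < 0) {
		name = NULL;
		goto again;
	}
	return chosen;
}
```

Note that when the requested name matches nothing, the task list is walked
twice before a victim is returned.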


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:33 Linux killed Kenny, bastard! Evgeniy Polyakov
@ 2009-01-12 15:44 ` Dave Jones
  2009-01-12 15:48   ` Evgeniy Polyakov
  2009-01-12 15:49 ` Linux killed Kenny, bastard! Alan Cox
  2009-01-13 16:35 ` KOSAKI Motohiro
  2 siblings, 1 reply; 71+ messages in thread
From: Dave Jones @ 2009-01-12 15:44 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 06:33:05PM +0300, Evgeniy Polyakov wrote:
 > Hi.
 > 
 > Do you want to own a tame killer? Do you want to control the world?
 > 
 > Start with your computer now and own the planet next: you already have
 > an OOM-killer in the Linux to kill for you. But to date it was quite
 > berserk and usually killed not what you would like him to murder.
 > 
 > Now you can add a name of the victims, which will be checked by the
 > oom-killer, who select the process to kill first among the ones which
 > have given string in their executable name.
 > 
 > By default the process to be killed is called 'Kenny', and if you like
 > him, change the name by calling

I realise it ruins the joke, and it sounds unlikely, but anyone who
happens to have a process called 'Kenny' might be unpleasantly surprised
by this.

If we merge this feature, I think it should default to just using the
existing heuristic.

	Dave

-- 
http://www.codemonkey.org.uk


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:44 ` Dave Jones
@ 2009-01-12 15:48   ` Evgeniy Polyakov
  2009-01-12 15:51     ` Alan Cox
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 15:48 UTC (permalink / raw)
  To: Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

Hi Dave.

On Mon, Jan 12, 2009 at 10:44:56AM -0500, Dave Jones (davej@redhat.com) wrote:
>  > Do you want to own a tame killer? Do you want to control the world?
>  > 
>  > Start with your computer now and own the planet next: you already have
>  > an OOM-killer in the Linux to kill for you. But to date it was quite
>  > berserk and usually killed not what you would like him to murder.
>  > 
>  > Now you can add a name of the victims, which will be checked by the
>  > oom-killer, who select the process to kill first among the ones which
>  > have given string in their executable name.
>  > 
>  > By default the process to be killed is called 'Kenny', and if you like
>  > him, change the name by calling
> 
> I realise it ruins the joke, and it sounds unlikely, but anyone who
> happens to have a process called 'Kenny' might be unpleasantly surprised
> by this.
> 
> If we merge this feature, I think it should default to just using the
> existing heuristic.

Well, Kenny has to die, but if we still decide to change the world, here
is the first step.

--- ./mm/oom_kill.c~	2009-01-12 17:51:23.000000000 +0300
+++ ./mm/oom_kill.c	2009-01-12 18:48:04.000000000 +0300
@@ -28,7 +28,7 @@
 #include <linux/memcontrol.h>
 #include <linux/security.h>
 
-char oom_victim_name[TASK_COMM_LEN] = "Kenny";
+char oom_victim_name[TASK_COMM_LEN] = "";
 
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;


-- 
	Evgeniy Polyakov
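
A side effect of the empty default is worth spelling out: the filter in the
first patch stays active, because `name` still points at the (now empty)
array, but `strstr()` with an empty needle matches every string, so no task
is skipped. A minimal sketch, where `passes_name_filter()` is a hypothetical
helper that only mirrors the `if (name && !strstr(p->comm, name))` test:

```c
#include <string.h>

/*
 * Returns 1 if a task named `comm` would survive the name filter from
 * the patch, 0 if the loop would `continue` past it.
 */
int passes_name_filter(const char *comm, const char *name)
{
	/* A NULL name disables the filter; an empty name matches all. */
	if (name && !strstr(comm, name))
		return 0;
	return 1;
}
```

So with "" as the default, the first scan behaves like the existing
heuristic, and the second scan only happens if no task was chosen at all.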


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:33 Linux killed Kenny, bastard! Evgeniy Polyakov
  2009-01-12 15:44 ` Dave Jones
@ 2009-01-12 15:49 ` Alan Cox
  2009-01-12 15:50   ` Evgeniy Polyakov
  2009-01-13 16:35 ` KOSAKI Motohiro
  2 siblings, 1 reply; 71+ messages in thread
From: Alan Cox @ 2009-01-12 15:49 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

On Mon, 12 Jan 2009 18:33:05 +0300
Evgeniy Polyakov <zbr@ioremap.net> wrote:

> Hi.
> 
> Do you want to own a tame killer? Do you want to control the world?

We've got /proc/*/oom_adj already


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:49 ` Linux killed Kenny, bastard! Alan Cox
@ 2009-01-12 15:50   ` Evgeniy Polyakov
  2009-01-12 15:52     ` Alan Cox
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 15:50 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 03:49:22PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Do you want to own a tame killer? Do you want to control the world?
> 
> We've got /proc/*/oom_adj already

Which has to be checked for every process ever created,
which is quite unfeasible in some conditions.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:48   ` Evgeniy Polyakov
@ 2009-01-12 15:51     ` Alan Cox
  2009-01-12 15:52       ` Evgeniy Polyakov
  2009-01-13 13:52       ` [why oom_adj does not work] " Evgeniy Polyakov
  0 siblings, 2 replies; 71+ messages in thread
From: Alan Cox @ 2009-01-12 15:51 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

> Well, Kenny has to die, but if we still decide to change the world, here
> is the first step.

NAK this entire thing - we have an existing interface that does the job
far better.



* Re: Linux killed Kenny, bastard!
  2009-01-12 15:51     ` Alan Cox
@ 2009-01-12 15:52       ` Evgeniy Polyakov
  2009-01-12 21:29         ` Chris Snook
  2009-01-13 13:52       ` [why oom_adj does not work] " Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 15:52 UTC (permalink / raw)
  To: Alan Cox; +Cc: Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Well, Kenny has to die, but if we still decide to change the world, here
> > is the first step.
> 
> NAK this entire thing - we have an existing interface that does the job
> far better.

Modulo the fact that it does not work for the quickly created processes
which do not have their oom scores adjusted before the oom.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:50   ` Evgeniy Polyakov
@ 2009-01-12 15:52     ` Alan Cox
  2009-01-12 15:56       ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Alan Cox @ 2009-01-12 15:52 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

On Mon, 12 Jan 2009 18:50:30 +0300
Evgeniy Polyakov <zbr@ioremap.net> wrote:

> On Mon, Jan 12, 2009 at 03:49:22PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > > Do you want to own a tame killer? Do you want to control the world?
> > 
> > We've got /proc/*/oom_adj already
> 
> Which has to be checked for every process ever created,
> which is quite unfeasible in some conditions.

The task name is not a reliable indicator of the true name and is
truncated, so it is useless. You only nominate one task, and you don't
integrate with the existing interface.

What you actually need is notifiers to work on /proc (exactly the same as
we need to avoid the bogus waitfd crap). At that point you can implement
arbitrary policy by using dnotify/inotify/etc on /proc.

Alan
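
The truncation point can be illustrated with a small userspace sketch. It
assumes the usual TASK_COMM_LEN of 16 bytes (15 characters plus the
terminating NUL); `make_comm()` and `name_matches()` are hypothetical helpers
that only imitate how a name ends up in `comm` and how the patch matches it
with `strstr()`.

```c
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16	/* assumed: 15 characters plus NUL */

/* Copy an executable name into a comm-sized buffer, truncating it. */
void make_comm(char comm[TASK_COMM_LEN], const char *exe_name)
{
	snprintf(comm, TASK_COMM_LEN, "%s", exe_name);
}

/* Case-sensitive substring match, like the strstr() test in the patch. */
int name_matches(const char *comm, const char *victim)
{
	return strstr(comm, victim) != NULL;
}
```

With these helpers, an executable called "fastcgi-worker-Kenny" is stored as
"fastcgi-worker-" and no longer matches "Kenny", "java" does not match
"Java", and an unrelated "McKenny" does match "Kenny".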


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:52     ` Alan Cox
@ 2009-01-12 15:56       ` Evgeniy Polyakov
  2009-01-12 16:19         ` Alan Cox
  2009-01-12 16:22         ` Dave Jones
  0 siblings, 2 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 15:56 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 03:52:39PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Which has to be checked for every process ever created,
> > which is quite unfeasible in some conditions.
> 
> The task name is not a reliable indicator of true name and truncated so
> is useless. You only nominate one task, you don't integrate with the
> existing interface.

Not one, but all tasks which have the given string in their name, like
scripts spawned during a DoS.

> What you actually need is notifiers to work on /proc (exactly the same as
> we need to avoid the bogus waitfd crap). At that point you can implement
> arbitrary policy by using dnotify/inotify/etc on /proc.

Yes, it could be done, provided the inotify daemon is not killed itself,
inotify is enabled in the config and the daemon has been started.
But right now there is no way to solve this task; in the long term this
is a good idea to implement, modulo the security problems it may raise.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:56       ` Evgeniy Polyakov
@ 2009-01-12 16:19         ` Alan Cox
  2009-01-12 16:29           ` Evgeniy Polyakov
  2009-01-12 16:22         ` Dave Jones
  1 sibling, 1 reply; 71+ messages in thread
From: Alan Cox @ 2009-01-12 16:19 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

> Yes, it could be done. If inotify will not be killed itself, will be
> enabled in the config and daemon will be started.
> But right now there is no way to solve that task, in the long term this
> is a good idea to implement modulo security problems it may concern.

It is perfectly soluble right now, use the existing /proc interface. If
you want to specifically victimise new tasks first then set everything
else with an adjust *against* being killed and new stuff will start off
as cannon fodder until classified.

The name approach is the wrong way to handle this. It does not reflect
the process hierarchy, targeting by users, containers, etc.

In fact containers are probably the right way to do it
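
The oom_adj approach described above could look roughly like this from
userspace. It is a sketch, not a tested tool: `write_oom_adj()` is a
hypothetical helper, the `proc_root` parameter exists only so the function
can be pointed at a directory other than /proc, and the accepted range of
-17 (never kill) to 15 is the one documented for /proc/<pid>/oom_adj in
kernels of this era.

```c
#include <stdio.h>
#include <sys/types.h>

#define OOM_DISABLE    (-17)	/* documented "never kill" value */
#define OOM_ADJUST_MAX 15	/* most preferred victim */

/*
 * Write `adj` to <proc_root>/<pid>/oom_adj.
 * Returns 0 on success, -1 on a bad value or an I/O error.
 */
int write_oom_adj(const char *proc_root, pid_t pid, int adj)
{
	char path[256];
	FILE *f;
	int n, ok;

	if (adj < OOM_DISABLE || adj > OOM_ADJUST_MAX)
		return -1;

	n = snprintf(path, sizeof(path), "%s/%ld/oom_adj",
		     proc_root, (long)pid);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", adj) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A supervisor would call something like `write_oom_adj("/proc", pid, -17)`
for each task it wants protected, leaving newly created, unclassified tasks
at the default score. Lowering a score normally requires privilege.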


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:56       ` Evgeniy Polyakov
  2009-01-12 16:19         ` Alan Cox
@ 2009-01-12 16:22         ` Dave Jones
  2009-01-12 16:28           ` Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: Dave Jones @ 2009-01-12 16:22 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 06:56:15PM +0300, Evgeniy Polyakov wrote:
 > On Mon, Jan 12, 2009 at 03:52:39PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
 > > > Which has to be checked for every process ever created,
 > > > which is quite unfeasible in some conditions.
 > > 
 > > The task name is not a reliable indicator of true name and truncated so
 > > is useless. You only nominate one task, you don't integrate with the
 > > existing interface.
 > 
 > Not one, but tasks which have the given string in the name. Like script
 > names spawned at DoS time.

There is also the problem that process names aren't unique.
If the process table contains two entries called 'Kenny', there's nothing
that says they came from the same executable.

	Dave

-- 
http://www.codemonkey.org.uk


* Re: Linux killed Kenny, bastard!
  2009-01-12 16:22         ` Dave Jones
@ 2009-01-12 16:28           ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 16:28 UTC (permalink / raw)
  To: Dave Jones, Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 11:22:09AM -0500, Dave Jones (davej@redhat.com) wrote:
>  > Not one, but tasks which have the given string in the name. Like script
>  > names spawned at DoS time.
> 
> There is also the problem that process names aren't unique.
> If the process table contains two entries called 'Kenny', there's nothing
> that says they came from the same executable.

Agreed, the oom-killer will compute their points, and if they are really
different applications, the 'bad' one will be killed.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 16:19         ` Alan Cox
@ 2009-01-12 16:29           ` Evgeniy Polyakov
  2009-01-12 23:00             ` Bill Davidsen
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 16:29 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 04:19:31PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Yes, it could be done. If inotify will not be killed itself, will be
> > enabled in the config and daemon will be started.
> > But right now there is no way to solve that task, in the long term this
> > is a good idea to implement modulo security problems it may concern.
> 
> It is perfectly soluble right now, use the existing /proc interface. If
> you want to specifically victimise new tasks first then set everything
> else with an adjust *against* being killed and new stuff will start off
> as cannon fodder until classified.
> 
> The name approach is the wrong way to handle this. It has no reflection
> of hierarchy of process, targeting by users, containers etc.
> 
> In fact containers are probably the right way to do it

Containers to solve oom-killer selection problem? :)

More seriously, I agree that a simple name does not solve the problem
from every angle, but that is not the main goal.
The patch solves the oom-killer selection issue for what is likely the
most common case: when you know who should be checked and killed first
when the problem appears.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 15:52       ` Evgeniy Polyakov
@ 2009-01-12 21:29         ` Chris Snook
  2009-01-12 21:42           ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Chris Snook @ 2009-01-12 21:29 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Alan Cox, Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

Evgeniy Polyakov wrote:
> On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
>>> Well, Kenny has to die, but if we still decide to change the world, here
>>> is the first step.
>> NAK this entire thing - we have an existing interface that does the job
>> far better.
> 
> Modulo the fact that it does not work for the quickly created processes
> which do not have their oom scores adjusted before the oom.
> 

cgroups solve this problem much more cleanly.

-- Chris


* Re: Linux killed Kenny, bastard!
  2009-01-12 21:29         ` Chris Snook
@ 2009-01-12 21:42           ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 21:42 UTC (permalink / raw)
  To: Chris Snook
  Cc: Alan Cox, Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 04:29:10PM -0500, Chris Snook (csnook@redhat.com) wrote:
> >Modulo the fact that it does not work for the quickly created processes
> >which do not have their oom scores adjusted before the oom.
> 
> cgroups solve this problem much more cleanly.

When they are configured and enabled :)
And actually not, since having two separate groups may still result in
the wrong oom-killing: a single group would have to contain all
potentially 'bad' processes, so that it is examined first instead of
running the whole scan.

Having a name to kill is far simpler than anything else, and while
this may not be the finest-grained solution, it is the most
obvious and the simplest to work with.

I do agree that there are other ways to solve the same problem, and
likely they provide better control, but their setup/control cost is
incomparable with that of a simple name-based scheme which selects
'victim' processes by their scores.

Effectively it is similar to the oom_kill_allocating_task trick, whose
effect can also be achieved by adjusting the oom-score of every other
process in the system, by putting the task into a separate group, or by
something else. But it is still much simpler to have a single flag which
solves the problem, maybe not optimally, but close to it in most cases.

My patch does the same: it allows selecting a set of processes by the
given string in the executable name, and then picks a victim among them
based on the existing scores. This is the simplest and thus could be
the most useful case.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 16:29           ` Evgeniy Polyakov
@ 2009-01-12 23:00             ` Bill Davidsen
  2009-01-12 23:17               ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Bill Davidsen @ 2009-01-12 23:00 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

Evgeniy Polyakov wrote:
> On Mon, Jan 12, 2009 at 04:19:31PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
>>> Yes, it could be done. If inotify will not be killed itself, will be
>>> enabled in the config and daemon will be started.
>>> But right now there is no way to solve that task, in the long term this
>>> is a good idea to implement modulo security problems it may concern.
>> It is perfectly soluble right now, use the existing /proc interface. If
>> you want to specifically victimise new tasks first then set everything
>> else with an adjust *against* being killed and new stuff will start off
>> as cannon fodder until classified.
>>
>> The name approach is the wrong way to handle this. It has no reflection
>> of hierarchy of process, targeting by users, containers etc.
>>
>> In fact containers are probably the right way to do it
> 
> Containers to solve oom-killer selection problem? :)
> 
> Being more serious, I agree that having a simple name does not solve the
> problem if observed from any angle, but it is not the main goal.
> Patch solves oom-killer selection issue from likely the most commonly
> used case: when you know who should be checked and killed first when
> problem appears.
> 
The only cases in which this would really be useful are when running some
software which once in a great while goes super prompt critical and starts 
throwing processes of a known name format in all directions, or when you have a 
problem and know the process names involved before OOM kills everything in sight.

This does have a strange attraction, I did save the patch in case another "every 
few years" problem comes up.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: Linux killed Kenny, bastard!
  2009-01-12 23:00             ` Bill Davidsen
@ 2009-01-12 23:17               ` Evgeniy Polyakov
  2009-01-13  1:53                 ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-12 23:17 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 06:00:10PM -0500, Bill Davidsen (davidsen@tmr.com) wrote:
> >Being more serious, I agree that having a simple name does not solve the
> >problem if observed from any angle, but it is not the main goal.
> >Patch solves oom-killer selection issue from likely the most commonly
> >used case: when you know who should be checked and killed first when
> >problem appears.
> >
> The only cases in which this would really be useful is when running some 
> software which once in a great while goes super prompt critical and starts 
> throwing processes of a known name format in all directions, or when you 
> have a problem and know the process names involved before OOM kills 
> everything in sight.

Like anything that spawns a thread or process per request/client, or
preallocates a set of them which connect to a huge object like a database.
Most of the time the database/server is killed first instead of the
comparatively small clients. In some cases it is possible to tune the
environment, in others it is not that simple. This patch works perfectly
for such situations and does not impose an additional administrative
burden, since it does not make things worse as a whole, but only better
for the very commonly used cases; that's why I propose it for inclusion.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-12 23:17               ` Evgeniy Polyakov
@ 2009-01-13  1:53                 ` David Rientjes
  2009-01-13  8:52                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-13  1:53 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Bill Davidsen, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Like anything that spawns a thread or process per request/client, or
> preallocates set of them which connect to the huge object like database.
> Most of the time database/server is killed first instead of comparably
> small clients.

No, the reverse is true: when a task is chosen for oom kill based on the 
badness heuristic, the oom killer first attempts to kill any child task 
that isn't attached to the same mm.  If the child shares an mm, both tasks 
must die before memory freeing can occur.

> In some cases it is possible to tune the environment, in
> others it is not that simple. This patch works for such situations
> perfectly and does not require additional administrative burden, since
> it does not make things worse as a whole, but only better for the very
> commonly used cases, that's why I propose it for inclusion.
> 

It's an inappropriate addition since /proc/pid/oom_adj scores exist which 
can prefer or protect certain tasks over others when the oom killer 
chooses a target, including oom kill immunity.  These scores are inherited 
from parent tasks and can be tuned after the fork to your oom kill target 
preference.


* Re: Linux killed Kenny, bastard!
  2009-01-13  1:53                 ` David Rientjes
@ 2009-01-13  8:52                   ` Evgeniy Polyakov
  2009-01-13  9:54                     ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13  8:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Bill Davidsen, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Mon, Jan 12, 2009 at 05:53:47PM -0800, David Rientjes (rientjes@google.com) wrote:
> On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:
> 
> > Like anything that spawns a thread or process per request/client, or
> > preallocates set of them which connect to the huge object like database.
> > Most of the time database/server is killed first instead of comparably
> > small clients.
> 
> No, the reverse is true: when a task is chosen for oom kill based on the 
> badness heuristic, the oom killer first attempts to kill any child task 
> that isn't attached to the same mm.  If the child shares an mm, both tasks 
> must die before memory freeing can occur.

That is theory, not practice. On the tested machines the OOM-killer most
of the time starts with ssh, the database and lighttpd, when it could
start in the reverse order and not touch ssh at all. Better still, it
should start not with the daemon itself but with its fastcgi-spawned
processes.

> > In some cases it is possible to tune the environment, in
> > others it is not that simple. This patch works for such situations
> > perfectly and does not require additional administrative burden, since
> > it does not make things worse as a whole, but only better for the very
> > commonly used cases, that's why I propose it for inclusion.
> > 
> 
> It's an inappropriate addition since /proc/pid/oom_adj scores exist which 
> can prefer or protect certain tasks over others when the oom killer 
> chooses a target, including oom kill immunity.  These scores are inherited 
> from parent tasks and can be tuned after the fork to your oom kill target 
> preference.

I agree that there are ways to tune how the oom-killer selects the
victim, and likely after hours of games this will subtly work for the
specified workload. What I propose is the simplest way for the most
commonly used case. It is a help for the admin, not something that
forces him to invent complex machinery which will be error-prone and hard
to debug when oom eventually happens. This will work, but it is way more complex
than what I propose, without immediately visible net effects on other
parts of the originally balanced system.

-- 
	Evgeniy Polyakov


* Re: Linux killed Kenny, bastard!
  2009-01-13  8:52                   ` Evgeniy Polyakov
@ 2009-01-13  9:54                     ` David Rientjes
  2009-01-13 11:54                       ` Evgeniy Polyakov
  2009-01-13 13:41                       ` Jan-Frode Myklebust
  0 siblings, 2 replies; 71+ messages in thread
From: David Rientjes @ 2009-01-13  9:54 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Bill Davidsen, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> It is a theory, not a practice. OOM-killer most of time starts from ssh,
> database and lighttpd on the tested machines, when it could start in
> the reverse order and do not touch ssh at all. Better not from daemon
> itself, but its fastcgi spawned processes.
> 

In the unconstrained system-wide oom case, it scans each task on the 
system (which can take very long, ask SGI) and rates its badness scoring.  
When a memory-hogging task is identified, which you have complete control 
over in userspace by tuning /proc/pid/oom_adj, it attempts to kill a child 
first if it will allow for memory freeing without killing the parent.

> I agree, that there are ways to tune the way oom-killer selects the
> victim, and likely after hours of games this subtly will work for the
> specified workload.

It doesn't involve "hours of games," it is a very simple heuristic that 
you can easily tune to specify your preferences.

What you're looking for with your patch is simply a way to specify an oom 
preference before the task has been forked, but that's simple to do with 
the current logic since oom_adj scores are inherited and preference is 
given to killing a child before parent.

> What I propose is the simplest way for the most
> commonly used case.

No, procfs is the correct interface for tuning oom kill preferences, not
name parsing.

With oom_adj scores, you have the ability to specify oom kill preferences 
within a cpuset or memory controller as well, whereas oom_victim_name is 
global and very costly when not found in select_bad_process().

> It is a help for the admin and not the force to
> invent complex machinery which will be error-prone and hard to debug
> when eventually oom happens.

It's very simple to debug the oom killer's decisions, which is why I 
introduced /proc/sys/vm/oom_dump_tasks.

It also requires two expensive scans of the entire tasklist (I introduced 
/proc/sys/vm/oom_kill_allocating_task specifically to avoid _one_ 
expensive scan) when oom_victim_name isn't found.


* Re: Linux killed Kenny, bastard!
  2009-01-13  9:54                     ` David Rientjes
@ 2009-01-13 11:54                       ` Evgeniy Polyakov
  2009-01-13 12:15                         ` Alan Cox
                                           ` (2 more replies)
  2009-01-13 13:41                       ` Jan-Frode Myklebust
  1 sibling, 3 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 11:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Bill Davidsen, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, Jan 13, 2009 at 01:54:02AM -0800, David Rientjes (rientjes@google.com) wrote:
> > It is a theory, not a practice. OOM-killer most of time starts from ssh,
> > database and lighttpd on the tested machines, when it could start in
> > the reverse order and do not touch ssh at all. Better not from daemon
> > itself, but its fastcgi spawned processes.
> 
> In the unconstrained system-wide oom case, it scans each task on the 
> system (which can take very long, ask SGI) and rates its badness scoring.  
> When a memory-hogging task is identified, which you have complete control 
> over in userspace by tuning /proc/pid/oom_adj, it attempts to kill a child 
> first if it will allow for memory freeing without killing the parent.

Is this supposed to explain why ssh is killed?

> > I agree, that there are ways to tune the way oom-killer selects the
> > victim, and likely after hours of games this subtly will work for the
> > specified workload.
> 
> It doesn't involve "hours of games," it is a very simple heuristic that 
> you can easily tune to specify your preferences.
> 
> What you're looking for with your patch is simply a way to specify an oom 
> preference before the task has been forked, but that's simple to do with 
> the current logic since oom_adj scores are inherited and preference is 
> given to killing a child before parent.

It is a very subtle approach. Consider the case when you have a pool of
threads/processes which are created and released on demand; there are
several such pools for different servers, and you do know which one
is very likely to be guilty.

Who should adjust the scores for newly created processes? Who should
check that processes in the first group have a negative oom adjustment
and those in the second group a positive value? Who determines when it's
time to adjust the scores?

> > What I propose is the simplest way for the most
> > commonly used case.
> 
> No, procfs is the correct interface for tuning oom kill preferences and 
> not by name parsing.
> 
> With oom_adj scores, you have the ability to specify oom kill preferences 
> within a cpuset or memory controller as well, whereas oom_victim_name is 
> global and very costly when not found in select_bad_process().
>
> > It is a help for the admin and not the force to
> > invent complex machinery which will be error-prone and hard to debug
> > when eventually oom happens.
> 
> It's very simple to debug the oom killer's decisions, which is why I 
> introduced /proc/sys/vm/oom_dump_tasks.
> 
> It also requires two expensive scans of the entire tasklist (I introduced 
> /proc/sys/vm/oom_kill_allocating_task specifically to avoid _one_ 
> expensive scan) when oom_victim_name isn't found.

It is not really costly, since most of the time we skip an entry without
locking the task or calculating its badness value. No one worries
that 'ps ax' is costly because it has to run through all the processes.

Messing with the scores is actually more expensive, since we have to lock
the task and perform a calculation. I do not say it is wrong, but it is
a much more complex thing to keep stable than simple task selection
by name. Selection by name is what is used and what people expect to
have (that's actually why it was implemented :), rather than writing
daemons to monitor the clients and the appropriate processes, or
changing the code of the servers.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 11:54                       ` Evgeniy Polyakov
@ 2009-01-13 12:15                         ` Alan Cox
  2009-01-13 12:29                           ` Evgeniy Polyakov
  2009-01-13 19:15                         ` David Rientjes
  2009-01-13 23:26                         ` Valdis.Kletnieks
  2 siblings, 1 reply; 71+ messages in thread
From: Alan Cox @ 2009-01-13 12:15 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Bill Davidsen, linux-kernel, Andrew Morton,
	Linus Torvalds

> Who should adjust the scores for newly created processes? Who should
> check that processes in the first group have negative oom adjustment and
> in the second group a positive value? Who determines when it's time to
> adjust the scores?

This is policy. Where does policy go? User space.


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 12:15                         ` Alan Cox
@ 2009-01-13 12:29                           ` Evgeniy Polyakov
  2009-01-13 13:19                             ` Theodore Tso
  2009-01-13 19:36                             ` David Rientjes
  0 siblings, 2 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 12:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: David Rientjes, Bill Davidsen, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, Jan 13, 2009 at 12:15:10PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Who should adjust the scores for newly created processes? Who should
> > check that processes in the first group have negative oom adjustment and
> > in the second group a positive value? Who determines when it's time to
> > adjust the scores?
> 
> This is policy. Where does policy go ? User space

Don't you notice how many 'who' questions were asked and that there was
only a single 'user space' answer? Because it is not an answer, it is a
theoretical POV which does not really work in practice: it is way too
inconvenient and error-prone, and it does not work when needed, since
because of its complexity something will be missed. I've just talked with
the admins who originally requested the 'kill-by-name' feature about why
they did not work with /proc/.../oom_adj, and got a nice answer: we tried,
but likely something went wrong and it did not work the way we wanted.

There is no way to know that the adjustment is correct, that everything
was up to date when the oom happened, and that nothing was forgotten.
Practice shows that there are always such problems and the wrong tasks
get killed.

When you put in a name you do know that it works, since there is only a
single place to update and no need to bother with ugly tools or changes,
especially to handle short-lived processes.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 12:29                           ` Evgeniy Polyakov
@ 2009-01-13 13:19                             ` Theodore Tso
  2009-01-13 13:35                               ` Evgeniy Polyakov
  2009-01-13 13:47                               ` Alan Cox
  2009-01-13 19:36                             ` David Rientjes
  1 sibling, 2 replies; 71+ messages in thread
From: Theodore Tso @ 2009-01-13 13:19 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Alan Cox, David Rientjes, Bill Davidsen, linux-kernel,
	Andrew Morton, Linus Torvalds

Instead of trying to specify which process should be protected from
the OOM killer by name, how about something which is inherited from
the parent process?  After all, if having the child not get killed due
to OOM is important, the child won't even have a chance to run if the
parent gets killed off.  And in fact, we have something that fits that
bill fairly well; getrlimit()/setrlimit().  Why not define a new
resource limit which specifies a relative immunity to the oom_killer?

Most of the infrastructure to support that will already be in place
(i.e., shell support, PAM support in /etc/security/limits.conf); all
that would need to be done is to teach a few userspace
programs/libraries about the new resource limit.
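
The inheritance property this proposal relies on can be checked with any
existing resource limit. A minimal sketch, assuming RLIMIT_CORE as a
stand-in (no oom-related resource limit exists; that part is only the
proposal above) and a made-up helper name child_inherits_rlimit():

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Lower the soft limit of `resource` to `new_soft`, fork, and let the
 * child report whether it sees the same soft limit.
 * Returns 1 if the child inherited it, 0 if not, -1 on error.
 * The caller's original limit is restored before returning.
 * Hypothetical helper, for illustration only.
 */
int child_inherits_rlimit(int resource, rlim_t new_soft)
{
	struct rlimit old, rl;
	pid_t pid;
	int status;

	if (getrlimit(resource, &old) != 0)
		return -1;

	/* The soft limit may never exceed the hard limit. */
	if (new_soft > old.rlim_max)
		new_soft = old.rlim_max;

	rl = old;
	rl.rlim_cur = new_soft;
	if (setrlimit(resource, &rl) != 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		setrlimit(resource, &old);
		return -1;
	}
	if (pid == 0) {
		/* Child: limits are copied from the parent at fork time. */
		struct rlimit seen;
		int ok = getrlimit(resource, &seen) == 0 &&
			 seen.rlim_cur == new_soft;
		_exit(ok ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0) {
		setrlimit(resource, &old);
		return -1;
	}
	setrlimit(resource, &old);	/* put the caller's limit back */
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling child_inherits_rlimit(RLIMIT_CORE, 0) should return 1 on a
typical system, which is the behaviour a new oom-related limit would
get for free from the existing infrastructure.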

This would be a much cleaner approach, I would think.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 13:19                             ` Theodore Tso
@ 2009-01-13 13:35                               ` Evgeniy Polyakov
  2009-01-14  0:24                                 ` Bill Davidsen
  2009-01-13 13:47                               ` Alan Cox
  1 sibling, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 13:35 UTC (permalink / raw)
  To: Theodore Tso, Alan Cox, David Rientjes, Bill Davidsen,
	linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 08:19:37AM -0500, Theodore Tso (tytso@mit.edu) wrote:
> Instead of trying to specify which process should be protected from
> the OOM killer by name, how about something which is inherited from
> the parent process?  After all, if having the child not get killed due
> to OOM is important, the child won't even have a chance to run if the
> parent gets killed off.  And in fact, we have something that fits that
> bill fairly well; getrlimit()/setrlimit().  Why not define a new
> resource limit which specifies a relative immunity to the oom_killer?
> 
> Most of the infrastructure to support that will already be in place
> (i.e., shell support, PAM support in /etc/security/limits.conf); all
> that would need to be done is to teach a few userspace
> programs/libraries about the new resource limit.
> 
> This would be a much cleaner approach, I would think.

It will be similar to oom_adj parameter (although I did not find where
it is inherited from the parent), but with the different updating
interface. I do not think it will be anyhow easier to solve the problem,
since it is not directly in the parent/child hierarchy, since there are
cases when we do want to kill children (this phrase just screams for the
addition: and eat them), but only some processes which are not really
the most significant.

The existing oom score adjustment mechanism works for these cases, but
by itself it is not convenient to use. Even its documentation does not
say how it is applied :) It is not a simple add/remove: the score is
multiplied or divided by two to the power of the oom_adj value.
Plus really no one knows how scores are calculated except those who read
mm/oom_kill.c before going to sleep.
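
To make the multiply/divide behaviour concrete, here is a minimal
userspace sketch of how an oom_adj value scales a raw score. The helper
name apply_oom_adj() and the modelling of the special value -17 as a
zero score are assumptions of this sketch, not kernel code; the value
range (-16 to +15, plus -17 to disable) is the one given in
Documentation/filesystems/proc.txt.

```c
#define SKETCH_OOM_DISABLE (-17)	/* special value: never pick this task */

/*
 * Scale a raw badness score by 2^oom_adj: positive values shift the
 * score left (multiply), negative values shift it right (divide).
 * Hypothetical helper, for illustration only.
 */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
	if (oom_adj == SKETCH_OOM_DISABLE)
		return 0;	/* sketch: model "immune" as a zero score */
	if (oom_adj > 0)
		return points << oom_adj;
	if (oom_adj < 0)
		return points >> -oom_adj;
	return points;
}
```

With a raw score of 1000, an oom_adj of 3 gives 8000 and an oom_adj of
-3 gives 125, so every step doubles or halves the score rather than
adding a fixed amount.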

So effectively oom_adj only works as an enable/disable switch, and since
no one knows how to tune it, it is better not to touch it at all. And get
ssh killed. I believe that if it is ever used, then only to disable oom
killing entirely, which is wrong, since the task should still be
killable, just after some others. My patch adds a simple priority for
that, based on the name of the process, which is known to the
administrators who maintain the given system.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13  9:54                     ` David Rientjes
  2009-01-13 11:54                       ` Evgeniy Polyakov
@ 2009-01-13 13:41                       ` Jan-Frode Myklebust
  2009-01-13 13:59                         ` Alan Cox
  1 sibling, 1 reply; 71+ messages in thread
From: Jan-Frode Myklebust @ 2009-01-13 13:41 UTC (permalink / raw)
  To: linux-kernel

On 2009-01-13, David Rientjes <rientjes@google.com> wrote:
>
> When a memory-hogging task is identified, which you have complete control 
> over in userspace by tuning /proc/pid/oom_adj, it attempts to kill a child 
> first if it will allow for memory freeing without killing the parent.

So an alternative to Evgeniy Polyakov's patch would be:

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..5dcfc88 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,10 +2311,19 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
 
+Child processes will inherit the parent oom_score, so to launch a potential
+rogue process that you want to be the primary target of the oom-killer, that
+can be done by adjusting the score of the parent process, before launching the
+potential rogue process. F.ex. to make sure the process "Kenny" will be a
+prime candidate to get killed:
+
+       echo 15 > /proc/self/oom_adj
+       ./Kenny
+       echo -15 > /proc/self/oom_adj
+
 2.13 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 
-------------------------------------------------------------------------------
 This file can be used to check the current score used by the oom-killer is for
 any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
 process should be killed in an out-of-memory situation.



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 13:19                             ` Theodore Tso
  2009-01-13 13:35                               ` Evgeniy Polyakov
@ 2009-01-13 13:47                               ` Alan Cox
  1 sibling, 0 replies; 71+ messages in thread
From: Alan Cox @ 2009-01-13 13:47 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Evgeniy Polyakov, David Rientjes, Bill Davidsen, linux-kernel,
	Andrew Morton, Linus Torvalds

> (i.e., shell support, PAM support in /etc/security/limits.conf); all
> that would need to be done is to teach a few userspace
> programs/libraries about the new resource limit.

You don't even need that - just define the behaviour of oom_adj to
inherit.

Of course that's still often totally the wrong behaviour as you'll find out
when a key system service is started up by a client that wants to use it
that was run by some low privilege untrusted process with tight resource
limits.

For desktop Jim Gettys proposed a very simple and quite elegant use of
this sort of thing which was to let the window manager do some of the
work according to what hadn't been used for ages.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-12 15:51     ` Alan Cox
  2009-01-12 15:52       ` Evgeniy Polyakov
@ 2009-01-13 13:52       ` Evgeniy Polyakov
  2009-01-13 14:06         ` Alan Cox
  1 sibling, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 13:52 UTC (permalink / raw)
  To: Alan Cox; +Cc: Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

On Mon, Jan 12, 2009 at 03:51:08PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Well, Kenny has to die, but if we still decide to change the world, here
> > is the first step.
> 
> NAK this entire thing - we have an existing interface that does the job
> far better.

Mwahaha, I just checked how scores are calculated, so that userspace
could adjust them. Let's start from the beginning:

	list_for_each_entry(child, &p->children, sibling) {
		task_lock(child);
		if (child->mm != mm && child->mm)
			points += child->mm->total_vm/2 + 1;
		task_unlock(child);
	}

	/*
	 * CPU time is in tens of seconds and run time is in thousands
	 * of seconds. There is no particular reason for this other than
	 * that it turned out to work very well in practice.
	 */
	cpu_time = (cputime_to_jiffies(p->utime) + cputime_to_jiffies(p->stime))
		>> (SHIFT_HZ + 3);

	if (uptime >= p->start_time.tv_sec)
		run_time = (uptime - p->start_time.tv_sec) >> 10;
	else
		run_time = 0;

	s = int_sqrt(cpu_time);
	if (s)
		points /= s;
	s = int_sqrt(int_sqrt(run_time));
	if (s)
		points /= s;

Do you _REALLY_ think anyone can calculate it yourself and then properly
calculate adjustment used to properly select oom-killed process?

I can not, and would not even try if I were an admin of the given
system. So, Alan, until you can calculate those numbers in your head and
then do this for a whole heavily loaded system, please do not spread the
idea that oom_adj can be used to tune the oom-killer.
And no, reading data from /proc/.../oom_score is not enough, since the
scores change with time, so the adjustment would have to be retuned
again and again as well.
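
The arithmetic in the excerpt above can be replayed outside the kernel.
A minimal sketch, assuming the score starts from the task's total_vm
plus the per-child contribution, and that cpu_time and run_time have
already been converted to the units named in the comment (tens of
seconds and thousands of seconds); sketch_int_sqrt() and
sketch_badness() are names made up for this example:

```c
/* Floor of the square root, standing in for the kernel's int_sqrt(). */
unsigned long sketch_int_sqrt(unsigned long x)
{
	unsigned long result = 0;
	unsigned long bit = 1UL << (sizeof(x) * 8 - 2);

	/* Start from the highest power of four that is <= x. */
	while (bit > x)
		bit >>= 2;

	while (bit != 0) {
		if (x >= result + bit) {
			x -= result + bit;
			result = (result >> 1) + bit;
		} else {
			result >>= 1;
		}
		bit >>= 2;
	}
	return result;
}

/*
 * Replay the divisions from the excerpt: the raw score is divided by
 * sqrt(cpu_time) and by the fourth root of run_time, skipping either
 * division when the divisor would be zero.
 */
unsigned long sketch_badness(unsigned long total_vm,
			     unsigned long children_points,
			     unsigned long cpu_time,
			     unsigned long run_time)
{
	unsigned long points = total_vm + children_points;
	unsigned long s;

	s = sketch_int_sqrt(cpu_time);
	if (s)
		points /= s;
	s = sketch_int_sqrt(sketch_int_sqrt(run_time));
	if (s)
		points /= s;
	return points;
}
```

For a task with total_vm of 100000 pages, no children, cpu_time of 100
and run_time of 256 in those units, the divisors are 10 and 4, giving a
score of 2500. The same task right after start, with both times at zero,
scores 100000, which shows how far the number drifts as time accrues.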

So far my patch is the sanest way to deal with the OOM selection, when
we have to differentiate some processes. I agree, it is not the best
solution, but it is way ahead of what we have right now for the users
and not hardcore kernel hackers.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 13:41                       ` Jan-Frode Myklebust
@ 2009-01-13 13:59                         ` Alan Cox
  0 siblings, 0 replies; 71+ messages in thread
From: Alan Cox @ 2009-01-13 13:59 UTC (permalink / raw)
  To: Jan-Frode Myklebust; +Cc: linux-kernel

> +potential rogue process. F.ex. to make sure the process "Kenny" will be a
> +prime candidate to get killed:
> +
> +       echo 15 > /proc/self/oom_adj
> +       ./Kenny
> +       echo -15 > /proc/self/oom_adj
> +

This is a bogus and silly example - but it does show why the whole thing
is not needed

	(echo "15" >/proc/self/oom_adj; exec ./Kenny) &

See it's like nice, you can do it yourself anyway if you are
co-operating, and if you aren't co-operating you need containers anyway...
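
The same fork/adjust/exec pattern as the one-liner above, written out in
C as a rough sketch. The helper name run_with_oom_adj() is made up here,
and the path is a parameter so the sketch can be tried against an
ordinary file; for the real thing it would be "/proc/self/oom_adj",
which in the child refers to the child itself.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork, write `value` to `adj_path` in the child, then exec argv[0].
 * The parent is left untouched. Returns the child's exit status, or
 * -1 on error. Hypothetical helper, for illustration only.
 */
int run_with_oom_adj(const char *adj_path, const char *value,
		     char *const argv[])
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Same open flags a shell uses for "> file". */
		int fd = open(adj_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		size_t len = strlen(value);

		if (fd < 0 || write(fd, value, len) != (ssize_t)len)
			_exit(126);
		close(fd);

		execvp(argv[0], argv);
		_exit(127);	/* only reached if exec failed */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the write happens after fork() and before exec(), only the new
program carries the adjusted value, which is the point of the subshell
in the one-liner.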

Alan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 13:52       ` [why oom_adj does not work] " Evgeniy Polyakov
@ 2009-01-13 14:06         ` Alan Cox
  2009-01-13 14:24           ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Alan Cox @ 2009-01-13 14:06 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

> Do you _REALLY_ think anyone can calculate it yourself and then properly
> calculate adjustment used to properly select oom-killed process?

It's always a heuristic.

> So far my patch is the sanest way to deal with the OOM selection

No. You keep maintaining this but your crude hack is useless in a non
co-operative environment, has lots of issues with name aliasing and
doesn't deal with real needs.

We have container interfaces that can do this and far more and do them
right. In fact the very start of all the OpenVZ and container work years
ago was the beancounter patches which were addressed at exactly this
problem (although more specifically 'making sure undergraduates' processes
get killed first')

Alan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 14:06         ` Alan Cox
@ 2009-01-13 14:24           ` Evgeniy Polyakov
  2009-01-13 15:00             ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 14:24 UTC (permalink / raw)
  To: Alan Cox; +Cc: Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 02:06:27PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
> > Do you _REALLY_ think anyone can calculate it yourself and then properly
> > calculate adjustment used to properly select oom-killed process?
> 
> Its always a heuristic.

For the system, which knows what it is. The user does not and really can
not work with it, since there is no sane way to implement that heuristic
in the applications or even in a (theoretically possible) monitor daemon.

So, effectively, oom adjustment does not work.

> > So far my patch is the sanest way to deal with the OOM selection
> 
> No. You keep maintaining this but your crude hack is useless in a non
> co-operative environment, has lots of issue with name aliasing and
> doesn't deal with real needs.

It is created because of real needs. Because people need to control the
behaviour of the system and they want to control which application will
be killed to free the memory. The attached patch is not the best
solution, but it works for all the cases I can think of.

Let's take your 'name aliasing' claim: if there are several processes
with the same name, the system will select the one with the worst score
according to its own magical algorithm. So it will not kill a random
process just because it happened to have a tricky name.

And the same applies to the other issues. It just helps system to select
the process to be killed according to userspace expectation of what
should be killed to free the memory.

> We have container interfaces that can do this and far more and do them
> right. In fact the very start of all the OpenVZ and container work years
> ago was the beancounter patches which were addressed at exactly this
> problem (although more specifically 'making sure undergraduates processes
> get killed first')

Are the beancounters used to limit the amount of virtual ram and not the
physical one? That really does not work to limit, for example, some java
machine which will eat all the virtual space, swapping out a different
node. It works for some (and likely the most, I do not argue this) cases
and has overhead. But we are talking not about how to limit the
processes, but about what to do when we happen to hit an out-of-memory
condition. And that happens all the time even if you put the processes
into separate containers, since there are situations (that's why this
was started in the first place) when you have a huge process which
should not be killed and a set of either its children or external
processes which should be checked, and some of them (the administrator
would like to specify the less important ones) should be killed without
much harm to the system.

And the patch I presented allows exactly that. It introduces a hint for
the killer about which processes should be checked first. It works
exactly the way people work with their systems: they run different
applications and expect some of them to have higher or lower priority
when it comes to the oom condition. No one proposes to kill exactly the
process we select (although that may be a good idea in some cases), but
instead to show that the oom-killer should check a given group first:
the group the administrator knows to be potentially harmless.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 14:24           ` Evgeniy Polyakov
@ 2009-01-13 15:00             ` Balbir Singh
  2009-01-13 15:21               ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2009-01-13 15:00 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Alan Cox, Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 7:54 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> On Tue, Jan 13, 2009 at 02:06:27PM +0000, Alan Cox (alan@lxorguk.ukuu.org.uk) wrote:
>> > Do you _REALLY_ think anyone can calculate it yourself and then properly
>> > calculate adjustment used to properly select oom-killed process?
>>
>> Its always a heuristic.
>
> For the system which knows what it is. User does not and really can not
> work with it, since there is no sane way to implement that heuristic in
> the applications or even in (theoretically possible) monitor daemon.
>
> So, effectively, oom adjustment does not work.
>
>> > So far my patch is the sanest way to deal with the OOM selection
>>
>> No. You keep maintaining this but your crude hack is useless in a non
>> co-operative environment, has lots of issue with name aliasing and
>> doesn't deal with real needs.
>
> It is created because of real needs. Because people need to control the
> behaviour of the system and they want to control which application will
> be killed to free the memory. Attached patch is not the best solution,
> but it works for the all cases I can think about.
>

Where does this end? Tomorrow you'll add an interface for applications
that should *not* be killed? What sort of a heuristic is name? I think
the only name the kernel knows about is "init".

> Let's take you 'name aliasing' claim: if there are several processes
> with the same name, system will select the one with the worst score
> according to the own magical algorithm. So it will not kill random
> process just because it happened to have tricky name.
>

Having a name in the kernel is like building a hit-list, why can't the
examples that Alan sent work for you?
Names are tricky as well, if someone used a symbolic link to the
application with a different name, they would no longer be candidates
for OOM first? or vice-versa?

> And the same applies to the other issues. It just helps system to select
> the process to be killed according to userspace expectation of what
> should be killed to free the memory.
>
>> We have container interfaces that can do this and far more and do them
>> right. In fact the very start of all the OpenVZ and container work years
>> ago was the beancounter patches which were addressed at exactly this
>> problem (although more specifically 'making sure undergraduates processes
>> get killed first')
>
> Are the beancounters used to limit amount of virtual ram and not the
> physical one? It really does not work to limit for example some java
> machine which will ate all virtual space swapping out different node.
> It works for some (and likely the most, I do not argue this) cases and
> has overhead. But we are talking not about how to limit the processes,
> but what to do when we happened to have out-of-memory condition. And it
> happens all the time even if you put the processes into the separate
> container, since there are situations (that's why it was started at
> first), when you have a huge process which should not be killed and set
> of either its children or external processes, which should be checked
> and some of them (administrator would like to specify the less
> important) should be killed without much harm to the system.
>
> And patch I presented allows to do it. It introduces a hint for the
> killer on what processes should be checked first. It works exactly the
> way people work with their system: they run different application and
> expect some of them to be higher or lower priority when things come to
> the oom condition. No one ever proposes to kill exactly the process we
> select (although that may be a good idea in some cases), but instead to
> show that oom-killer should check given group first. The group
> administrator knows to be potentially harmless.
>

You can replace the lines of kernel code you wrote with a simple
one-line script that Alan sent out.

Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 15:00             ` Balbir Singh
@ 2009-01-13 15:21               ` Evgeniy Polyakov
  2009-01-13 18:04                 ` Valdis.Kletnieks
  2009-01-13 19:46                 ` David Rientjes
  0 siblings, 2 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 15:21 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Alan Cox, Dave Jones, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 08:30:16PM +0530, Balbir Singh (balbir@linux.vnet.ibm.com) wrote:
> > It is created because of real needs. Because people need to control the
> > behaviour of the system and they want to control which application will
> > be killed to free the memory. Attached patch is not the best solution,
> > but it works for the all cases I can think about.
> >
> 
> Where does this end? Tomorrow you'll add an interface for applications
> that should *not* be killed? What sort of a heuristic is name? I think
> the only name the kernel knows about is "init".

We have an interface to disable oom for the process already :)
But I could agree that it could be a good idea to have an interface
to provide a list of names, or whatever else the user knows and works
with, to select what is killed first/last.

> > Let's take you 'name aliasing' claim: if there are several processes
> > with the same name, system will select the one with the worst score
> > according to the own magical algorithm. So it will not kill random
> > process just because it happened to have tricky name.
> >
> 
> Having a name in the kernel is like building a hit-list, why can't the
> examples that Alan sent work for you?

Using oom_adj? Because there is no way I can determine which number to
put there. It is not even documented for those who do not read kernel
sources. And even then: oom_score changes with time, so if an oom_adj of
1/2 or 8 is correct right now, it will not be in a few moments.

Having containers is a bit overkill to determine which one to kill,
especially when several sets of processes are created from the same
parent :)

> Names are tricky as well, if someone used a symbolic link to the
> application with a different name, they would no longer be candidates
> for OOM first? or vice-versa?

It is up to the user to decide what he wants to be checked first.
Only user knows what he runs.

> You can replace the lines of kernel code you wrote with a simple
> one-line script that Alan sent out.

Almost. But I can not if the tasks are spawned from the parent process.
We can not change the process so that its forked children get a
different adjustment, and can not change the adjustment of the process
itself, since it should live and the children should die.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-12 15:33 Linux killed Kenny, bastard! Evgeniy Polyakov
  2009-01-12 15:44 ` Dave Jones
  2009-01-12 15:49 ` Linux killed Kenny, bastard! Alan Cox
@ 2009-01-13 16:35 ` KOSAKI Motohiro
  2009-01-13 22:04   ` Evgeniy Polyakov
  2 siblings, 1 reply; 71+ messages in thread
From: KOSAKI Motohiro @ 2009-01-13 16:35 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: kosaki.motohiro, linux-kernel, Andrew Morton, Linus Torvalds

Hi

sorry... I also don't like this patch.


> @@ -263,6 +270,15 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  		}
>  	} while_each_thread(g, p);
>  
> +	/*
> +	 * We did not find the process with requested string in its name,
> +	 * so lets search for the usual victim.
> +	 */
> +	if (name && !chosen) {
> +		name = NULL;
> +		goto again;
> +	}
> +
>  	return chosen;

This patch makes oom handling slower.
Slow bad-process selection can cause yet another out-of-memory condition.

Then your trouble becomes larger.
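
The cost being described comes from the "goto again" fallback in the
quoted hunk. A minimal userspace sketch of that two-pass selection,
assuming precomputed scores in a flat array instead of the real
tasklist; struct sketch_task, sketch_select() and the visited counter
are inventions of this example:

```c
#include <stddef.h>
#include <string.h>

struct sketch_task {
	const char *comm;	/* executable name */
	unsigned long points;	/* precomputed badness score */
};

/*
 * Pick the highest-scoring task. If victim_name is non-NULL, first
 * consider only tasks whose name contains it; if none matches, rescan
 * the whole list without the filter. *visited counts how many list
 * entries were walked in total.
 */
const struct sketch_task *sketch_select(const struct sketch_task *tasks,
					size_t n, const char *victim_name,
					size_t *visited)
{
	const struct sketch_task *chosen = NULL;
	const char *name = victim_name;
	size_t i;

	*visited = 0;
again:
	for (i = 0; i < n; i++) {
		(*visited)++;
		if (name && !strstr(tasks[i].comm, name))
			continue;
		if (!chosen || tasks[i].points > chosen->points)
			chosen = &tasks[i];
	}

	/* Nothing matched the requested name: fall back to a full scan. */
	if (name && !chosen) {
		name = NULL;
		goto again;
	}
	return chosen;
}
```

With four tasks and a victim name that matches nothing, the sketch walks
eight entries instead of four. When several tasks share the requested
name, the one with the highest score among them is chosen.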




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 15:21               ` Evgeniy Polyakov
@ 2009-01-13 18:04                 ` Valdis.Kletnieks
  2009-01-13 19:46                 ` David Rientjes
  1 sibling, 0 replies; 71+ messages in thread
From: Valdis.Kletnieks @ 2009-01-13 18:04 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Balbir Singh, Alan Cox, Dave Jones, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, 13 Jan 2009 18:21:06 +0300, Evgeniy Polyakov said:

> Using oom_adj? Because there is no way I can determine which number to
> put there. It is not even documented for those who do not read kernel
> sources. Even after that: oom_score changes with time, and having 1/2 or
> 8 oom_adj is correct right now, it will not be in a few moments.

In that case, the *real* problem to be fixed is a lack of documentation.
It should be possible to add a blurb somewhere in Documentation/* that
says:

"echo 10000 > oom_adjust" is guaranteed to make this process the first one
up against the wall when the revolution comes (for some value of 10000, of
course).

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 11:54                       ` Evgeniy Polyakov
  2009-01-13 12:15                         ` Alan Cox
@ 2009-01-13 19:15                         ` David Rientjes
  2009-01-13 22:00                           ` Evgeniy Polyakov
  2009-01-13 23:26                         ` Valdis.Kletnieks
  2 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-13 19:15 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Should this explain why ssh is killed?
> 

If you would like to make sshd immune from the oom killer, use

	echo -17 > /proc/$(pidof sshd)/oom_adj

just like any other task.  This score will be inherited by any task that 
it executes, so you'll probably want to readjust your shell's oom_adj 
score appropriately in your rc file.

> It is very subtle approach. Consider the case when you have a pool of
> threads/processes which are created and released on demand, there are
> several such pools for different servers and you do know which one
> will very likely being guilty.
> 
> Who should adjust the scores for newly created processes? Who should
> check that processes in the first group have negative oom adjustment and
> in the second group a positive value? Who determines when it's time to
> adjust the scores?
> 

It is userspace's responsibility to set the policy, the kernel merely 
provides the mechanism.

> > With oom_adj scores, you have the ability to specify oom kill preferences 
> > within a cpuset or memory controller as well, whereas oom_victim_name is 
> > global and very costly when not found in select_bad_process().
> >

You chose not to respond to this, which is a major flaw in your approach.  

Your patch makes cpuset and memory controller oom killing much slower 
because it requires two iterations through the system tasklist when your 
global oom_victim_name task is either not running or in a disjoint cpuset 
or memcg.

> It is not really costly, since most of the time we skip an entry and do
> not lock the task and do not calculate its badness value. No one scares
> that 'ps ax' is costly because it has to run through all the processes.
> 

Talk to SGI about oom killer tasklist scans for their large systems; it 
was a prerequisite for me to provide /proc/sys/vm/oom_kill_allocating_task 
to avoid a single scan when I made cpuset-constrained ooms go through 
select_bad_process().

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 12:29                           ` Evgeniy Polyakov
  2009-01-13 13:19                             ` Theodore Tso
@ 2009-01-13 19:36                             ` David Rientjes
  2009-01-13 21:46                               ` Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-13 19:36 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Don't you notice how many 'who' were placed and only single 'user space'
> answer? Becasue it is not an answer, it is a theoretical POV, which does
> not really work in practice, since it is way too unconvenient and
> error-prone, and actually it does not work when needed, since because of
> its complexity something will be missed. I've just talked with the
> admins who originally requested 'kill-by-name' feature why they did not
> work with /proc/.../oom_adj, and got a nice answer: we tries, but
> likely something went wrong and it did not work the way we wanted.
> 
> There is no way to know that adjustment is correct, that everything was
> uptodate when oom happend, that nothing was forgotten and practice shows
> that there are always such problems and invalid tasks are killed.
> 
> When you put a name you do know that it works, since it is only single
> place to be updated and no need to bother with ugly tools or changes
> especially to handle short-living processes.
> 

The goal of the oom killer is to kill a rogue memory hogging task, which 
will lead to future memory freeing once the task dies, and allow the 
system or container to resume normal operation.

You're not realizing the power of /proc/pid/oom_adj: it allows you to tune 
the badness scoring so that YOU, the user, may determine what the 
definition of 'rogue' is on a task-by-task basis.

Your patch simply allows users to specify a task by name that will always 
be killed first when the oom killer is invoked.  That's terribly 
insufficient if another task uses an excessive amount of memory that you 
didn't expect; a rogue task may be leaking memory and the task you've 
identified by name with your patch is repeatedly forked and killed when 
the rogue task goes untouched.

With oom_adj scores, you can easily specify at what point each task should 
be considered rogue.  You can elevate the oom_adj score for those you have 
a preference to kill and reduce the oom_adj score for those that you'd 
prefer being deferred _unless_ they get sufficiently out of hand.

Your patch presents a shortcut where the entire badness scoring (and, 
thus, all oom_adj scores) is ignored if the named task exists.  That not 
only has synchronization issues, but also can cause the kernel to loop 
forever in killing a task by the same name without ever freeing memory for 
anything else.

Additionally, your patch completely breaks cpuset oom killing since 
candidacy is determined in badness() because a task may have allocated 
non-migrated memory elsewhere before being moved to a different cpuset.  
Your oom_victim_name task may exist globally, but will always be 
identified for oom kill even when the oom exists exclusively in a disjoint 
cpuset.  That does _not_ lead to future memory freeing that current can 
use, and if the parent of the killed task decides to immediately fork 
another instance, this cpuset will be completely livelocked. 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 15:21               ` Evgeniy Polyakov
  2009-01-13 18:04                 ` Valdis.Kletnieks
@ 2009-01-13 19:46                 ` David Rientjes
  2009-01-13 21:33                   ` Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-13 19:46 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Balbir Singh, Alan Cox, Dave Jones, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, 13 Jan 2009, Evgeniy Polyakov wrote:

> Using oom_adj? Because there is no way I can determine which number to
> put there. It is not even documented for those who do not read kernel
> sources. Even after that: oom_score changes with time, and having 1/2 or
> 8 oom_adj is correct right now, it will not be in a few moments.
> 

Your oom_adj scores should never need to be changed unless you're tuning 
the inherited value of a child; it simply represents your input into when 
a specific task should be considered rogue enough to target.

However, patches to improve the documentation of the oom killer, or any 
other kernel feature, are always welcome.
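
For those who do not read kernel sources, here is a rough sketch of how
an oom_adj value enters the score. It assumes the adjustment is a plain
bit shift of the badness points, which is how mm/oom_kill.c of this era
does it as far as I can tell; the helper name adjusted_points() is made
up, and treating -17 as "score 0" is a simplification of "never
selected".

```c
#include <assert.h>

#define OOM_DISABLE	(-17)	/* value that exempts a task entirely */
#define OOM_ADJUST_MAX	15

/*
 * Hypothetical helper: apply an oom_adj value to a raw badness score.
 * Each positive step doubles the score, each negative step halves it.
 */
unsigned long adjusted_points(unsigned long points, int oom_adj)
{
	assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);

	if (oom_adj == OOM_DISABLE)
		return 0;		/* simplification: never selected */
	if (oom_adj > 0) {
		if (!points)
			points = 1;	/* give the shift something to scale */
		return points << oom_adj;
	}
	return points >> -oom_adj;
}
```

Under that assumption the value is a relative weight: an oom_adj of 3
makes a task look eight times worse than its raw score, whatever that
score happens to be at the moment.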

> > You can replace the lines of kernel code you wrote with a simple
> > one-line script that Alan sent out.
> 
> Almost. But I can not if tasks are spawned from the parent process. We
> can not change the process to adjust its forked children to have
> different adjustment and can not change it for the process itself, since
> it should live and children should be dead.
> 

Children are already preferred over the chosen parent task, as I've 
explained a few times.  When a task is identified for oom kill by the 
badness heuristics, the oom killer attempts to kill a child that does not 
share the same mm first, which is exactly what you're asking for here.  If 
the parent shares the mm, it needs to exit as well before memory freeing 
may occur.
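
A minimal sketch of that child preference, with a made-up struct proc
standing in for the real task structures (the mm_id field is an
assumption used only to model "shares the same mm"):

```c
#include <assert.h>
#include <stddef.h>

struct proc {				/* hypothetical, illustration only */
	int mm_id;			/* tasks sharing an mm share this id */
	const struct proc *children;	/* array of direct children */
	size_t nchildren;
};

/*
 * Given the task chosen by the badness heuristics, prefer a child that
 * has its own mm; fall back to the chosen task itself otherwise.
 */
const struct proc *pick_kill_target(const struct proc *chosen)
{
	size_t i;

	assert(chosen != NULL);
	for (i = 0; i < chosen->nchildren; i++)
		if (chosen->children[i].mm_id != chosen->mm_id)
			return &chosen->children[i];
	return chosen;
}
```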

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 19:46                 ` David Rientjes
@ 2009-01-13 21:33                   ` Evgeniy Polyakov
  2009-01-13 21:39                     ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 21:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Alan Cox, Dave Jones, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, Jan 13, 2009 at 11:46:14AM -0800, David Rientjes (rientjes@google.com) wrote:
> Children are already preferred over the chosen parent task, as I've 
> explained a few times.  When a task is identified for oom kill by the 
> badness heuristics, the oom killer attempts to kill a child that does not 
> share the same mm first, which is exactly what you're asking for here.  If 
> the parent shares the mm, it needs to exit as well before memory freeing 
> may occur.

I did not really investigate why it happened, but the oom'ed machine had
killed the cgi daemons and the parent process itself. And ssh on top of
that. While it should have been enough just to kill the appropriate
daemon. Apparently things are not as shiny as they should be.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 21:33                   ` Evgeniy Polyakov
@ 2009-01-13 21:39                     ` David Rientjes
  2009-01-13 22:05                       ` Evgeniy Polyakov
  2009-01-14 16:12                       ` OOM documentation update [was: Linux killed Kenny, bastard!] Evgeniy Polyakov
  0 siblings, 2 replies; 71+ messages in thread
From: David Rientjes @ 2009-01-13 21:39 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Balbir Singh, Alan Cox, Dave Jones, linux-kernel, Andrew Morton,
	Linus Torvalds

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> I really did not investigate why it happend, but oom'ed machine had
> killed cgi daemons and parent process itself. And ssh to the heap.
> While it should be enough just to kill appropriate daemon. Apparently
> things are not that shine as should be.
> 

As previously mentioned, you have all the diagnostic tools at your 
disposal already:

	echo 1 > /proc/sys/vm/oom_dump_tasks

The badness scoring is straight-forward given that information, so you can 
diagnose why a specific task was not killed and another was chosen.  You 
can also use that information to appropriately tune the oom_adj scores to 
identify your oom killer target preferences.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 19:36                             ` David Rientjes
@ 2009-01-13 21:46                               ` Evgeniy Polyakov
  2009-01-13 22:49                                 ` Theodore Tso
  2009-01-13 23:10                                 ` David Rientjes
  0 siblings, 2 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 21:46 UTC (permalink / raw)
  To: David Rientjes; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 11:36:04AM -0800, David Rientjes (rientjes@google.com) wrote:
> The goal of the oom killer is to kill a rogue memory hogging task, which 
> will lead to future memory freeing once the task dies, and allow the 
> system or container to resume normal operation.
> 
> You're not realizing the power of /proc/pid/oom_adj: it allows you to tune 
> the badness scoring so that YOU, the user, may determine what the 
> definition of 'rogue' is on a task-by-task basis.
> 
> Your patch simply allows users to specify a task by name that will always 
> be killed first when the oom killer is invoked.  That's terribly 
> insufficient if another task uses an excessive amount of memory that you 
> didn't expect; a rogue task may be leaking memory and the task you've 
> identified by name with your patch is repeatedly forked and killed when 
> the rogue task goes untouched.

It is up to the user to decide; exactly the same will happen if you tune
the oom_adj for the task.

> With oom_adj scores, you can easily specify at what point each task should 
> be considered rogue.  You can elevate the oom_adj score for those you have 
> a preference to kill and reduce the oom_adj score for those that you'd 
> prefer being deferred _unless_ they get sufficiently out of hand.

No, you can not. Did you try that? The only sane way to use oom_adj is
to disable the oom killer for the task or to make its score very small or
very big; there is really no way to do fine-grained tuning, since the
score changes and userspace does not know the algorithm.

> Your patch presents a shortcut where the entire badness scoring (and, 
> thus, all oom_adj scores) is ignored if the named task exists.  That not 
> only has synchronization issues, but also can cause the kernel to loop 
> forever in killing a task by the same name without ever freeing memory for 
> anything else.

No, that's not what it does. The patch selects the process with the
highest score among those that have a matching name. Please check it
again.

> Additionally, your patch completely breaks cpuset oom killing since 
> candidacy is determined in badness() because a task may have allocated 
> non-migrated memory elsewhere before being moved to a different cpuset.  
> Your oom_victim_name task may exist globally, but will always be 
> identified for oom kill even when the oom exists exclusively in a disjoint 
> cpuset.  That does _not_ lead to future memory freeing that current can 
> use, and if the parent of the killed task decides to immediately fork 
> another instance, this cpuset will be completely livelocked. 

Please check the patch first. It selects the process according to its
badness, checks the memory group first, and falls back to scanning the
other processes if no process with the given name was found or the name
is null.

The user does not work with some magically calculated scores; he just
starts the processes and knows only their names. The user can specify a
pid, but in the case of short-living connections that is not possible.
Changing the parent's oom score opens a huge possibility of killing it,
while in the case of some application server (or database) it should
never be killed, and only some of its clients (which work for the users
and not for the calculating backend, for example) have to be killed.

There is no way to implement it with short-living pids (they are
unknown; if inotify worked with /proc it could be doable though,
except that a special daemon would be needed), and we can not change the
parent's score as was suggested.

You claim that the existing scheme works, but in practice it does not.
So I created a patch which brings the solution somewhat closer. It is not
perfect, but it works, compared to what was suggested from the
theoretical point of view.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 19:15                         ` David Rientjes
@ 2009-01-13 22:00                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 22:00 UTC (permalink / raw)
  To: David Rientjes; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 11:15:26AM -0800, David Rientjes (rientjes@google.com) wrote:
> > Should this explain why ssh is killed?
> 
> If you would like to make sshd immune from the oom killer, use
> 
> 	echo -17 > /proc/$(pidof sshd)/oom_adj
> 
> just like any other task.  This score will be inherited by any task that 
> it executes, so you'll probably want to readjust your shell's oom_adj 
> score appropriately in your rc file.

For every process? What about short-living ones? Again, the parent can
not be changed, since it has to stay; only its children should be killed
(and in some cases not all, but only those from a special set, like cgi
daemons started by the clients, and not database connections).

> > It is very subtle approach. Consider the case when you have a pool of
> > threads/processes which are created and released on demand, there are
> > several such pools for different servers and you do know which one
> > will very likely being guilty.
> > 
> > Who should adjust the scores for newly created processes? Who should
> > check that processes in the first group have negative oom ajustment and
> > in the second group a positive value? Who determines when its time to
> > ajust the scores?
> > 
> 
> It is userspace's responsibility to set the policy, the kernel merely 
> provides the mechanism.

Which does not work for the specified cases. There is no way to specify
the pid of the short-living processes, and the parent can not be changed
(at least for a long enough time).

> > > With oom_adj scores, you have the ability to specify oom kill preferences 
> > > within a cpuset or memory controller as well, whereas oom_victim_name is 
> > > global and very costly when not found in select_bad_process().
> > >
> 
> You chose not to respond to this, which is a major flaw in your approach.  
> 
> Your patch makes cpuset and memory controller oom killing much slower 
> because it requires two iterations through the system tasklist when your 
> global oom_victim_name task is either not running or in a disjoint cpuset 
> or memcg.

It does exactly the same as what happens for usual processes; it just
selects the ones with the given name and then calculates badness and so
on. It really has nothing to do with memory group or cpuset; the process
is selected in the given memory group.

> > It is not really costly, since most of the time we skip an entry and do
> > not lock the task and do not calculate its badness value. No one scares
> > that 'ps ax' is costly because it has to run through all the processes.
> > 
> 
> Talk to SGI about oom killer tasklist scans for their large systems; it 
> was a prerequisite for me to provide /proc/sys/vm/oom_kill_allocating_task 
> to avoid a single scan when I made cpuset-constrained ooms go through 
> select_bad_process().

If the user specifies the name of the process, he knows what he is doing.
It is always possible to set it to null and avoid the second scan.
More on this: a loop with a check inside plus a loop without it add up
to the same as one loop with the check and the other processing.
Which means that if I change the patch to select two processes, one with
the given name and another one among the others, it will take exactly the
same time and will not introduce a second loop (modulo loop prefetch
optimisations). For example:

loop {
  if (a)
    do_something1();
}

loop {
  do_something2();
}

is equivalent to
loop {
  if (a)
    do_something1();
  do_something2();
}

Given the number of checks in that loop already, another one to
compare several (and most of the time just one) letters of the
name does not add overhead. So this does not count.
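
A minimal sketch of that single-pass variant, not the posted patch:
struct task_info and select_victim() are made-up names, and the badness
score is assumed to be already computed for every task, which the real
scan would have to pay for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct task_info {		/* hypothetical stand-in for a task */
	const char *comm;	/* executable name */
	unsigned long points;	/* badness score, assumed precomputed */
};

/*
 * One pass over the task list that tracks two candidates at once:
 * the worst task whose name contains `name`, and the worst task overall.
 * The named candidate wins if one exists; otherwise fall back to the
 * usual victim. Returns NULL only when the list is empty.
 */
const struct task_info *select_victim(const struct task_info *tasks,
				      size_t n, const char *name)
{
	const struct task_info *named = NULL, *any = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		const struct task_info *t = &tasks[i];

		if (!any || t->points > any->points)
			any = t;
		if (name && *name && strstr(t->comm, name) &&
		    (!named || t->points > named->points))
			named = t;
	}
	return named ? named : any;
}
```

With an empty or NULL name this degenerates to the ordinary selection,
so no second scan is needed for the fallback.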

What really counts is the fact that, while not perfect, it is so far
a working solution for the problem I described, and the existing (and
proposed) methods do not work.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 16:35 ` KOSAKI Motohiro
@ 2009-01-13 22:04   ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 22:04 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: linux-kernel, Andrew Morton, Linus Torvalds

Hi.

On Wed, Jan 14, 2009 at 01:35:36AM +0900, KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> > +	/*
> > +	 * We did not find the process with requested string in its name,
> > +	 * so lets search for the usual victim.
> > +	 */
> > +	if (name && !chosen) {
> > +		name = NULL;
> > +		goto again;
> > +	}
> > +
> >  	return chosen;
> 
> this patch makes oom handling slower.
> slow bad process selection cause next another out of memory.
> 
> then, your trouble become large.
 
It really does not. As I described, when the task does not have a valid
name, all checks are skipped.

So effectively it equals an additional check in the loop, whose cost,
although non-zero, is really very small, since most of the time the
process does not match and only a single letter of the name has to be
checked.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [why oom_adj does not work] Re: Linux killed Kenny, bastard!
  2009-01-13 21:39                     ` David Rientjes
@ 2009-01-13 22:05                       ` Evgeniy Polyakov
  2009-01-14 16:12                       ` OOM documentation update [was: Linux killed Kenny, bastard!] Evgeniy Polyakov
  1 sibling, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 22:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Alan Cox, Dave Jones, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, Jan 13, 2009 at 01:39:01PM -0800, David Rientjes (rientjes@google.com) wrote:
> > I really did not investigate why it happend, but oom'ed machine had
> > killed cgi daemons and parent process itself. And ssh to the heap.
> > While it should be enough just to kill appropriate daemon. Apparently
> > things are not that shine as should be.
> > 
> 
> As previously mentioned, you have all the diagnostic tools at your 
> disposal already:
> 
> 	echo 1 > /proc/sys/vm/oom_dump_tasks
> 
> The badness scoring is straight-forward given that information, so you can 
> diagnose why a specific task was not killed and another was chosen.  You 
> can also use that information to appropriately tune the oom_adj scores to 
> identify your oom killer target preferences.

There is no ssh there, so I can not do any diagnostics. I first have to
change the oom score for ssh, but that's a different story.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 21:46                               ` Evgeniy Polyakov
@ 2009-01-13 22:49                                 ` Theodore Tso
  2009-01-13 23:02                                   ` Evgeniy Polyakov
  2009-01-13 23:10                                 ` David Rientjes
  1 sibling, 1 reply; 71+ messages in thread
From: Theodore Tso @ 2009-01-13 22:49 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Wed, Jan 14, 2009 at 12:46:27AM +0300, Evgeniy Polyakov wrote:
> User does not work with the some magically calculated scores, he just
> starts the processes and knows only their names. User can specify pid,
> but in the case of short-living connections it is not possible. Changing
> parent oom score opens a huge possibility to kill it, while in case of
> some application server (or database) it should never be killed, and
> only some of its clients (which work for the users and not for the
> calculating backend for example) have to be killed.

The standard way this gets handled for resource limits is very simple:

1)  parent forks the child process
2)  in the child process we set up resource limits, adjust oom
3)  exec the child's program.

As Alan has already pointed out to you:

   (echo XXXX > /proc/self/oom_adj ; exec /usr/bin/program)

There are two problems; one is whether or not the OOM protection is
inherited or not, and how one sets OOM protection --- and I think you
will find a huge resistance to using names as a way of expressing
policy.

The second problem is that oom_adj scoring is a heuristic which is
hard for system administrators to understand --- and these are
separable problems.  Don't try to conflate them, and try using the
fact that a random score echo'ed into /proc/pid/oom_adj is hard to
tune as a justification for using process executable names.

If you want to argue that using containers is too hard, and there ought
to be a simpler tuning parameter where (for the sake of argument) all
processes are given a number from 0 to 10, where 5 is the default, and
higher numbers will be picked unconditionally over lower numbers, and
the existing OOM score is used to distinguish between two processes with
the same OOM protection, that's fine.

How we set that OOM protection class, whether it is via setrlimit() or
echoing into a magic /proc/pid/oom_protection file, and whether it
inherits across fork and exec calls, are a separate question.
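
For the sake of argument, the ordering rule for such a protection class
could be sketched as below. Nothing like this exists in the tree; struct
candidate, oom_class and kill_before() are invented names.

```c
#include <assert.h>
#include <stddef.h>

struct candidate {		/* hypothetical, illustration only */
	int oom_class;		/* 0..10, default 5; higher is picked first */
	unsigned long points;	/* existing OOM score, used as tie-breaker */
};

/* Returns nonzero if a should be picked in preference to b. */
int kill_before(const struct candidate *a, const struct candidate *b)
{
	assert(a != NULL && b != NULL);

	/* The class decides unconditionally when it differs. */
	if (a->oom_class != b->oom_class)
		return a->oom_class > b->oom_class;
	/* Same class: fall back to the existing score. */
	return a->points > b->points;
}
```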

	 	     	      	     	   - Ted

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 22:49                                 ` Theodore Tso
@ 2009-01-13 23:02                                   ` Evgeniy Polyakov
  2009-01-14  1:11                                     ` Theodore Tso
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 23:02 UTC (permalink / raw)
  To: Theodore Tso, David Rientjes, Alan Cox, linux-kernel,
	Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 05:49:41PM -0500, Theodore Tso (tytso@mit.edu) wrote:
> > User does not work with the some magically calculated scores, he just
> > starts the processes and knows only their names. User can specify pid,
> > but in the case of short-living connections it is not possible. Changing
> > parent oom score opens a huge possibility to kill it, while in case of
> > some application server (or database) it should never be killed, and
> > only some of its clients (which work for the users and not for the
> > calculating backend for example) have to be killed.
> 
> The standard way this gets handled for resource limits is very simple:
> 
> 1)  parent forks the child process
> 2)  in the child process we set up resource limits, adjust oom
> 3)  exec the child's program.
> 
> As Alan has already pointed out to you:
> 
>    (echo XXXX > /proc/self/oom_adj ; exec /usr/bin/program)

Yes, I saw that in the archive, but did not receive it myself, so did not
answer. This works in the above simple case, but if we dig a little bit
into the case where there are children, the parent has to live and not
all children should be considered equal by the oom-killer, things change
dramatically. And we can not change the sources. Well, in my particular
case we can, but it is not about a single system :)

> There are two problems; one is whether or not the OOM protection is
> inherited or not, and how one sets OOM protection --- and I think you
> will find a huge resistance to using names as a way of expressing
> policy.

Yup, this whole thread shows this resistance quite well :)

> The second problem is that oom_adj scoring is a heuristic which is
> hard for system administrators to understand --- and these are
> separable problems.  Don't try to conflate them, and try using the
> fact that a random score echo'ed into /proc/pid/oom_adj is hard to
> tune as a justification for using process executable names.

I tried, and I do agree that it can be used to turn the oom-killer on or
off, but not for tuning. But even this does not really work in the case
shown, where we can not change the application and the main goal is to
save the parent and kill only some subset of the short-living children.
So we can not really adjust the parent's oom score and get the same in
the children, since this will put the parent and the important children
at risk.

> If you want to argue that using containers is too hard, and there ought
> to be a simpler tuning parameter where (for the sake of argument) all
> processes are given a number from 0 to 10, where 5 is the default, and
> higher numbers will be picked unconditionally over lower numbers, and
> the existing OOM score is used to distinguish between two processes with
> the same OOM protection, that's fine.

> How we set that OOM protection class, whether it is via setrlimit() or
> echoing into a magic /proc/pid/oom_protection file, and whether it
> inherits across fork and exec calls, are a separate question.

Let's put containers out of the picture. While they may or may not work,
they are definitely not an issue in the given systems. Having simpler
tunables would be great, but we can not change them, since it is an
already existing ABI; the documentation could be extended though. I can
cook up a patch tomorrow if no one else does.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 21:46                               ` Evgeniy Polyakov
  2009-01-13 22:49                                 ` Theodore Tso
@ 2009-01-13 23:10                                 ` David Rientjes
  2009-01-13 23:35                                   ` Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-13 23:10 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> > Your patch simply allows users to specify a task by name that will always 
> > be killed first when the oom killer is invoked.  That's terribly 
> > insufficient if another task uses an excessive amount of memory that you 
> > didn't expect; a rogue task may be leaking memory and the task you've 
> > identified by name with your patch is repeatedly forked and killed when 
> > the rogue task goes untouched.
> 
> It is up to user to decide, exactly the same will happen if you tune the
> oom_adj for the task.
> 

It is up to the user to decide, using oom_adj scores as influence, how to 
define when a task should be selected by the oom killer.  Your 
name-parsing hack can do that for a single global task, but oom_adj scores 
are actually much more powerful.

> No, you can not. Did you try that? The only sane way to use oom_adj is
> to disable oom killer for the task or make its score very small or very
> big, there is really no way to make a finegrained tuning, since score
> changes and userspace does not know the algorithm.
> 

We finely tune oom_adj scores so that we get the desired results, yes.  
What you're complaining about here is purely a documentation issue.

> > Additionally, your patch completely breaks cpuset oom killing since 
> > candidacy is determined in badness() because a task may have allocated 
> > non-migrated memory elsewhere before being moved to a different cpuset.  
> > Your oom_victim_name task may exist globally, but will always be 
> > identified for oom kill even when the oom exists exclusively in a disjoint 
> > cpuset.  That does _not_ lead to future memory freeing that current can 
> > use, and if the parent of the killed task decides to immediately fork 
> > another instance, this cpuset will be completely livelocked. 
> 
> Please check the patch first. It selects process according to the
> badness, check memory group first and fallbacks to scan other processes
> if process with the given name was not found or name is null.
> 

Again, your patch _completely_ breaks cpuset oom killing.  That is a 
completely separate issue than the memory controller, and it's 
disappointing you still don't see it.

In a cpuset constrained oom condition, we do not explicitly exclude all 
tasks that are in a disjoint, exclusive cpuset since it's quite possible 
that a task has allocated memory outside its cpuset (either because its 
cpuset assignment has changed or because its cpuset's mems has changed) 
and killing it would free memory in current's cpuset.  We do, however, 
prefer to kill a task within the same cpuset; that preference is 
implemented in the badness() scoring.

If a task exists on the system in a disjoint, exclusive cpuset that 
matches oom_victim_name, your patch will cause it to be killed even though 
badness() has penalized it for not sharing a cpuset (dividing its score by 
eight).  That probably needlessly killed oom_victim_name since it won't 
allow for future memory freeing in the oom-triggering cpuset and the 
original oom condition persists.

Now if the parent of that task or another system task forks 
oom_victim_name again, the same thing will happen on the next iteration of 
the oom killer.  This will not free any memory in current's cpuset and it 
will effectively be livelocked.
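
To illustrate the preference described above, here is a small sketch of
score-based selection with the cpuset penalty. The struct and function
names are made up, and only the divide-by-eight step is taken from the
description; the rest of badness() is left out.

```c
#include <assert.h>
#include <stddef.h>

struct scored_task {		/* hypothetical, illustration only */
	unsigned long raw_points;
	int shares_cpuset;	/* nonzero if mems overlap with current's */
};

/* Tasks in a disjoint cpuset stay candidates, but at 1/8 of their score. */
unsigned long cpuset_points(const struct scored_task *t)
{
	assert(t != NULL);
	return t->shares_cpuset ? t->raw_points : t->raw_points / 8;
}

/* Index of the task with the highest penalised score; n must be > 0. */
size_t worst_by_score(const struct scored_task *tasks, size_t n)
{
	size_t i, worst = 0;

	assert(tasks != NULL && n > 0);
	for (i = 1; i < n; i++)
		if (cpuset_points(&tasks[i]) > cpuset_points(&tasks[worst]))
			worst = i;
	return worst;
}
```

A task outside the cpuset is only picked when it is large enough to
outweigh the penalty; a selection that looks at the name first never
gets to make that comparison.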

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: Linux killed Kenny, bastard!
  2009-01-13 11:54                       ` Evgeniy Polyakov
  2009-01-13 12:15                         ` Alan Cox
  2009-01-13 19:15                         ` David Rientjes
@ 2009-01-13 23:26                         ` Valdis.Kletnieks
  2009-01-13 23:36                           ` Evgeniy Polyakov
  2 siblings, 1 reply; 71+ messages in thread
From: Valdis.Kletnieks @ 2009-01-13 23:26 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Bill Davidsen, Alan Cox, linux-kernel,
	Andrew Morton, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 549 bytes --]

On Tue, 13 Jan 2009 14:54:08 +0300, Evgeniy Polyakov said:

> Who should adjust the scores for newly created processes? Who should
> check that processes in the first group have negative oom ajustment and
> in the second group a positive value? Who determines when its time to
> ajust the scores?

Are you saying that you, as the box's administrator, don't know the answers
to those questions?

Or are you asking what the actual method of implementation is - which process
is supposed to write to (possibly some other process) oom_adjust, and when?

* Re: Linux killed Kenny, bastard!
  2009-01-13 23:10                                 ` David Rientjes
@ 2009-01-13 23:35                                   ` Evgeniy Polyakov
  2009-01-13 23:43                                     ` David Rientjes
  2009-01-14  4:23                                     ` Valdis.Kletnieks
  0 siblings, 2 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 23:35 UTC (permalink / raw)
  To: David Rientjes; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 03:10:50PM -0800, David Rientjes (rientjes@google.com) wrote:
> On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:
> 
> > > Your patch simply allows users to specify a task by name that will always 
> > > be killed first when the oom killer is invoked.  That's terribly 
> > > insufficient if another task uses an excessive amount of memory that you 
> > > didn't expect; a rogue task may be leaking memory and the task you've 
> > > identified by name with your patch is repeatedly forked and killed when 
> > > the rogue task goes untouched.
> > 
> > It is up to user to decide, exactly the same will happen if you tune the
> > oom_adj for the task.
> > 
> 
> It is up to the user to decide, using oom_adj scores as influence, how to 
> define when a task should be selected by the oom killer.  Your 
> name-parsing hack can do that for a single global task, but oom_adj scores 
> are actually much more powerful.
> 
> > No, you can not. Did you try that? The only sane way to use oom_adj is
> > to disable oom killer for the task or make its score very small or very
> > big, there is really no way to make a finegrained tuning, since score
> > changes and userspace does not know the algorithm.
> > 
> 
> We finely tune oom_adj scores so that we get the desired results, yes.  
> What you're complaining about here is purely a documentation issue.

Which does not work. Even besides documentation issue, which really means
that no one really tried to work with it :)

> > > Additionally, your patch completely breaks cpuset oom killing since 
> > > candidacy is determined in badness() because a task may have allocated 
> > > non-migrated memory elsewhere before being moved to a different cpuset.  
> > > Your oom_victim_name task may exist globally, but will always be 
> > > identified for oom kill even when the oom exists exclusively in a disjoint 
> > > cpuset.  That does _not_ lead to future memory freeing that current can 
> > > use, and if the parent of the killed task decides to immediately fork 
> > > another instance, this cpuset will be completely livelocked. 
> > 
> > Please check the patch first. It selects process according to the
> > badness, check memory group first and fallbacks to scan other processes
> > if process with the given name was not found or name is null.
> > 
> 
> Again, your patch _completely_ breaks cpuset oom killing.  That is a 
> completely separate issue than the memory controller, and it's 
> disappointing you still don't see it.
> 
> In a cpuset constrained oom condition, we do not explicitly exclude all 
> tasks that are in a disjoint, exclusive cpuset since it's quite possible 
> that a task has allocated memory outside its cpuset (either because its 
> cpuset assignment has changed or because its cpuset's mems has changed) 
> and killing it would free memory in current's cpuset.  We do, however, 
> prefer to kill a task within the same cpuset; that preference is 
> implemented in the badness() scoring.
> 
> If a task exists on the system in a disjoint, exclusive cpuset that 
> matches oom_victim_name, your patch will cause it to be killed even though 
> badness() has penalized it for not sharing a cpuset (dividing its score by 
> eight).  That probably needlessly killed oom_victim_name since it won't 
> allow for future memory freeing in the oom-triggering cpuset and the 
> original oom condition persists.

It is exactly the purpose of the patch: to kill what is requested to be
killed.

I wonder how do you expect users to guess via libastral that even
adjusted score does not work, since it happens that task is so special,
that it can not be killed :)

> Now if the parent of that task or another system task forks 
> oom_victim_name again, the same thing will happen on the next iteration of 
> the oom killer.  This will not free any memory in current's cpuset and it 
> will effectively be livelocked.

My knowledge about cpusets is somewhat between zero and void, even more
I opened mm/oom_kill.c the first time when created a patch (oom-killer is
not that interesting actually, but it is a matter of taste of course).

I can create exactly the reverse situation when task is supposed to be
killed, but because of the cpuset/group/whatever else you pointed above,
its score will be decreased and rogue task will continue to live :)
This game can be played by both, but let's leave that for others.

The purpose of the patch is to create an ability to kill what is
needed by the user. Exactly what user wants. User can kill the system by
millions of ways, and we allow him to do so, but we do not really allow
him what to kill when system breaks into oom condition. With my patch it
is possible. And it is exactly what is expected by the user who does not
know anything else except the names of the applications he starts.
And please let's not start again with oom_adj if arguments are still the
same: it does not work in the case I showed.

Peace? :)

Please provide a way to fix the problem I described without my patch,
and everything will be immediately resolved :)

-- 
	Evgeniy Polyakov

* Re: Linux killed Kenny, bastard!
  2009-01-13 23:26                         ` Valdis.Kletnieks
@ 2009-01-13 23:36                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 23:36 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: David Rientjes, Bill Davidsen, Alan Cox, linux-kernel,
	Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 06:26:05PM -0500, Valdis.Kletnieks@vt.edu (Valdis.Kletnieks@vt.edu) wrote:
> On Tue, 13 Jan 2009 14:54:08 +0300, Evgeniy Polyakov said:
> 
> > Who should adjust the scores for newly created processes? Who should
> > check that processes in the first group have negative oom adjustment and
> > in the second group a positive value? Who determines when it's time to
> > adjust the scores?
> 
> Are you saying that you, as the box's administrator, don't know the answers
> to those questions?
> 
> Or are you asking what the actual method of implementation is - which process
> is supposed to write to (possibly some other process) oom_adj, and when?

Unfortunately I do not know the answers. Since all parameters are highly
dynamic and can not be always automatically tuned. And we do not have a
daemon to watch created processes. And in the general case we can not
change the application.

-- 
	Evgeniy Polyakov

* Re: Linux killed Kenny, bastard!
  2009-01-13 23:35                                   ` Evgeniy Polyakov
@ 2009-01-13 23:43                                     ` David Rientjes
  2009-01-13 23:55                                       ` Evgeniy Polyakov
  2009-01-14  4:23                                     ` Valdis.Kletnieks
  1 sibling, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-13 23:43 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> Which does not work. Even besides documentation issue, which really means
> that no one really tried to work with it :)
> 

Please.  A lack of thorough documentation, while it should be fixed, does 
not imply that a feature is not being used.

> > Again, your patch _completely_ breaks cpuset oom killing.  That is a 
> > completely separate issue than the memory controller, and it's 
> > disappointing you still don't see it.
> > 
> > In a cpuset constrained oom condition, we do not explicitly exclude all 
> > tasks that are in a disjoint, exclusive cpuset since it's quite possible 
> > that a task has allocated memory outside its cpuset (either because its 
> > cpuset assignment has changed or because its cpuset's mems has changed) 
> > and killing it would free memory in current's cpuset.  We do, however, 
> > prefer to kill a task within the same cpuset; that preference is 
> > implemented in the badness() scoring.
> > 
> > If a task exists on the system in a disjoint, exclusive cpuset that 
> > matches oom_victim_name, your patch will cause it to be killed even though 
> > badness() has penalized it for not sharing a cpuset (dividing its score by 
> > eight).  That probably needlessly killed oom_victim_name since it won't 
> > allow for future memory freeing in the oom-triggering cpuset and the 
> > original oom condition persists.
> 
> It is exactly the purpose of the patch: to kill what is requested to be
> killed.
> 

There are global system-wide oom conditions, cpuset-constrained oom 
conditions, memory controller oom conditions, and mempolicy oom 
conditions.  Your patch affects them all, yet it is quite possible that 
killing oom_victim_name will not alleviate the oom condition in a disjoint 
cpuset.  It would have been needlessly killed because you make no 
distinction on the constraint of the oom.

> I wonder how do you expect users to guess via libastral that even
> adjusted score does not work, since it happens that task is so special,
> that it can not be killed :)
> 
> My knowledge about cpusets is somewhat between zero and void, even more
> I opened mm/oom_kill.c the first time when created a patch (oom-killer is
> not that interesting actually, but it is a matter of taste of course).
> 

Being ignorant about cpusets doesn't justify you breaking their oom 
handling.

* Re: Linux killed Kenny, bastard!
  2009-01-13 23:43                                     ` David Rientjes
@ 2009-01-13 23:55                                       ` Evgeniy Polyakov
  2009-01-14  0:32                                         ` David Rientjes
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-13 23:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 03:43:56PM -0800, David Rientjes (rientjes@google.com) wrote:
> > Which does not work. Even besides documentation issue, which really means
> > that no one really tried to work with it :)
> 
> Please.  A lack of thorough documentation, while it should be fixed, does 
> not imply that a feature is not being used.

Out of curiosity, how can a feature be used if no one except hardcore
kernel hackers knows how to work with it? I do not mean to insult, I'm really
curious. This may explain why the admins I worked with on this issue did
not fully succeed with tuning.

> > It is exactly the purpose of the patch: to kill what is requested to be
> > killed.
> > 
> 
> There are global system-wide oom conditions, cpuset-constrained oom 
> conditions, memory controller oom conditions, and mempolicy oom 
> conditions.  Your patch affects them all, yet it is quite possible that 
> killing oom_victim_name will not alleviate the oom condition in a disjoint 
> cpuset.  It would have been needlessly killed because you make no 
> distinction on the constraint of the oom.

Still it is possible to start a fork-bomb and kill the machine in some
cases, but we allow this. And also allow to limit amount of the
processes started by the user.

This is the same: we have several ways to solve oom-killer problem. Some
of them work in some cases, some in other. Proposed patch is another way
to deal with the problem. And in some cases it may be wrong. But if user
specified that behaviour, he knows what he is doing. Especially when
there is no way to properly implement the solution using existing
methods.

> > I wonder how do you expect users to guess via libastral that even
> > adjusted score does not work, since it happens that task is so special,
> > that it can not be killed :)
> > 
> > My knowledge about cpusets is somewhat between zero and void, even more
> > I opened mm/oom_kill.c the first time when created a patch (oom-killer is
> > not that interesting actually, but it is a matter of taste of course).
> > 
> 
> Being ignorant about cpusets doesn't justify you breaking their oom 
> handling.

I did not break cpuset oom-handling, I provided a way to implement it
differently to solve the problem. Yes, this may have side effects, if
people care, they will not use the feature and leave victim name as NULL
(although allowing Kenny to live breaks the absolute fundamentals).
Those people who do need this functionality will work with it.

-- 
	Evgeniy Polyakov

* Re: Linux killed Kenny, bastard!
  2009-01-13 13:35                               ` Evgeniy Polyakov
@ 2009-01-14  0:24                                 ` Bill Davidsen
  2009-01-14  0:35                                   ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Bill Davidsen @ 2009-01-14  0:24 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Theodore Tso, Alan Cox, David Rientjes, linux-kernel,
	Andrew Morton, Linus Torvalds

Evgeniy Polyakov wrote:
> So effectively oom_adj only works as enable/disable switch, and since no
> one knows how to tune it, it is better to do not touch at all. And get
> ssh killed. I believe if it is ever used then only to disable oom at
> all, which is wrong, since task still may be killed but after some
> others. My patch adds a simple priority for that based on the name of
> the process, which are known to the administrators who maintain given
> system.
>
>   
If nothing else, this would seem to reduce the number of processes for 
which the OOM coefficient of evil must be calculated.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck 



* Re: Linux killed Kenny, bastard!
  2009-01-13 23:55                                       ` Evgeniy Polyakov
@ 2009-01-14  0:32                                         ` David Rientjes
  2009-01-14  0:53                                           ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: David Rientjes @ 2009-01-14  0:32 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Wed, 14 Jan 2009, Evgeniy Polyakov wrote:

> Out of curiosity, how can a feature be used if no one except hardcore
> kernel hackers knows how to work with it? I do not mean to insult, I'm really
> curious. This may explain why the admins I worked with on this issue did
> not fully succeed with tuning.
> 

You read the code.

It's always great to improve the documentation of a kernel feature, and I 
agree that it certainly applies in this case.

I think you could also improve how the badness() scoring is implemented to 
make it easier to predict from userspace.  I doubt you would find much 
opposition to improving the heuristic; we cannot, however, change 
/proc/pid/oom_adj since it already has users who depend on it.

> > Being ignorant about cpusets doesn't justify you breaking their oom 
> > handling.
> 
> I did not break cpuset oom-handling, I provided a way to implement it
> differently to solve the problem. Yes, this may have side effects, if
> people care, they will not use the feature and leave victim name as NULL
> (although allowing Kenny to live breaks the absolute fundamentals).
> Those people who do need this functionality will work with it.
> 

You're treating each oom constraint as if they were all the same; in a 
cpuset-constrained oom, which can be much more common than system-wide 
unconstrained ooms, we want to target a task that will allow for future 
memory freeing in that cpuset.

So in these cases, to avoid needlessly killing your victim, you would be 
forced to set oom_victim_name to NULL.  That's hardly useful if the same 
problem you're trying to fix still exists both globally and within a 
cpuset.  Your patch doesn't address this use case, so it's already 
incomplete.

In a mempolicy-constrained oom as the result of MPOL_BIND, which can also 
be much more common than system-wide unconstrained ooms, we want to target 
current because it has allocations from the bound nodes.  Your patch 
doesn't touch this path, so it's already inconsistent.

I'm comfortable that this patch will not be merged, so I'll silently point 
to past posts for the duration of this thread.  I definitely think the 
documentation can be improved and I don't think you'll have any opposition 
to sane heuristic changes that also rely on userspace input via 
/proc/pid/oom_adj.  Thank you for working on this!

* Re: Linux killed Kenny, bastard!
  2009-01-14  0:24                                 ` Bill Davidsen
@ 2009-01-14  0:35                                   ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14  0:35 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Theodore Tso, Alan Cox, David Rientjes, linux-kernel,
	Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 07:24:35PM -0500, Bill Davidsen (davidsen@tmr.com) wrote:
> >So effectively oom_adj only works as enable/disable switch, and since no
> >one knows how to tune it, it is better to do not touch at all. And get
> >ssh killed. I believe if it is ever used then only to disable oom at
> >all, which is wrong, since task still may be killed but after some
> >others. My patch adds a simple priority for that based on the name of
> >the process, which are known to the administrators who maintain given
> >system.
> >  
> If nothing else, this would seem to reduce the number of processes for 
> which the OOM coefficient of evil must be calculated.

Yes, it allows this.

-- 
	Evgeniy Polyakov

* Re: Linux killed Kenny, bastard!
  2009-01-14  0:32                                         ` David Rientjes
@ 2009-01-14  0:53                                           ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14  0:53 UTC (permalink / raw)
  To: David Rientjes; +Cc: Alan Cox, linux-kernel, Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 04:32:41PM -0800, David Rientjes (rientjes@google.com) wrote:
> > Out of curiosity, how can a feature be used if no one except hardcore
> > kernel hackers knows how to work with it? I do not mean to insult, I'm really
> > curious. This may explain why the admins I worked with on this issue did
> > not fully succeed with tuning.
> 
> You read the code.

Not the best solution actually and at least not the simplest :)

> You're treating each oom constraint as if they were all the same; in a 
> cpuset-constrained oom, which can be much more common than system-wide 
> unconstrained ooms, we want to target a task that will allow for future 
> memory freeing in that cpuset.

I do not break the way oom problem is addressed currently. I just
extend it from the different angle, which can not be resolved in some
cases. Those who do not need the feature, can safely disable the check
and rely on the old algorithms.

What you are talking here is a different problem. Completely different
case. We may have the problem case you described, and we will think on
how to resolve the issue, but approach used in the patch does not
enforce the new policy, it extends it adding new global tunable.

> So in these cases, to avoid needlessly killing your victim, you would be 
> forced to set oom_victim_name to NULL.  That's hardly useful if the same 
> problem you're trying to fix still exists both globally and within a 
> cpuset.  Your patch doesn't address this use case, so it's already 
> incomplete.

Incorrect point of view. If administrator wants to select victim task by
name, this patch allows this. If he does not want this, it is turned
off. There are perfectly split areas where each approach applies: global
name-based selection and more specific to the areas where problem
arises.

> In a mempolicy-constrained oom as the result of MPOL_BIND, which can also 
> be much more common than system-wide unconstrained ooms, we want to target 
> current because it has allocations from the bound nodes.  Your patch 
> doesn't touch this path, so it's already inconsistent.

And again wrong conclusion: patch is intended to work in the area it was
created for. It is the simplest (and the only btw) solution for the showed
problem. In the systems where it is not needed, it will not be used and
old algorithms will work fine, apparently since no one proposed it
before, other areas work ok without it. While the problem (quite common
actually) I showed was not addressed at all and all proposed solutions
just failed if we start checking requirements more precisely.

> I'm comfortable that this patch will not be merged, so I'll silently point 
> to past posts for the duration of this thread.  I definitely think the 
> documentation can be improved and I don't think you'll have any opposition 
> to sane heuristic changes that also rely on userspace input via 
> /proc/pid/oom_adj.  Thank you for working on this!

So you agree that no existing solution can solve the oom problem when
parent task has to stay and children have to be differentiated. And
agree that not merging the solution for this commonly happened problem
is the right way. And while we can reread the whole thread multiple
times, we will find again and again that proposed approaches do not
work. This patch does its simple task without breaking others and in the
systems where this feature is not needed, it can be safely turned off,
while still fixing the problem for those who care.

I will update documentation tomorrow if there will be no patches.

-- 
	Evgeniy Polyakov

* Re: Linux killed Kenny, bastard!
  2009-01-13 23:02                                   ` Evgeniy Polyakov
@ 2009-01-14  1:11                                     ` Theodore Tso
  2009-01-14  1:20                                       ` Evgeniy Polyakov
  0 siblings, 1 reply; 71+ messages in thread
From: Theodore Tso @ 2009-01-14  1:11 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Wed, Jan 14, 2009 at 02:02:40AM +0300, Evgeniy Polyakov wrote:
> > As Alan has already pointed out to you:
> > 
> >    (echo XXXX > /proc/self/oom_adj ; exec /usr/bin/program)
> 
> Yes, I saw that in archive, but did not receive myself, so did not
> answer. This works in the above simple case, but if we dig a little bit
> into the case when there are children, parent has to live and not all
> children should be considered equal by the oom-killer, things change
> dramatically. And we can not change the sources. Well, in particular my
> case we can, but it is not about the single system :)
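
The one-liner quoted above could also be done from a small C launcher instead
of a shell, roughly as sketched here. write_oom_adj() is a hypothetical helper,
not an existing API; the path is a parameter so the function can be pointed at
/proc/self/oom_adj or tried against a scratch file, and the -17..15 range
check follows the range given in Documentation/filesystems/proc.txt.

```c
#include <stdio.h>

/*
 * Write an oom_adj value to the file at path.
 * Returns 0 on success and -1 on a bad value or an I/O error.
 */
static int write_oom_adj(const char *path, int value)
{
	FILE *f;
	int ok;

	/* -17 disables oom-killing for the task, -16..15 scale the score. */
	if (value < -17 || value > 15)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", value) > 0;
	/* Buffered write errors only surface when the stream is closed. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A launcher would call write_oom_adj("/proc/self/oom_adj", value) and then
exec the real program, which is what the subshell in the one-liner does.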

I think you will find that most people are far more interested in
making sure we define consistent, usable interfaces --- and depending
on process names is a complete and total hack.  Justifying it by
claiming that we won't be able to change application source code, so
we have to use a hack, isn't going to get you very far.

The security implications alone are troubling; OK, so we make the
process name "sshd" privileged and exempt from the OOM killer.  What
happens if a user creates a program called sshd in their home
directory and executes it --- gee, it's protected from the OOM killer
as well.  It's just not going to fly.  Give up now.

If your argument is "we have to protect crappy closed source
applications where their programmers can't be bothered to change their
source code to use a proper interface", you're just going to get
laughed out of the room.

       		   	       	     	- Ted

* Re: Linux killed Kenny, bastard!
  2009-01-14  1:11                                     ` Theodore Tso
@ 2009-01-14  1:20                                       ` Evgeniy Polyakov
  2009-01-14  4:06                                         ` Theodore Tso
  0 siblings, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14  1:20 UTC (permalink / raw)
  To: Theodore Tso, David Rientjes, Alan Cox, linux-kernel,
	Andrew Morton, Linus Torvalds

On Tue, Jan 13, 2009 at 08:11:38PM -0500, Theodore Tso (tytso@mit.edu) wrote:
> I think you will find that most people are far more interested in
> making sure we define consistent, usable interfaces --- and depending
> on process names is a complete and total hack.  Justifying it by
> claiming that we won't be able to change application source code, so
> we have to use a hack, isn't going to get you very far.

It is not about the possibility to change the sources, but the way
interface is exported to the userspace. Right now it is not usable for
some cases. And forcing applications, which are actually cross-platform,
depending on the way linux controls its own oom-killer is noticeably more
hackish than selecting a system-wide process by its name.

> The security implications alone are troubling; OK, so we make the
> process name "sshd" privileged and exempt from the OOM killer.  What
> happens if a user creates a program called sshd in their home
> directory and executes it --- gee, it's protected from the OOM killer
> as well.  It's just not going to fly.  Give up now.

It is not about who is protected, but who will be selected to be killed.
If you have a rogue application which happened to have the right name,
everything is ok, otherwise it should be tuned further. And even in that
case nothing harmful will happen, since other processes will be
killed first (since admin selected the name on purpose to kill
potentially damaging applications).

> If your argument is "we have to protect crappy closed source
> applications where their programmers can't be bothered to change their
> source code to use a proper interface", you're just going to get
> laughed out of the room.

You believe that changing apache to control oom_adj is the right way to
deal with linux oom-killer? Do we already flight to the moon?

-- 
	Evgeniy Polyakov

* Re: Linux killed Kenny, bastard!
  2009-01-14  1:20                                       ` Evgeniy Polyakov
@ 2009-01-14  4:06                                         ` Theodore Tso
  0 siblings, 0 replies; 71+ messages in thread
From: Theodore Tso @ 2009-01-14  4:06 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Wed, Jan 14, 2009 at 04:20:01AM +0300, Evgeniy Polyakov wrote:
> 
> It is not about the possibility to change the sources, but the way
> interface is exported to the userspace. Right now it is not usable for
> some cases. And forcing applications, which are actually cross-platform,
> depending on the way linux controls its own oom-killer is noticeably more
> hackish than selecting a system-wide process by its name.

And we can change that interface if it's not the right one, or perhaps
extend it.  After all, you are proposing extending that interface;
just in a really horrible, hackish way.

> You believe that changing apache to control oom_adj is the right way to
> deal with linux oom-killer? Do we already flight to the moon?

Actually, I would believe the right answer is adding a new resource
limit which can be set using the standard getrlimit()/setrlimit()
interface, and then have apache use this standard interface as a way
of configuring itself --- much like how a process can change other
resource limits, such as the number of file descriptors it has open,
etc.  And if what you want to do is simply make the process volunteer
to be one of the first processes shot by the OOM killer, the apache
process wouldn't even need setuid privileges to lower the "OOM
protection" resource limit.

I think this is cleaner than echoing a magic value to
/proc/self/oom_adj, and it won't be the first time various open source
programs have been changed to take advantage of Linux-specific
interfaces, especially if there's no other standard way of doing
things --- and an extension of BSD's getrlimit()/setrlimit() is
natural and makes a lot of sense.  Heck, apache has been changed to
take advantage of Linux's epoll interface....
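
For reference, this is roughly what the existing interface looks like from a
program that voluntarily lowers one of its own soft limits. No OOM-related
resource limit exists in the kernel under discussion; a new RLIMIT_* constant
for it is only the proposal above, so this sketch takes the resource as a
parameter and can be exercised with real ones such as RLIMIT_NOFILE or
RLIMIT_CORE. lower_soft_limit() is an illustrative name, not a library call.

```c
#include <sys/resource.h>

/*
 * Lower the soft limit of resource to at most new_soft. The limit is
 * never raised, so no privilege is needed. Returns 0 on success and
 * -1 if getrlimit() or setrlimit() fails.
 */
static int lower_soft_limit(int resource, rlim_t new_soft)
{
	struct rlimit rl;

	if (getrlimit(resource, &rl) != 0)
		return -1;

	/* Already at or below the requested value: nothing to do. */
	if (rl.rlim_cur != RLIM_INFINITY && rl.rlim_cur <= new_soft)
		return 0;

	rl.rlim_cur = new_soft;	/* the hard limit rl.rlim_max stays untouched */
	return setrlimit(resource, &rl);
}
```

A daemon would make such a call once at startup, for example
lower_soft_limit(RLIMIT_NOFILE, 1024), driven by its own configuration file.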

						- Ted

* Re: Linux killed Kenny, bastard!
  2009-01-13 23:35                                   ` Evgeniy Polyakov
  2009-01-13 23:43                                     ` David Rientjes
@ 2009-01-14  4:23                                     ` Valdis.Kletnieks
  2009-01-14  9:07                                       ` Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: Valdis.Kletnieks @ 2009-01-14  4:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Rientjes, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Wed, 14 Jan 2009 02:35:02 +0300, Evgeniy Polyakov said:

> It is exactly the purpose of the patch: to kill what is requested to be
> killed.
> 
> I wonder how do you expect users to guess via libastral that even
> adjusted score does not work, since it happens that task is so special,
> that it can not be killed :)

What does your patch do if one user has a process 'foo' that they're willing
to have die first, and a process 'bar' that absolutely can't be killed..

Meanwhile, another user has a 'must die first' process 'bar', and a 'must not
die' process 'foo'.

Methinks your patch needs libastral as well?

* Re: Linux killed Kenny, bastard!
  2009-01-14  4:23                                     ` Valdis.Kletnieks
@ 2009-01-14  9:07                                       ` Evgeniy Polyakov
  0 siblings, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14  9:07 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: David Rientjes, Alan Cox, linux-kernel, Andrew Morton,
	Linus Torvalds

On Tue, Jan 13, 2009 at 11:23:57PM -0500, Valdis.Kletnieks@vt.edu (Valdis.Kletnieks@vt.edu) wrote:
> > I wonder how do you expect users to guess via libastral that even
> > adjusted score does not work, since it happens that task is so special,
> > that it can not be killed :)
> 
> What does your patch do if one user has a process 'foo' that they're willing
> to have die first, and a process 'bar' that absolutely can't be killed..
> 
> Meanwhile, another user has a 'must die first' process 'bar', and a 'must not
> die' process 'foo'.
> 
> Methinks your patch needs libastral as well?

My patch needs an administrator to setup the name pattern.
Undocumented magic deeply hidden in the calculus of the badness() is quite different.

-- 
	Evgeniy Polyakov

* OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-13 21:39                     ` David Rientjes
  2009-01-13 22:05                       ` Evgeniy Polyakov
@ 2009-01-14 16:12                       ` Evgeniy Polyakov
  2009-01-14 17:06                         ` [take2] " Evgeniy Polyakov
  1 sibling, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14 16:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Rientjes, Balbir Singh, Alan Cox, Dave Jones, Andrew Morton,
	Linus Torvalds, Theodore Tso

Please apply.
While the existing interfaces do not fix the problems, they should at least be documented.

Sign.

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..f7530a1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,30 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
 
+Process to be killed at out-of-memory situation is selected among all others
+based on its badness score. This value equals to the memory size of the process
+originally and then changed according to its cpu time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is devided by the sqare root of the cpu time and then by
+the double square root of the run time.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+Following heueristics are then applied:
+ * if task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ 	or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ 	to it, its score is divided by 8
+ * resulted score is multiplied by the two in the power of oom_adj when it is
+ 	positive, and devided otherwise, i.e.
+	points <<= oom_adj when it is positive and
+	points >>= oom_adj otherwise
+
+Swapped tasks are killed first.
+Task with the biggest number of badness points is selected to be killed.
+Usually children tasks are prefered compared to their parent.
+
 2.13 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 
 

-- 
	Evgeniy Polyakov

* [take2] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-14 16:12                       ` OOM documentation update [was: Linux killed Kenny, bastard!] Evgeniy Polyakov
@ 2009-01-14 17:06                         ` Evgeniy Polyakov
  2009-01-14 21:34                           ` Randy Dunlap
  2009-01-14 21:53                           ` Bryan Donlan
  0 siblings, 2 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14 17:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Rientjes, Balbir Singh, Alan Cox, Dave Jones, Andrew Morton,
	Linus Torvalds, Theodore Tso, Matthias Andree

The updated version fixes some errors and extends the description by adding
an explanation of swapping and of the parent/child relation.

Signed.

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..4aa1918 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,33 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
 
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+originally and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the cpu time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ 	or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ 	to it, its score is divided by 8
+ * resulted score is multiplied by the two in the power of oom_adj when it is
+ 	positive, and divided otherwise, i.e.
+	points <<= oom_adj when it is positive and
+	points >>= oom_adj otherwise
+
+The task with the highest badness score is then killed.
+
 2.13 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 


-- 
	Evgeniy Polyakov

* Re: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-14 17:06                         ` [take2] " Evgeniy Polyakov
@ 2009-01-14 21:34                           ` Randy Dunlap
  2009-01-14 21:53                           ` Bryan Donlan
  1 sibling, 0 replies; 71+ messages in thread
From: Randy Dunlap @ 2009-01-14 21:34 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, David Rientjes, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree

On Wed, 14 Jan 2009 20:06:06 +0300 Evgeniy Polyakov wrote:

> The updated version fixes some errors and extends the description by adding
> an explanation of swapping and of the parent/child relation.
> 
> Signed.

??

> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index d105eb4..4aa1918 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -2311,6 +2311,33 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
>  values are in the range -16 to +15, plus the special value -17, which disables
>  oom-killing altogether for this process.
>  
> +The process to be killed in an out-of-memory situation is selected among all others
> +based on its badness score. This value equals the original memory size of the process
> +originally and is then updated according to its CPU time (utime + stime) and the

drop "originally" since earlier part of sentence says "original memory size".

> +run time (uptime - start time). The longer it runs the smaller is the score.
> +Badness score is divided by the square root of the cpu time and then by

                                                      CPU

> +the double square root of the run time.
> +
> +Swapped out tasks are killed first. Half of each child's memory size is added to
> +the parent's score if they do not share the same memory. Thus forking servers
> +are the prime candidates to be killed. Having only one 'hungry' child will make
> +parent less preferable than the child.
> +
> +/proc/<pid>/oom_score shows process' current badness score.
> +
> +The following heuristics are then applied:
> + * if the task was reniced, its score doubles
> + * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
> + 	or CAP_SYS_RAWIO) have their score divided by 4
> + * if oom condition happened in one cpuset and checked task does not belong
> + 	to it, its score is divided by 8
> + * resulted score is multiplied by the two in the power of oom_adj when it is

confusing.  Is this:    multiplied by two to the power of oom_adj when it is ... ?

> + 	positive, and divided otherwise, i.e.
> +	points <<= oom_adj when it is positive and
> +	points >>= oom_adj otherwise
> +
> +The task with the highest badness score is then killed.
> +
>  2.13 /proc/<pid>/oom_score - Display current oom-killer score
>  -------------------------------------------------------------


---
~Randy

* Re: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-14 17:06                         ` [take2] " Evgeniy Polyakov
  2009-01-14 21:34                           ` Randy Dunlap
@ 2009-01-14 21:53                           ` Bryan Donlan
  2009-01-14 22:10                             ` Evgeniy Polyakov
  2009-01-14 22:14                             ` [take3] " Evgeniy Polyakov
  1 sibling, 2 replies; 71+ messages in thread
From: Bryan Donlan @ 2009-01-14 21:53 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, David Rientjes, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree

On Wed, Jan 14, 2009 at 12:06 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:

> + * resulted score is multiplied by the two in the power of oom_adj when it is
> +       positive, and divided otherwise, i.e.
> +       points <<= oom_adj when it is positive and
> +       points >>= oom_adj otherwise

Two to the power of a negative number is equivalent to dividing by two
to the power of said exponent's absolute value, making this paragraph
more than a bit confusing - indeed, a literal read would make it
equivalent to multiplying by 2^abs(oom_adj).

I would think that the following would be enough:
* The resulting score is multiplied by two to the power of oom_adj.

Unless we assume admins don't know how exponentiation by a negative
number works :)

* Re: [take2] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-14 21:53                           ` Bryan Donlan
@ 2009-01-14 22:10                             ` Evgeniy Polyakov
  2009-01-14 22:14                             ` [take3] " Evgeniy Polyakov
  1 sibling, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14 22:10 UTC (permalink / raw)
  To: Bryan Donlan
  Cc: linux-kernel, David Rientjes, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree

On Wed, Jan 14, 2009 at 04:53:16PM -0500, Bryan Donlan (bdonlan@gmail.com) wrote:
> On Wed, Jan 14, 2009 at 12:06 PM, Evgeniy Polyakov <zbr@ioremap.net> wrote:
> 
> > + * resulted score is multiplied by the two in the power of oom_adj when it is
> > +       positive, and divided otherwise, i.e.
> > +       points <<= oom_adj when it is positive and
> > +       points >>= oom_adj otherwise
> 
> Two to the power of a negative number is equivalent to dividing by two
> to the power of said exponent's absolute value, making this paragraph
> more than a bit confusing - indeed, a literal read would make it
> equivalent to multiplying by 2^abs(oom_adj).
> 
> I would think that the following would be enough:
> * The resulting score is multiplied by two to the power of oom_adj.

Yes, I think that is enough together with the shift example.

Thanks guys, I will update the doc.

-- 
	Evgeniy Polyakov

* [take3] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-14 21:53                           ` Bryan Donlan
  2009-01-14 22:10                             ` Evgeniy Polyakov
@ 2009-01-14 22:14                             ` Evgeniy Polyakov
  2009-01-15  0:58                               ` David Rientjes
  1 sibling, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-14 22:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Bryan Donlan, David Rientjes, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree,
	Randy Dunlap


diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..eed2fbb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,32 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
 
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ 	or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ 	to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+	points <<= oom_adj when it is positive and
+	points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then killed.
+
 2.13 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 

-- 
	Evgeniy Polyakov

* Re: [take3] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-14 22:14                             ` [take3] " Evgeniy Polyakov
@ 2009-01-15  0:58                               ` David Rientjes
  2009-01-15  8:51                                 ` Evgeniy Polyakov
  2009-01-15  8:57                                 ` [take4] " Evgeniy Polyakov
  0 siblings, 2 replies; 71+ messages in thread
From: David Rientjes @ 2009-01-15  0:58 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Bryan Donlan, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree,
	Randy Dunlap

On Thu, 15 Jan 2009, Evgeniy Polyakov wrote:

> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index d105eb4..eed2fbb 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -2311,6 +2311,32 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
>  values are in the range -16 to +15, plus the special value -17, which disables
>  oom-killing altogether for this process.
>  
> +The process to be killed in an out-of-memory situation is selected among all others
> +based on its badness score. This value equals the original memory size of the process
> +and is then updated according to its CPU time (utime + stime) and the
> +run time (uptime - start time). The longer it runs the smaller is the score.
> +Badness score is divided by the square root of the CPU time and then by
> +the double square root of the run time.
> +
> +Swapped out tasks are killed first. Half of each child's memory size is added to
> +the parent's score if they do not share the same memory. Thus forking servers
> +are the prime candidates to be killed. Having only one 'hungry' child will make
> +parent less preferable than the child.
> +
> +/proc/<pid>/oom_score shows process' current badness score.
> +
> +The following heuristics are then applied:
> + * if the task was reniced, its score doubles
> + * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
> + 	or CAP_SYS_RAWIO) have their score divided by 4
> + * if oom condition happened in one cpuset and checked task does not belong
> + 	to it, its score is divided by 8
> + * the resulting score is multiplied by two to the power of oom_adj, i.e.
> +	points <<= oom_adj when it is positive and
> +	points >>= -(oom_adj) otherwise
> +
> +The task with the highest badness score is then killed.
> +

Not quite, even after a task is selected for oom kill, the oom killer 
still prefers to kill one of its children first if any have a different 
mm.  See oom_kill_process().
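
The child preference described above can be sketched in userspace C roughly
as follows. This is only an illustration under stated assumptions: the
struct toy_task type and pick_victim() helper are hypothetical names, an
integer mm_id stands in for comparing mm pointers, and the real logic lives
in oom_kill_process().

```c
#include <stddef.h>

#define OOM_DISABLE (-17)	/* oom_adj value that disables oom-killing */

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct toy_task {
	int mm_id;			/* tasks sharing memory share an mm_id */
	int oom_adj;
	struct toy_task **children;
	size_t nr_children;
};

/*
 * Given the task with the highest badness score, return the task that
 * would actually be killed: the first child that has a different mm and
 * has not disabled oom-killing, otherwise the selected task itself.
 */
struct toy_task *pick_victim(struct toy_task *selected)
{
	size_t i;

	for (i = 0; i < selected->nr_children; i++) {
		struct toy_task *child = selected->children[i];

		if (child->mm_id != selected->mm_id &&
		    child->oom_adj != OOM_DISABLE)
			return child;
	}
	return selected;
}
```

With this sketch, a parent whose only children are threads sharing its
memory, or children that set oom_adj to -17, ends up being the victim itself.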

You also don't mention the exception of OOM_DISABLE (oom_adj score of -17) 
in your formula for how oom_adj impacts the points value.  Although it's 
already explained earlier, it should be mentioned here since oom_adj is 
an int and a right shift of 17 does not guarantee `points' will be 0.
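
To make the point about the shift concrete, here is a minimal sketch of the
oom_adj scaling with the -17 case handled separately. The helper name
scale_by_oom_adj() is hypothetical, and returning 0 for OOM_DISABLE is an
assumption of this sketch about how to exclude such a task, not a statement
of how the kernel does it.

```c
#define OOM_DISABLE (-17)	/* special oom_adj value discussed in the thread */

/*
 * Hypothetical helper, not kernel code: scale a badness score by
 * 2^oom_adj using shifts, as the proposed documentation text describes.
 * oom_adj is expected to be in the documented range -16..+15, or -17.
 */
unsigned long scale_by_oom_adj(unsigned long points, int oom_adj)
{
	if (oom_adj == OOM_DISABLE)
		return 0;			/* never a candidate in this sketch */
	if (oom_adj > 0)
		return points << oom_adj;	/* multiply by 2^oom_adj */
	return points >> -oom_adj;		/* divide by 2^(-oom_adj) */
}
```

For a score of 1 << 20, a plain right shift by 17 still leaves 8, which is
why treating -17 as just another shift amount would not protect the task.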

* Re: [take3] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-15  0:58                               ` David Rientjes
@ 2009-01-15  8:51                                 ` Evgeniy Polyakov
  2009-01-15  8:57                                 ` [take4] " Evgeniy Polyakov
  1 sibling, 0 replies; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-15  8:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-kernel, Bryan Donlan, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree,
	Randy Dunlap

On Wed, Jan 14, 2009 at 04:58:41PM -0800, David Rientjes (rientjes@google.com) wrote:
> > +
> > +The task with the highest badness score is then killed.
> > +
> 
> Not quite, even after a task is selected for oom kill, the oom killer 
> still prefers to kill one of its children first if any have a different 
> mm.  See oom_kill_process().

Ok, if it was not clear from the description.

> You also don't mention the exception of OOM_DISABLE (oom_adj score of -17) 
> in your formula for how oom_adj impacts the points value.  Although it's 
> already explained earlier, it should be mentioned here since oom_adj is 
> an int and a right shift of 17 does not guarantee `points' will be 0.

It is written several lines above.

-- 
	Evgeniy Polyakov

* [take4] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-15  0:58                               ` David Rientjes
  2009-01-15  8:51                                 ` Evgeniy Polyakov
@ 2009-01-15  8:57                                 ` Evgeniy Polyakov
  2009-01-15 11:13                                   ` David Rientjes
  1 sibling, 1 reply; 71+ messages in thread
From: Evgeniy Polyakov @ 2009-01-15  8:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Rientjes, Bryan Donlan, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree,
	Randy Dunlap

Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb4..4902966 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -2311,6 +2311,34 @@ increase the likelihood of this process being killed by the oom-killer.  Valid
 values are in the range -16 to +15, plus the special value -17, which disables
 oom-killing altogether for this process.
 
+The process to be killed in an out-of-memory situation is selected among all others
+based on its badness score. This value equals the original memory size of the process
+and is then updated according to its CPU time (utime + stime) and the
+run time (uptime - start time). The longer it runs the smaller is the score.
+Badness score is divided by the square root of the CPU time and then by
+the double square root of the run time.
+
+Swapped out tasks are killed first. Half of each child's memory size is added to
+the parent's score if they do not share the same memory. Thus forking servers
+are the prime candidates to be killed. Having only one 'hungry' child will make
+parent less preferable than the child.
+
+/proc/<pid>/oom_score shows process' current badness score.
+
+The following heuristics are then applied:
+ * if the task was reniced, its score doubles
+ * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
+ 	or CAP_SYS_RAWIO) have their score divided by 4
+ * if oom condition happened in one cpuset and checked task does not belong
+ 	to it, its score is divided by 8
+ * the resulting score is multiplied by two to the power of oom_adj, i.e.
+	points <<= oom_adj when it is positive and
+	points >>= -(oom_adj) otherwise
+
+The task with the highest badness score is then selected and its children
+are killed first; the process itself is killed only when it has no children
+or when some of them have disabled oom-killing as described above.
+
 2.13 /proc/<pid>/oom_score - Display current oom-killer score
 -------------------------------------------------------------
 


-- 
	Evgeniy Polyakov

* Re: [take4] OOM documentation update [was: Linux killed Kenny, bastard!]
  2009-01-15  8:57                                 ` [take4] " Evgeniy Polyakov
@ 2009-01-15 11:13                                   ` David Rientjes
  0 siblings, 0 replies; 71+ messages in thread
From: David Rientjes @ 2009-01-15 11:13 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: linux-kernel, Bryan Donlan, Balbir Singh, Alan Cox, Dave Jones,
	Andrew Morton, Linus Torvalds, Theodore Tso, Matthias Andree,
	Randy Dunlap

On Thu, 15 Jan 2009, Evgeniy Polyakov wrote:

> Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
> 

Acked-by: David Rientjes <rientjes@google.com>
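
For readers who want to follow the arithmetic in the documentation text
above, here is a rough userspace sketch of the described score. It is not
the kernel's badness(): the struct toy_input type, its field names and
toy_badness() are hypothetical, the CPU time and run time are assumed to be
passed in already scaled to whatever units the kernel uses, and the
OOM_DISABLE case is left out.

```c
/* Integer square root (floor); a linear search keeps the sketch short. */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while (r + 1 <= x / (r + 1))	/* (r + 1)^2 <= x, without overflow */
		r++;
	return r;
}

/* Hypothetical inputs; none of these names exist in the kernel. */
struct toy_input {
	unsigned long mem_size;		/* memory size of the task */
	unsigned long child_mem;	/* total memory of children with their own mm */
	unsigned long cpu_time;		/* utime + stime, pre-scaled (assumption) */
	unsigned long run_time;		/* uptime - start time, pre-scaled (assumption) */
	int reniced;			/* non-zero if the task was reniced */
	int privileged;			/* non-zero for the capabilities listed above */
	int other_cpuset;		/* non-zero if outside the OOM cpuset */
	int oom_adj;			/* -16..+15 */
};

unsigned long toy_badness(const struct toy_input *in)
{
	/* Start from memory size plus half of the children's memory. */
	unsigned long points = in->mem_size + in->child_mem / 2;

	if (in->cpu_time)
		points /= isqrt(in->cpu_time);
	if (in->run_time)
		points /= isqrt(isqrt(in->run_time));	/* "double square root" */

	if (in->reniced)
		points *= 2;
	if (in->privileged)
		points /= 4;
	if (in->other_cpuset)
		points /= 8;

	if (in->oom_adj > 0)
		points <<= in->oom_adj;
	else
		points >>= -in->oom_adj;

	return points;
}
```

As an example, a task with mem_size 1600, cpu_time 16 and run_time 256 gets
1600 / 4 / 4 = 100 points in this sketch; renicing it doubles that to 200,
and an oom_adj of -1 halves it to 50.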

Thread overview: 71+ messages
2009-01-12 15:33 Linux killed Kenny, bastard! Evgeniy Polyakov
2009-01-12 15:44 ` Dave Jones
2009-01-12 15:48   ` Evgeniy Polyakov
2009-01-12 15:51     ` Alan Cox
2009-01-12 15:52       ` Evgeniy Polyakov
2009-01-12 21:29         ` Chris Snook
2009-01-12 21:42           ` Evgeniy Polyakov
2009-01-13 13:52       ` [why oom_adj does not work] " Evgeniy Polyakov
2009-01-13 14:06         ` Alan Cox
2009-01-13 14:24           ` Evgeniy Polyakov
2009-01-13 15:00             ` Balbir Singh
2009-01-13 15:21               ` Evgeniy Polyakov
2009-01-13 18:04                 ` Valdis.Kletnieks
2009-01-13 19:46                 ` David Rientjes
2009-01-13 21:33                   ` Evgeniy Polyakov
2009-01-13 21:39                     ` David Rientjes
2009-01-13 22:05                       ` Evgeniy Polyakov
2009-01-14 16:12                       ` OOM documentation update [was: Linux killed Kenny, bastard!] Evgeniy Polyakov
2009-01-14 17:06                         ` [take2] " Evgeniy Polyakov
2009-01-14 21:34                           ` Randy Dunlap
2009-01-14 21:53                           ` Bryan Donlan
2009-01-14 22:10                             ` Evgeniy Polyakov
2009-01-14 22:14                             ` [take3] " Evgeniy Polyakov
2009-01-15  0:58                               ` David Rientjes
2009-01-15  8:51                                 ` Evgeniy Polyakov
2009-01-15  8:57                                 ` [take4] " Evgeniy Polyakov
2009-01-15 11:13                                   ` David Rientjes
2009-01-12 15:49 ` Linux killed Kenny, bastard! Alan Cox
2009-01-12 15:50   ` Evgeniy Polyakov
2009-01-12 15:52     ` Alan Cox
2009-01-12 15:56       ` Evgeniy Polyakov
2009-01-12 16:19         ` Alan Cox
2009-01-12 16:29           ` Evgeniy Polyakov
2009-01-12 23:00             ` Bill Davidsen
2009-01-12 23:17               ` Evgeniy Polyakov
2009-01-13  1:53                 ` David Rientjes
2009-01-13  8:52                   ` Evgeniy Polyakov
2009-01-13  9:54                     ` David Rientjes
2009-01-13 11:54                       ` Evgeniy Polyakov
2009-01-13 12:15                         ` Alan Cox
2009-01-13 12:29                           ` Evgeniy Polyakov
2009-01-13 13:19                             ` Theodore Tso
2009-01-13 13:35                               ` Evgeniy Polyakov
2009-01-14  0:24                                 ` Bill Davidsen
2009-01-14  0:35                                   ` Evgeniy Polyakov
2009-01-13 13:47                               ` Alan Cox
2009-01-13 19:36                             ` David Rientjes
2009-01-13 21:46                               ` Evgeniy Polyakov
2009-01-13 22:49                                 ` Theodore Tso
2009-01-13 23:02                                   ` Evgeniy Polyakov
2009-01-14  1:11                                     ` Theodore Tso
2009-01-14  1:20                                       ` Evgeniy Polyakov
2009-01-14  4:06                                         ` Theodore Tso
2009-01-13 23:10                                 ` David Rientjes
2009-01-13 23:35                                   ` Evgeniy Polyakov
2009-01-13 23:43                                     ` David Rientjes
2009-01-13 23:55                                       ` Evgeniy Polyakov
2009-01-14  0:32                                         ` David Rientjes
2009-01-14  0:53                                           ` Evgeniy Polyakov
2009-01-14  4:23                                     ` Valdis.Kletnieks
2009-01-14  9:07                                       ` Evgeniy Polyakov
2009-01-13 19:15                         ` David Rientjes
2009-01-13 22:00                           ` Evgeniy Polyakov
2009-01-13 23:26                         ` Valdis.Kletnieks
2009-01-13 23:36                           ` Evgeniy Polyakov
2009-01-13 13:41                       ` Jan-Frode Myklebust
2009-01-13 13:59                         ` Alan Cox
2009-01-12 16:22         ` Dave Jones
2009-01-12 16:28           ` Evgeniy Polyakov
2009-01-13 16:35 ` KOSAKI Motohiro
2009-01-13 22:04   ` Evgeniy Polyakov
