linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Improving OOM killer
@ 2010-02-01 22:02 Lubos Lunak
  2010-02-01 23:53 ` David Rientjes
                   ` (3 more replies)
  0 siblings, 4 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-01 22:02 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, Balbir Singh,
	Nick Piggin, Jiri Kosina


 Hello,

 I'd like to suggest some changes to the OOM killer code that I believe 
improve its behaviour - here it turns it from something completely useless and 
harmful into something that works very well. I'm first posting changes for 
review, I'll reformat the patch later as needed if approved. Given that it's 
been like this for about a decade, I get the feeling there must be some 
strange catch.

 My scenario is working in a KDE desktop session and accidentally running 
parallel make in doc/ subdirectory of sources of a KDE module. As I use 
distributed compiling, I run make with -j20 or more, but, as the tool used for 
processing KDE documentation is quite memory-intensive, running this many 
of them is more than enough to consume all the 2GB RAM in the machine. What 
happens in that case is that the machine becomes virtually unresponsible, 
where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd 
actually redeem the situation. If I wait long enough for something to happen, 
which can be even hours, the action that ends the situation is killing one of 
the most vital KDE processes, rendering the whole session useless and making 
me lose all unsaved data.

 The process tree looks roughly like this:

init
  |- kdeinit
  |  |- ksmserver
  |  |  |- kwin
  |  |- <other>
  |- konsole
     |- make
        |- sh
        |  |- meinproc4
        |- sh
        |  |- meinproc4
        |- <etc>

 What happens is that OOM killer usually selects either ksmserver (KDE session 
manager) or kdeinit (KDE master process that spawns most KDE processes). Note 
that in either case OOM killer does not reach the point of killing the actual 
offender - it will randomly kill in the tree under kdeinit until it decides 
to kill ksmserver, which means terminating the desktop session. As konsole is 
a KUniqueApplication, it forks into background and gets reparented to init, 
thus getting away from the kdeinit subtree. Since the memory pressure is 
distributed among several meinproc4 processes, the badness does not get 
summed up in its make grandparent, as badness() does this only for direct 
parents.

 Each meinproc4 process still uses a considerable amount of memory, so one 
could assume that the situation would be solved by simply killing them one by 
one, but it is not so because of using what I consider poor metric for 
measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of 
the address space taken by the process, which in practice does not say much 
about how much memory the process actually uses. For example, /proc/*/status 
for one selected KDE process:

VmPeak:   534676 kB
VmSize:   528340 kB
VmLck:         0 kB
VmHWM:     73464 kB
VmRSS:     73388 kB
VmData:   142332 kB
VmStk:        92 kB
VmExe:        44 kB
VmLib:     91232 kB
VmPTE:       716 kB

And various excerpts from /proc/*/smaps for this process:
...
7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0
Size:               8192 kB
Rss:                  16 kB
Referenced:           16 kB
...
7f7b40055000-7f7b44000000 ---p 00000000 00:00 0
Size:              65196 kB
Rss:                   0 kB
Referenced:            0 kB
...
7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01 
790267                     /usr/lib64/kde4/libnsplugin.so
Size:               2048 kB
Rss:                   0 kB
Referenced:            0 kB
...
7f7b48300000-7f7b4927d000 rw-s 00000000 08:01 
58690                      /var/tmp/kdecache-seli/kpc/kde-icon-cache.data
Size:              15860 kB
Rss:                  24 kB
Referenced:           24 kB

 I assume the first one is stack, search me what the second and third ones are 
(there appears to be one such mapping as the third one for each .so used), 
the last one is a mapping of a large cache file that's nevertheless rarely 
used extensively and even then it's backed by a file. In other words, none of 
this actually uses much of real memory, yet right now it's the process that 
would get killed for using about 70MB memory, even though it's not the 
offender. The offender scores only about 1/3 of its badness, even though it 
uses almost the double amount of memory:

VmPeak:   266508 kB
VmSize:   266504 kB
VmLck:         0 kB
VmHWM:    118208 kB
VmRSS:    118208 kB
VmData:    98512 kB
VmStk:        84 kB
VmExe:        60 kB
VmLib:     48944 kB
VmPTE:       536 kB

And the offender is only 14th in the list of badness candidates. Speaking of 
which, the following is quite useful for seeing all processes sorted by 
badness:

ls /proc/*/oom_score | grep -v self | sed 's/\(.*\)\/\(.*\)/echo -n "\1 "; \
echo -n "`cat \1\/\2 2>\/dev\/null` "; readlink \1\/exe || echo/'| sh | \
sort -nr +1


 Therefore, I suggest doing the following changes in mm/oom_kill.c :

=====
--- linux-2.6.31/mm/oom_kill.c.sav      2010-02-01 22:00:41.614838540 +0100
+++ linux-2.6.31/mm/oom_kill.c  2010-02-01 22:01:08.773757932 +0100
@@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
        /*
         * The memory size of the process is the basis for the badness.
         */
-       points = mm->total_vm;
+       points = get_mm_rss(mm);

        /*
         * After this unlock we can no longer dereference local variable `mm'
@@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
                return ULONG_MAX;

        /*
-        * Processes which fork a lot of child processes are likely
-        * a good choice. We add half the vmsize of the children if they
-        * have an own mm. This prevents forking servers to flood the
-        * machine with an endless amount of children. In case a single
-        * child is eating the vast majority of memory, adding only half
-        * to the parents will make the child our kill candidate of choice.
-        */
-       list_for_each_entry(child, &p->children, sibling) {
-               task_lock(child);
-               if (child->mm != mm && child->mm)
-                       points += child->mm->total_vm/2 + 1;
-               task_unlock(child);
-       }
-
-       /*
         * CPU time is in tens of seconds and run time is in thousands
          * of seconds. There is no particular reason for this other than
          * that it turned out to work very well in practice.
=====

 In other words, use VmRSS for measuring memory usage instead of VmSize, and 
remove child accumulating.

 I hope the above is good enough reason for the first change. VmSize includes 
things like read-only mappings, memory mappings that is actually unused, 
mappings backed by a file, mappings from video drivers, and so on. VmRSS is 
actual real memory used, which is what mostly matters here. While it may not 
be perfect, it is certainly an improvement.

 The second change should be done on the basis that it does more harm than 
good. In this specific case, it does not help to identify the source of the 
problem, and it incorrectly identifies kdeinit as the problem solely on the 
basis that it spawned many other processes. I think it's already quite hinted 
that this is a problem by the fact that you had to add a special protection 
for init - any session manager, process launcher or even xterm used for 
launching apps is yet another init.

 I also have problems finding a case where the child accounting would actually 
help. I mean, in practice, I can certainly come up with something in theory, 
and this looks to me like a solution to a very synthesized problem. In which 
realistic case will one process launch a limited number of children, where 
all of them will consume memory, but just killing the children one by one 
won't avoid the problem reasonably? This is unlikely to avoid a forkbomb, as 
in that case the number of children will be the problem. It is not necessary 
for just one children misbehaving and being restarted, nor will it work 
there. So what is that supposed to fix, and is it more likely than the case 
of a process launching several unrelated children?

 If the children accounting is supposed to handle cases like forked children 
of Apache, then I suggest it is adjusted only to count children that have 
been forked from the parent but there has been no exec(). I'm afraid I don't 
know how to detect that.


 When running a kernel with these changes applied, I can safely do the 
above-described case of running parallel doc generation in KDE. No clearly 
innocent process is selected for killing, the first choice is always an 
offender.

 Moreover, the remedy is almost instant, there is only a fraction of second of 
when the machine is overloaded by the I/O of swapping pages in and out (I do 
not use swap, but there is a large amount of memory used by read-only 
mappings of binaries, libraries or various other files that is in the 
original case rendering the machine unresponsive - I assume this is because 
the kernel tries to kill an innocent process, but the offenders immediatelly 
consume anything that is freed, requiring even memory used by code that is to 
be executed to be swapped in from files again).

 I consider the patches to be definite improvements, so if they are ok, I will 
format them as necessary. Now, what is the catch?

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-01 22:02 Improving OOM killer Lubos Lunak
@ 2010-02-01 23:53 ` David Rientjes
  2010-02-02 21:10   ` Lubos Lunak
  2010-02-03  7:50 ` KOSAKI Motohiro
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-01 23:53 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro,
	Balbir Singh, Nick Piggin, Jiri Kosina

On Mon, 1 Feb 2010, Lubos Lunak wrote:

>  Hello,
> 
>  I'd like to suggest some changes to the OOM killer code that I believe 
> improve its behaviour - here it turns it from something completely useless and 
> harmful into something that works very well. I'm first posting changes for 
> review, I'll reformat the patch later as needed if approved. Given that it's 
> been like this for about a decade, I get the feeling there must be some 
> strange catch.
> 

Thanks for your detailed explanation!  It will be very helpful to have 
your perspective and enthusiasm as we address oom killer issues.

I don't quite understand how you can say the oom killer is "completely 
useless and harmful."  It certainly fulfills its purpose, which is to kill 
a memory hogging task so that a page allocation may succeed when reclaim 
has failed to free any memory.  It's useful in that regard, although I 
don't think you'll find any argument that killing user tasks could always 
be described as "harmful" in some way, at least to somebody :)

>  My scenario is working in a KDE desktop session and accidentally running 
> parallel make in doc/ subdirectory of sources of a KDE module. As I use 
> distributed compiling, I run make with -j20 or more, but, as the tool used for 
> processing KDE documentation is quite memory-intensive, running this many 
> of them is more than enough to consume all the 2GB RAM in the machine. What 
> happens in that case is that the machine becomes virtually unresponsible, 
> where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd 
> actually redeem the situation. If I wait long enough for something to happen, 
> which can be even hours, the action that ends the situation is killing one of 
> the most vital KDE processes, rendering the whole session useless and making 
> me lose all unsaved data.
> 

That's the point at which direct reclaim can no longer free any memory and 
we must invoke the oom killer as opposed to potentially deadlocking the 
system.

>  The process tree looks roughly like this:
> 
> init
>   |- kdeinit
>   |  |- ksmserver
>   |  |  |- kwin
>   |  |- <other>
>   |- konsole
>      |- make
>         |- sh
>         |  |- meinproc4
>         |- sh
>         |  |- meinproc4
>         |- <etc>
> 
>  What happens is that OOM killer usually selects either ksmserver (KDE session 
> manager) or kdeinit (KDE master process that spawns most KDE processes). Note 
> that in either case OOM killer does not reach the point of killing the actual 
> offender - it will randomly kill in the tree under kdeinit until it decides 
> to kill ksmserver, which means terminating the desktop session. As konsole is 
> a KUniqueApplication, it forks into background and gets reparented to init, 
> thus getting away from the kdeinit subtree. Since the memory pressure is 
> distributed among several meinproc4 processes, the badness does not get 
> summed up in its make grandparent, as badness() does this only for direct 
> parents.
> 

There's no randomness involved in selecting a task to kill; the oom killer 
does, however, try to kill a child with seperate memory first instead of 
the parent and that selection depends on the ordering of the selected 
task's child list.  As I mentioned very early this morning, that could 
certainly be improved to kill the child with the highest badness() score 
first.  Something like this untested patch could do it:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -439,6 +439,9 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    const char *message)
 {
 	struct task_struct *c;
+	struct task_struct *victim = NULL;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -455,14 +458,20 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
 					message, task_pid_nr(p), p->comm, points);
 
-	/* Try to kill a child first */
+	/* Try to kill the most memory-hogging child first */
+	do_posix_clock_monotonic_gettime(&uptime);
 	list_for_each_entry(c, &p->children, sibling) {
+		unsigned long cpoints;
+
 		if (c->mm == p->mm)
 			continue;
-		if (!oom_kill_task(c))
-			return 0;
+		cpoints = badness(c, uptime.tv_sec);
+		if (cpoints > victim_points) {
+			victim = c;
+			victim_points = cpoints;
+		}
 	}
-	return oom_kill_task(p);
+	return oom_kill_task(victim ? victim : p);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR

This works because the conditions on which oom_kill_task(victim) fails is 
when !victim->mm or victim->signal->oom_adj == OOM_DISABLE which we're 
guarded against since badness(victim) returns 0 for both.

The process tree that you posted shows a textbook case for using 
/proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is 
protected from getting selected for oom kill.  In your own words, this 
"spawns most KDE processes," so it's an ideal place to set an oom_adj 
value of OOM_DISABLE since that value is inheritable to children and, 
thus, all children are implicitly protected as well.

>  Each meinproc4 process still uses a considerable amount of memory, so one 
> could assume that the situation would be solved by simply killing them one by 
> one, but it is not so because of using what I consider poor metric for 
> measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of 
> the address space taken by the process, which in practice does not say much 
> about how much memory the process actually uses. For example, /proc/*/status 
> for one selected KDE process:
> 
> VmPeak:   534676 kB
> VmSize:   528340 kB
> VmLck:         0 kB
> VmHWM:     73464 kB
> VmRSS:     73388 kB
> VmData:   142332 kB
> VmStk:        92 kB
> VmExe:        44 kB
> VmLib:     91232 kB
> VmPTE:       716 kB
> 
> And various excerpts from /proc/*/smaps for this process:
> ...
> 7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0
> Size:               8192 kB
> Rss:                  16 kB
> Referenced:           16 kB
> ...
> 7f7b40055000-7f7b44000000 ---p 00000000 00:00 0
> Size:              65196 kB
> Rss:                   0 kB
> Referenced:            0 kB
> ...
> 7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01 
> 790267                     /usr/lib64/kde4/libnsplugin.so
> Size:               2048 kB
> Rss:                   0 kB
> Referenced:            0 kB
> ...
> 7f7b48300000-7f7b4927d000 rw-s 00000000 08:01 
> 58690                      /var/tmp/kdecache-seli/kpc/kde-icon-cache.data
> Size:              15860 kB
> Rss:                  24 kB
> Referenced:           24 kB
> 
>  I assume the first one is stack, search me what the second and third ones are 
> (there appears to be one such mapping as the third one for each .so used), 
> the last one is a mapping of a large cache file that's nevertheless rarely 
> used extensively and even then it's backed by a file. In other words, none of 
> this actually uses much of real memory, yet right now it's the process that 
> would get killed for using about 70MB memory, even though it's not the 
> offender. The offender scores only about 1/3 of its badness, even though it 
> uses almost the double amount of memory:
> 
> VmPeak:   266508 kB
> VmSize:   266504 kB
> VmLck:         0 kB
> VmHWM:    118208 kB
> VmRSS:    118208 kB
> VmData:    98512 kB
> VmStk:        84 kB
> VmExe:        60 kB
> VmLib:     48944 kB
> VmPTE:       536 kB
> 

Using VmSize, however, allows us to define the most important task to kill 
for the oom killer: memory leakers.  Memory leakers are the single most 
important tasks to identify with the oom killer and aren't obvious when 
using rss because leaked memory does not stay resident in RAM.  I 
understand your system may not have such a leaker and it is simply 
overcommitted on a 2GB machine, but using rss loses that ability.  It also 
makes tuning oom killer priorities with /proc/pid/oom_adj almost 
impossible since a task's rss is highly dynamic and we cannot speculate on 
the state of the VM at the time of oom.

>  Therefore, I suggest doing the following changes in mm/oom_kill.c :
> 
> =====
> --- linux-2.6.31/mm/oom_kill.c.sav      2010-02-01 22:00:41.614838540 +0100
> +++ linux-2.6.31/mm/oom_kill.c  2010-02-01 22:01:08.773757932 +0100
> @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
>         /*
>          * The memory size of the process is the basis for the badness.
>          */
> -       points = mm->total_vm;
> +       points = get_mm_rss(mm);
> 
>         /*
>          * After this unlock we can no longer dereference local variable `mm'
> @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
>                 return ULONG_MAX;
> 
>         /*
> -        * Processes which fork a lot of child processes are likely
> -        * a good choice. We add half the vmsize of the children if they
> -        * have an own mm. This prevents forking servers to flood the
> -        * machine with an endless amount of children. In case a single
> -        * child is eating the vast majority of memory, adding only half
> -        * to the parents will make the child our kill candidate of choice.
> -        */
> -       list_for_each_entry(child, &p->children, sibling) {
> -               task_lock(child);
> -               if (child->mm != mm && child->mm)
> -                       points += child->mm->total_vm/2 + 1;
> -               task_unlock(child);
> -       }
> -
> -       /*
>          * CPU time is in tens of seconds and run time is in thousands
>           * of seconds. There is no particular reason for this other than
>           * that it turned out to work very well in practice.
> =====
> 
>  In other words, use VmRSS for measuring memory usage instead of VmSize, and 
> remove child accumulating.
> 
>  I hope the above is good enough reason for the first change. VmSize includes 
> things like read-only mappings, memory mappings that is actually unused, 
> mappings backed by a file, mappings from video drivers, and so on. VmRSS is 
> actual real memory used, which is what mostly matters here. While it may not 
> be perfect, it is certainly an improvement.
> 

It's not for a large number of users, the consumer of the largest amount 
of rss is not necessarily the task we always want to kill.  Just because 
an order-0 page allocation fails does not mean we want to kill the task 
that would free the largest amount of RAM.

I understand that KDE is extremely important to your work environment and 
if you lose it, it seems like a failure of Linux and the VM.  However, the 
kernel cannot possibly know what applications you believe to be the most 
important.  For that reason, userspace is able to tune the badness() score 
by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit.  You 
have the ability to protect KDE from getting oom killed, you just need to 
use it.

The natural response to this is: well, you can't possibly expect all users 
to tune /proc/pid/oom_adj just in case we are oom.  I can expect KDE to 
protect itself, though, since it is of sufficient importance to its users 
that it should protect itself, as I expect the same of the other examples 
that have arisen in recent months concerning Xorg.

There's a second caveat, however, to just protecting applications from the 
oom killer using OOM_DISABLE.  And that goes back to my original point of 
being able to define memory leakers.  /proc/pid/oom_adj actually works 
quite well to define when a task is using far more memory than expected.  
You can say the VmSize doesn't accurately reflect the amount of memory an 
application is using, but it is both static enough and well understood 
from a pure arithemtic standpoint for what a sane range is for a 
particular application.  It is then possible to use oom_adj to define when 
that memory usage is egregious and considered "rogue" but the oom killer 
so that it warrants being killed.

>  The second change should be done on the basis that it does more harm than 
> good. In this specific case, it does not help to identify the source of the 
> problem, and it incorrectly identifies kdeinit as the problem solely on the 
> basis that it spawned many other processes. I think it's already quite hinted 
> that this is a problem by the fact that you had to add a special protection 
> for init - any session manager, process launcher or even xterm used for 
> launching apps is yet another init.
> 

The kernel can't be expected to know that, you must tell it via 
/proc/pid/oom_adj.

>  I also have problems finding a case where the child accounting would actually 
> help. I mean, in practice, I can certainly come up with something in theory, 
> and this looks to me like a solution to a very synthesized problem. In which 
> realistic case will one process launch a limited number of children, where 
> all of them will consume memory, but just killing the children one by one 
> won't avoid the problem reasonably? This is unlikely to avoid a forkbomb, as 
> in that case the number of children will be the problem. It is not necessary 
> for just one children misbehaving and being restarted, nor will it work 
> there. So what is that supposed to fix, and is it more likely than the case 
> of a process launching several unrelated children?
> 

Right, I believe Kame is working on a forkbomb detector that would replace 
this logic.

>  Moreover, the remedy is almost instant, there is only a fraction of second of 
> when the machine is overloaded by the I/O of swapping pages in and out (I do 
> not use swap, but there is a large amount of memory used by read-only 
> mappings of binaries, libraries or various other files that is in the 
> original case rendering the machine unresponsive - I assume this is because 
> the kernel tries to kill an innocent process, but the offenders immediatelly 
> consume anything that is freed, requiring even memory used by code that is to 
> be executed to be swapped in from files again).
> 

We may need to look at block congestion timeouts before invoking the oom 
killer in the page allocator.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-01 23:53 ` David Rientjes
@ 2010-02-02 21:10   ` Lubos Lunak
  2010-02-03  1:41     ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-02 21:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro,
	Balbir Singh, Nick Piggin, Jiri Kosina

On Tuesday 02 of February 2010, David Rientjes wrote:
> On Mon, 1 Feb 2010, Lubos Lunak wrote:
> >  Hello,

> I don't quite understand how you can say the oom killer is "completely
> useless and harmful."  It certainly fulfills its purpose, which is to kill
> a memory hogging task so that a page allocation may succeed when reclaim
> has failed to free any memory.

 I started the sentence with "here". And if you compare my description of what 
happens with the 5 goals listed in oom_kill.c, you can see that it fails all 
of them except for 2, and even in that case it is much better to simply 
reboot the computer. So that is why OOM killer _here_ is completely useless 
and harmful, as even panic_on_oom does a better job.

 I'm not saying it's so everywhere, presumably it works somewhere when 
somebody has written it this way, but since some of the design decisions 
appear to be rather poor for desktop systems, "here" is probably not really 
limited only to my computer either.

> >  The process tree looks roughly like this:
> >
> > init
> >   |- kdeinit
> >   |  |- ksmserver
> >   |  |  |- kwin
> >   |  |- <other>
> >   |- konsole
> >      |- make
> >         |- sh
> >         |  |- meinproc4
> >         |- sh
> >         |  |- meinproc4
> >         |- <etc>
> >
> >  What happens is that OOM killer usually selects either ksmserver (KDE
> > session manager) or kdeinit (KDE master process that spawns most KDE
> > processes). Note that in either case OOM killer does not reach the point
> > of killing the actual offender - it will randomly kill in the tree under
> > kdeinit until it decides to kill ksmserver, which means terminating the
> > desktop session. As konsole is a KUniqueApplication, it forks into
> > background and gets reparented to init, thus getting away from the
> > kdeinit subtree. Since the memory pressure is distributed among several
> > meinproc4 processes, the badness does not get summed up in its make
> > grandparent, as badness() does this only for direct parents.
>
> There's no randomness involved in selecting a task to kill;

 That was rather a figure of speech, but even if you want to take it 
literally, then from the user's point of view it is random. Badness of 
kdeinit depends on the number of children it has spawned, badness of 
ksmserver depends for example on the number and size of windows open (as its 
child kwin is a window and compositing manager).

 Not that it really matters - the net result is that OOM killer usually 
decides to kill kdeinit or ksmserver, starts killing their children, vital 
KDE processes, and since the offenders are not among them, it ends up either 
terminating the whole session by killing ksmserver or killing enough vital 
processes there to free enough memory for the offenders to finish their work 
cleanly.

> The process tree that you posted shows a textbook case for using
> /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
> protected from getting selected for oom kill.  In your own words, this
> "spawns most KDE processes," so it's an ideal place to set an oom_adj
> value of OOM_DISABLE since that value is inheritable to children and,
> thus, all children are implicitly protected as well.

 Yes, it's a textbook case, sadly textbook cases are theory and not practice. 
I didn't mention it in my first mail to keep it shorter, but we have actually 
tried it. First of all, it's rather cumbersome - as it requires root 
priviledges, there is one wrapped needed for setuid and another one to avoid 
setuid side-effects, moreover the setuid root process needs to stay running 
and unset the protection on all children, or it'd be useless again.

 Worse, it worked for about a year or two and now it has only shifted the 
problem elsewhere and that's it. We now protect kdeinit, which means the OOM 
killer's choice will very likely ksmserver then. Ok, so let's say now we 
start protecting also ksmserver, that's some additional hassle setting it up, 
but that's doable. Now there's a good chance the OOM killer's choice will be 
kwin (as a compositing manager it can have quite large mappings because of 
graphics drivers). So ok, we need to protect the window manager, but since 
that's not a hardcoded component like ksmserver, that's even more hassle.

 And, after all that, OOM killer will simply detect yet another innocent 
process. I didn't mention that, but the memory statistics I presented for one 
selected KDE process in my original mail were actually for an ordinary KDE 
application - Konqueror showing a web page. Yet, as you can read in my 
original mail, even though it used only about half memory of what the 
offender used, it still scored almost tripple of its badness score.

 So unless you would suggest I implement my own dynamic badness handling in 
userspace, which I hope we can all agree is nonsense, then oom_adj is a 
cumbersome non-solution here. It may work in some setups, but it doesn't for 
the desktop.

> Using VmSize, however, allows us to define the most important task to kill
> for the oom killer: memory leakers.  Memory leakers are the single most
> important tasks to identify with the oom killer and aren't obvious when
> using rss because leaked memory does not stay resident in RAM.  I
> understand your system may not have such a leaker and it is simply
> overcommitted on a 2GB machine, but using rss loses that ability.

 Interesting point. Am I getting it right that you're saying that VmRSS is 
unsuitable because badness should take into account not only the RAM used by 
the process but also the swap space used by the process? If yes, then this 
rather brings up the question why doesn't the badness calculation then do it 
and uses VmSize instead?

 I mean, as already demonstrated in the original mail, VmSize clearly can be 
very wrong as a representation of memory used. I would actually argue that 
VmRSS is still better, as the leaker would eventually fill the swap and start 
taking up RAM, but either way, how about this then?

-       points = mm->total_vm;
+       points = get_mm_rss(mm) + 
get_mm_space_used_in_swap_but_not_in_other_places_like_file_backing(mm);

 (I don't know if there's a function doing the latter or how to count it. 
Probably not exactly trivial given that I have experience 
with /proc/*/stat*-using tools like top reporting rather wrong numbers for 
swap usage of processes.)

> It also makes tuning oom killer priorities with /proc/pid/oom_adj almost
> impossible since a task's rss is highly dynamic and we cannot speculate on
> the state of the VM at the time of oom.

 I see. However using VmRSS and swap space together avoids this.

> >  In other words, use VmRSS for measuring memory usage instead of VmSize,
> > and remove child accumulating.
> >
> >  I hope the above is good enough reason for the first change. VmSize
> > includes things like read-only mappings, memory mappings that is actually
> > unused, mappings backed by a file, mappings from video drivers, and so
> > on. VmRSS is actual real memory used, which is what mostly matters here.
> > While it may not be perfect, it is certainly an improvement.
>
> It's not for a large number of users,

 You mean, besides all desktop users? In my experience desktop machines start 
getting rather useless when their swap usage starts nearing the total RAM 
amount, so swap should not be that significant. Moreover, again, it's still 
better than VmSize, which can be wildly inaccurate. On my desktop system it 
definitely is.

 Hmm, maybe you're thinking server setup and that's different, I don't know. 
Does the kernel have any "desktop mode"? I wouldn't mind if VmSize was used 
on servers if you insist it is better, but on desktop VmSize is just plain 
wrong. And, again, I think VmRSS+InSwap is better then either.

> the consumer of the largest amount 
> of rss is not necessarily the task we always want to kill.  Just because
> an order-0 page allocation fails does not mean we want to kill the task
> that would free the largest amount of RAM.

 It's still much better than killing the task that would free the largest 
amount of address space. And I cannot think of any better metric than 
VmRSS+InSwap. Can you?

> I understand that KDE is extremely important to your work environment and
> if you lose it, it seems like a failure of Linux and the VM.  However, the
> kernel cannot possibly know what applications you believe to be the most
> important.  For that reason, userspace is able to tune the badness() score
> by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit.  You
> have the ability to protect KDE from getting oom killed, you just need to
> use it.

 As already explained, I can't. Besides, I'm not expecting a miracle, I simply 
expect the kernel to kill the process that takes up the most memory, and the 
kernel can possibly know that, it just doesn't do it. What other evidence do 
you want to be shown that badness calculated for two processes on their 
actual memory usage differs by a multiple of 5 or more?

[snipped description of how oom_adj should help when it in fact wouldn't]

> >  I also have problems finding a case where the child accounting would
> > actually help. I mean, in practice, I can certainly come up with
> > something in theory, and this looks to me like a solution to a very
> > synthesized problem. In which realistic case will one process launch a
> > limited number of children, where all of them will consume memory, but
> > just killing the children one by one won't avoid the problem reasonably?
> > This is unlikely to avoid a forkbomb, as in that case the number of
> > children will be the problem. It is not necessary for just one children
> > misbehaving and being restarted, nor will it work there. So what is that
> > supposed to fix, and is it more likely than the case of a process
> > launching several unrelated children?
>
> Right, I believe Kame is working on a forkbomb detector that would replace
> this logic.

 Until then, can we dump the current code? Because I have provided one case 
where it makes things worse and nobody has provided any case where it makes 
things better or any other justification for its existence. There's no point 
in keeping code for which nobody knows how it improves things (in reality, 
not some textbook case).

 And, in case the justification for it is something like "Apache", can we 
fast-forward to my improved suggestion to limit this only to children that 
are forked but not exec()-ed?

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-02 21:10   ` Lubos Lunak
@ 2010-02-03  1:41     ` David Rientjes
  2010-02-03  1:52       ` KAMEZAWA Hiroyuki
  2010-02-03 22:54       ` Improving OOM killer Lubos Lunak
  0 siblings, 2 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-03  1:41 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro,
	Balbir Singh, Nick Piggin, Jiri Kosina, KAMEZAWA Hiroyuki

On Tue, 2 Feb 2010, Lubos Lunak wrote:

> > > init
> > >   |- kdeinit
> > >   |  |- ksmserver
> > >   |  |  |- kwin
> > >   |  |- <other>
> > >   |- konsole
> > >      |- make
> > >         |- sh
> > >         |  |- meinproc4
> > >         |- sh
> > >         |  |- meinproc4
> > >         |- <etc>
> > >
> > >  What happens is that OOM killer usually selects either ksmserver (KDE
> > > session manager) or kdeinit (KDE master process that spawns most KDE
> > > processes). Note that in either case OOM killer does not reach the point
> > > of killing the actual offender - it will randomly kill in the tree under
> > > kdeinit until it decides to kill ksmserver, which means terminating the
> > > desktop session. As konsole is a KUniqueApplication, it forks into
> > > background and gets reparented to init, thus getting away from the
> > > kdeinit subtree. Since the memory pressure is distributed among several
> > > meinproc4 processes, the badness does not get summed up in its make
> > > grandparent, as badness() does this only for direct parents.
> >
> > There's no randomness involved in selecting a task to kill;
> 
>  That was rather a figure of speech, but even if you want to take it 
> literally, then from the user's point of view it is random. Badness of 
> kdeinit depends on the number of children it has spawned, badness of 
> ksmserver depends for example on the number and size of windows open (as its 
> child kwin is a window and compositing manager).
> 

As I've mentioned, I believe Kame (now cc'd) is working on replacing the 
heuristic that adds the VM size for children into the parent task's 
badness score with a forkbomb detector.  That should certainly help to 
reduce oom killing for parents whose children consume a lot of memory, 
especially in your scenario where kdeinit is responsible for forking most 
KDE processes.

>  Not that it really matters - the net result is that OOM killer usually 
> decides to kill kdeinit or ksmserver, starts killing their children, vital 
> KDE processes, and since the offenders are not among them, it ends up either 
> terminating the whole session by killing ksmserver or killing enough vital 
> processes there to free enough memory for the offenders to finish their work 
> cleanly.
> 

The kernel cannot possibly know what you consider a "vital" process, for 
that understanding you need to tell it using the very powerful 
/proc/pid/oom_adj tunable.  I suspect if you were to product all of 
kdeinit's children by patching it to be OOM_DISABLE so that all threads it 
forks will inherit that value you'd actually see much improved behavior.  
I'd also encourage you to talk to the KDE developers to ensure that proper 
precautions are taken to protect it in such conditions since people who 
use such desktop environments typically don't want them to be sacrificed 
for memory.  The kernel, however, can only provide a mechanism for users 
to define what they believe to be critical since it will vary widely 
depending on how Linux is used for desktops, servers, and embedded 
devices.  Our servers, for example, have vital threads that are running on 
each machine and they need to be protected in the same way.  They set 
themselves to be OOM_DISABLE.

On a side note, it's only possible to lower an oom_adj value for a thread 
if you have CAP_SYS_RESOURCE.  So that would be a prerequisite to setting 
OOM_DISABLE for any thread.  If that's not feasible, there's a workaround: 
user tasks can always increase their own oom_adj value so that they are 
always preferred in oom conditions.  This will act to protect those vital 
tasks by preempting them from getting oom killed without any special 
capability.

> > The process tree that you posted shows a textbook case for using
> > /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
> > protected from getting selected for oom kill.  In your own words, this
> > "spawns most KDE processes," so it's an ideal place to set an oom_adj
> > value of OOM_DISABLE since that value is inheritable to children and,
> > thus, all children are implicitly protected as well.
> 
>  Yes, it's a textbook case, sadly textbook cases are theory and not practice. 
> I didn't mention it in my first mail to keep it shorter, but we have actually 
> tried it. First of all, it's rather cumbersome - as it requires root 
> priviledges, there is one wrapped needed for setuid and another one to avoid 
> setuid side-effects, moreover the setuid root process needs to stay running 
> and unset the protection on all children, or it'd be useless again.
> 

It only requires CAP_SYS_RESOURCE, actually, and your apparent difficulty 
in writing to a kernel tunable is outside the scope of this discussion, 
unfortunately.  The bottomline is that the kernel provides a very well 
defined interface for tuning the oom killer's badness heuristic so that 
userspace can define what is "vital," whether that is kdeinit, ksmserver, 
or any other thread.  The choice needs to be made to use it, however.

>  Worse, it worked for about a year or two and now it has only shifted the 
> problem elsewhere and that's it. We now protect kdeinit, which means the OOM 
> killer's choice will very likely ksmserver then. Ok, so let's say now we 
> start protecting also ksmserver, that's some additional hassle setting it up, 
> but that's doable. Now there's a good chance the OOM killer's choice will be 
> kwin (as a compositing manager it can have quite large mappings because of 
> graphics drivers). So ok, we need to protect the window manager, but since 
> that's not a hardcoded component like ksmserver, that's even more hassle.
> 

No, you don't need to protect every KDE process from the oom killer unless 
it is going to be a contender for selection.  You could certainly do so 
for completeness, but it shouldn't be required unless the nature of the 
thread demands it such that it forks many vital tasks (kdeinit) or its 
first-generation children's memory consumption can't be known either 
because it depends on how many children it can fork or their memory 
consumption is influenced by the user's work.

> > Using VmSize, however, allows us to define the most important task to kill
> > for the oom killer: memory leakers.  Memory leakers are the single most
> > important tasks to identify with the oom killer and aren't obvious when
> > using rss because leaked memory does not stay resident in RAM.  I
> > understand your system may not have such a leaker and it is simply
> > overcommitted on a 2GB machine, but using rss loses that ability.
> 
>  Interesting point. Am I getting it right that you're saying that VmRSS is 
> unsuitable because badness should take into account not only the RAM used by 
> the process but also the swap space used by the process? If yes, then this 
> rather brings up the question why doesn't the badness calculation then do it 
> and uses VmSize instead?
> 

We don't want to discount VM_RESERVED because the memory it represents 
cannot be swapped, which is also an important indicator of the overall 
memory usage of any particular application.  I understand that your 
particular use case may benefit from waiting on block congestion during 
the page allocation that triggered the oom killer without making any 
progress in direct reclaim, but discounting VM_RESERVED because of its 
(sometimes erroneous) use with VM_IO isn't an appropriate indicator of the 
thread's memory use since we're not considering the amount mapped for I/O.

>  I mean, as already demonstrated in the original mail, VmSize clearly can be 
> very wrong as a representation of memory used. I would actually argue that 
> VmRSS is still better, as the leaker would eventually fill the swap and start 
> taking up RAM, but either way, how about this then?
> 

We simply can't afford to wait for a memory leaker to fill up all of swap 
before the heuristic works to identify it, that would be extremely slow 
depending on the speed of I/O and would actually increase the probability 
of needlessly killing the wrong task because the memory leaker will fill 
up all of swap while other tasks get killed because they have a large rss.  
That heuristic works antagonist to finding the memory leaker since it will 
always keep its rss low since that memory will never be touched again once 
it is leaked.

>  Hmm, maybe you're thinking server setup and that's different, I don't know. 
> Does the kernel have any "desktop mode"? I wouldn't mind if VmSize was used 
> on servers if you insist it is better, but on desktop VmSize is just plain 
> wrong. And, again, I think VmRSS+InSwap is better then either.
> 

The heuristics are always well debated in this forum and there's little 
chance that we'll ever settle on a single formula that works for all 
possible use cases.  That makes oom_adj even more vital to the overall 
efficiency of the oom killer, I really hope you start to use it to your 
advantage.

There are a couple of things that we'll always agree on: its needless to 
kill a task that shares a different set of allowed nodes than the 
triggering page allocation, for example, since it cannot access that 
memory even if freed, and there should be some penalty for tasks that have 
a shorter uptime to break ties.  I agree that most of the penalties as 
currently implemented, however, aren't scientific and don't appropriately 
adjust the score in those cases.  Dividing the entire VM size by 2, for 
example, because the task has a certain trait isn't an ideal formula.

There's also at least one thing that we'd both like to remove: the 
factoring of each child's VM size into the badness score for the parent as 
a way of detecting a forkbomb.  Kame has been working on the forkbomb 
detector specifically to address this issue, so we should stay tuned.

The one thing that I must stress, however, is the need for us to be able 
to define what a "vital" task is and to define what a "memory leaker" is.  
Both of those are currently possible with oom_adj, so we cannot loose this 
functionality.  Changing the baseline to be rss would be highly dynamic 
and not allow us to accurately set an oom_adj value to define when a task 
is rogue on a shared system.

That said, I don't think a complete rewrite of the badness function would 
be a non-starter: we could easily make all tasks scale to have a badness 
score ranging from 0 (never kill) to 1000 (always kill), for example.  We 
just need to propose a sane set of heuristics so that we don't lose the 
configurability from userspace.

I've also disagreed before with always killing the most memory consuming 
task whenever we have an oom.  I agree with you that sometimes that task 
may be the most vital to the system and setting it to be OOM_DISABLE 
should not always be required for a simple order-0 allocation that fails 
somewhere, especially if its ~__GFP_NOFAIL.  What I've recommended to you 
would work with current mainline and at least kernels released within the 
past few years, but I think there may be many more changes we can look to 
make in the future.

> > the consumer of the largest amount 
> > of rss is not necessarily the task we always want to kill.  Just because
> > an order-0 page allocation fails does not mean we want to kill the task
> > that would free the largest amount of RAM.
> 
>  It's still much better than killing the task that would free the largest 
> amount of address space. And I cannot think of any better metric than 
> VmRSS+InSwap. Can you?
> 

Yes, baselining on total_vm so that we understand the total amount of 
memory that will no longer be mapped to that particular process if killed, 
regardless if its a shared library or not, so that we can define vital 
tasks and memory leakers from userspace.

> > I understand that KDE is extremely important to your work environment and
> > if you lose it, it seems like a failure of Linux and the VM.  However, the
> > kernel cannot possibly know what applications you believe to be the most
> > important.  For that reason, userspace is able to tune the badness() score
> > by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit.  You
> > have the ability to protect KDE from getting oom killed, you just need to
> > use it.
> 
>  As already explained, I can't. Besides, I'm not expecting a miracle, I simply 
> expect the kernel to kill the process that takes up the most memory, and the 
> kernel can possibly know that, it just doesn't do it. What other evidence do 
> you want to be shown that badness calculated for two processes on their 
> actual memory usage differs by a multiple of 5 or more?
> 

This is highly likely related to the child VM sizes being accumulated in 
the parent's badness score, we'll have to see how your results vary once 
the forkbomb detector is merged.  I disagree that we always need to kill 
the application consuming the most memory, though, we need to determine 
when it's better to simply fail a ~__GFP_NOFAIL allocation and when 
killing a smaller, lower priority task may be more beneficial to the 
user's work.

> > Right, I believe Kame is working on a forkbomb detector that would replace
> > this logic.
> 
>  Until then, can we dump the current code? Because I have provided one case 
> where it makes things worse and nobody has provided any case where it makes 
> things better or any other justification for its existence. There's no point 
> in keeping code for which nobody knows how it improves things (in reality, 
> not some textbook case).
> 

First of all, I don't think the forkbomb detector is that far in the 
future since a preliminary version of it was also posted, but I think we 
also need a way to address those particular cases in the heuristic.  There 
are real-life cases where out of memory conditions occur specifically 
because of forkbombs so not addressing the issue in the heuristic in some 
way is not appropriate.

>  And, in case the justification for it is something like "Apache", can we 
> fast-forward to my improved suggestion to limit this only to children that 
> are forked but not exec()-ed?
> 

The amount of memory you'll be freeing in simply killing such tasks will 
be minimal, I don't think that's an appropriate heuristic if they all 
share their memory with the parent.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03  1:41     ` David Rientjes
@ 2010-02-03  1:52       ` KAMEZAWA Hiroyuki
  2010-02-03  2:12         ` David Rientjes
  2010-02-03 22:54       ` Improving OOM killer Lubos Lunak
  1 sibling, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-03  1:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina

On Tue, 2 Feb 2010 17:41:41 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Tue, 2 Feb 2010, Lubos Lunak wrote:
> 
> > > > init
> > > >   |- kdeinit
> > > >   |  |- ksmserver
> > > >   |  |  |- kwin
> > > >   |  |- <other>
> > > >   |- konsole
> > > >      |- make
> > > >         |- sh
> > > >         |  |- meinproc4
> > > >         |- sh
> > > >         |  |- meinproc4
> > > >         |- <etc>
> > > >
> > > >  What happens is that OOM killer usually selects either ksmserver (KDE
> > > > session manager) or kdeinit (KDE master process that spawns most KDE
> > > > processes). Note that in either case OOM killer does not reach the point
> > > > of killing the actual offender - it will randomly kill in the tree under
> > > > kdeinit until it decides to kill ksmserver, which means terminating the
> > > > desktop session. As konsole is a KUniqueApplication, it forks into
> > > > background and gets reparented to init, thus getting away from the
> > > > kdeinit subtree. Since the memory pressure is distributed among several
> > > > meinproc4 processes, the badness does not get summed up in its make
> > > > grandparent, as badness() does this only for direct parents.
> > >
> > > There's no randomness involved in selecting a task to kill;
> > 
> >  That was rather a figure of speech, but even if you want to take it 
> > literally, then from the user's point of view it is random. Badness of 
> > kdeinit depends on the number of children it has spawned, badness of 
> > ksmserver depends for example on the number and size of windows open (as its 
> > child kwin is a window and compositing manager).
> > 
> 
> As I've mentioned, I believe Kame (now cc'd) is working on replacing the 
> heuristic that adds the VM size for children into the parent task's 
> badness score with a forkbomb detector. 

I stopped that as I mentioned. I'm heavily disappointed with myself and
would like not to touch oom-killer things for a while.

I'd like to conentrate on memcg for a while, which I've starved for these 3 months.

Then, you don't need to CC me.

Bye,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03  2:12         ` David Rientjes
@ 2010-02-03  2:12           ` KAMEZAWA Hiroyuki
  2010-02-03  2:36             ` [patch] sysctl: clean up vm related variable declarations David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-03  2:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina

On Tue, 2 Feb 2010 18:12:41 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> On Wed, 3 Feb 2010, KAMEZAWA Hiroyuki wrote:
> 
> > I stopped that as I mentioned. I'm heavily disappointed with myself and
> > would like not to touch oom-killer things for a while.
> > 
> > I'd like to conentrate on memcg for a while, which I've starved for these 3 months.
> > 
> > Then, you don't need to CC me.
> > 
> 
> I'm disappointed to hear that, it would be helpful to get some consensus 
> on the points that we can all agree on from different perspectives.  I'll 
> try to find some time to develop a solution that people will see as a move 
> in the positive direction.
> 
> On a seperate topic, do you have time to repost your sysctl cleanup for 
> Andrew from http://marc.info/?l=linux-kernel&m=126457416729672 with the
> #ifndef CONFIG_MMU fix I mentioned?
> 
I'll not. Feel free to reuse fragments I posted.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03  1:52       ` KAMEZAWA Hiroyuki
@ 2010-02-03  2:12         ` David Rientjes
  2010-02-03  2:12           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-03  2:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, KAMEZAWA Hiroyuki wrote:

> I stopped that as I mentioned. I'm heavily disappointed with myself and
> would like not to touch oom-killer things for a while.
> 
> I'd like to conentrate on memcg for a while, which I've starved for these 3 months.
> 
> Then, you don't need to CC me.
> 

I'm disappointed to hear that, it would be helpful to get some consensus 
on the points that we can all agree on from different perspectives.  I'll 
try to find some time to develop a solution that people will see as a move 
in the positive direction.

On a seperate topic, do you have time to repost your sysctl cleanup for 
Andrew from http://marc.info/?l=linux-kernel&m=126457416729672 with the
#ifndef CONFIG_MMU fix I mentioned?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [patch] sysctl: clean up vm related variable declarations
  2010-02-03  2:12           ` KAMEZAWA Hiroyuki
@ 2010-02-03  2:36             ` David Rientjes
  2010-02-03  8:07               ` KOSAKI Motohiro
  2010-02-03  8:17               ` Balbir Singh
  0 siblings, 2 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-03  2:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, minchan.kim, linux-kernel,
	linux-mm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, there are many "extern" declaration in kernel/sysctl.c. "extern"
declaration in *.c file is not appreciated in general.
And Hmm...it seems there are a few redundant declarations.

Because most of sysctl variables are defined in its own header file,
they should be declared in the same style, be done in its own *.h file.

This patch removes some VM(memory management) related sysctl's
variable declaration from kernel/sysctl.c and move them to
proper places.

[rientjes@google.com: #ifdef fixlet]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/mm.h     |    5 +++++
 include/linux/mmzone.h |    1 +
 include/linux/oom.h    |    5 +++++
 kernel/sysctl.c        |   16 ++--------------
 mm/mmap.c              |    5 +++++
 5 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1305,6 +1305,7 @@ int in_gate_area_no_task(unsigned long addr);
 #define in_gate_area(task, addr) ({(void)task; in_gate_area_no_task(addr);})
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
+extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
@@ -1347,5 +1348,9 @@ extern void shake_page(struct page *p, int access);
 extern atomic_long_t mce_bad_pages;
 extern int soft_offline_page(struct page *page, int flags);
 
+#ifndef CONFIG_MMU
+extern int sysctl_nr_trim_pages;
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -760,6 +760,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1];
 int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+extern int percpu_pagelist_fraction; /* for sysctl */
 int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -43,5 +43,10 @@ static inline void oom_killer_enable(void)
 {
 	oom_killer_disabled = false;
 }
+/* for sysctl */
+extern int sysctl_panic_on_oom;
+extern int sysctl_oom_kill_allocating_task;
+extern int sysctl_oom_dump_tasks;
+
 #endif /* __KERNEL__*/
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,8 @@
 #include <linux/ftrace.h>
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
+#include <linux/mman.h>
+#include <linux/oom.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -66,11 +68,6 @@
 /* External variables not in a header file. */
 extern int C_A_D;
 extern int print_fatal_signals;
-extern int sysctl_overcommit_memory;
-extern int sysctl_overcommit_ratio;
-extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
@@ -79,14 +76,9 @@ extern unsigned int core_pipe_limit;
 extern int pid_max;
 extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
-extern int sysctl_drop_caches;
-extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int latencytop_enabled;
 extern int sysctl_nr_open_min, sysctl_nr_open_max;
-#ifndef CONFIG_MMU
-extern int sysctl_nr_trim_pages;
-#endif
 #ifdef CONFIG_RCU_TORTURE_TEST
 extern int rcutorture_runnable;
 #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */
@@ -197,10 +189,6 @@ extern struct ctl_table inotify_table[];
 extern struct ctl_table epoll_table[];
 #endif
 
-#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
-int sysctl_legacy_va_layout;
-#endif
-
 extern int prove_locking;
 extern int lock_stat;
 
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -87,6 +87,11 @@ int sysctl_overcommit_ratio = 50;	/* default is 50% */
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 struct percpu_counter vm_committed_as;
 
+#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
+/* Used by each architecture's private code and sysctl. */
+int sysctl_legacy_va_layout;
+#endif
+
 /*
  * Check that a process has enough memory to allocate a new virtual
  * mapping. 0 means there is enough memory for the allocation to

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-01 22:02 Improving OOM killer Lubos Lunak
  2010-02-01 23:53 ` David Rientjes
@ 2010-02-03  7:50 ` KOSAKI Motohiro
  2010-02-03  9:40   ` David Rientjes
  2010-02-03  8:57 ` Balbir Singh
  2010-02-03 14:49 ` Rik van Riel
  3 siblings, 1 reply; 53+ messages in thread
From: KOSAKI Motohiro @ 2010-02-03  7:50 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: kosaki.motohiro, linux-mm, linux-kernel, Andrew Morton,
	David Rientjes, Balbir Singh, Nick Piggin, Jiri Kosina

> =====
> --- linux-2.6.31/mm/oom_kill.c.sav      2010-02-01 22:00:41.614838540 +0100
> +++ linux-2.6.31/mm/oom_kill.c  2010-02-01 22:01:08.773757932 +0100
> @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
>         /*
>          * The memory size of the process is the basis for the badness.
>          */
> -       points = mm->total_vm;
> +       points = get_mm_rss(mm);
> 
>         /*
>          * After this unlock we can no longer dereference local variable `mm'
> @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
>                 return ULONG_MAX;
> 
>         /*
> -        * Processes which fork a lot of child processes are likely
> -        * a good choice. We add half the vmsize of the children if they
> -        * have an own mm. This prevents forking servers to flood the
> -        * machine with an endless amount of children. In case a single
> -        * child is eating the vast majority of memory, adding only half
> -        * to the parents will make the child our kill candidate of choice.
> -        */
> -       list_for_each_entry(child, &p->children, sibling) {
> -               task_lock(child);
> -               if (child->mm != mm && child->mm)
> -                       points += child->mm->total_vm/2 + 1;
> -               task_unlock(child);
> -       }
> -
> -       /*
>          * CPU time is in tens of seconds and run time is in thousands
>           * of seconds. There is no particular reason for this other than
>           * that it turned out to work very well in practice.
> =====
> 
>  In other words, use VmRSS for measuring memory usage instead of VmSize, and 
> remove child accumulating.
> 
>  I hope the above is good enough reason for the first change. VmSize includes 
> things like read-only mappings, memory mappings that is actually unused, 
> mappings backed by a file, mappings from video drivers, and so on. VmRSS is 
> actual real memory used, which is what mostly matters here. While it may not 
> be perfect, it is certainly an improvement.
> 
>  The second change should be done on the basis that it does more harm than 
> good. In this specific case, it does not help to identify the source of the 
> problem, and it incorrectly identifies kdeinit as the problem solely on the 
> basis that it spawned many other processes. I think it's already quite hinted 
> that this is a problem by the fact that you had to add a special protection 
> for init - any session manager, process launcher or even xterm used for 
> launching apps is yet another init.
> 
>  I also have problems finding a case where the child accounting would actually 
> help. I mean, in practice, I can certainly come up with something in theory, 
> and this looks to me like a solution to a very synthesized problem. In which 
> realistic case will one process launch a limited number of children, where 
> all of them will consume memory, but just killing the children one by one 
> won't avoid the problem reasonably? This is unlikely to avoid a forkbomb, as 
> in that case the number of children will be the problem. It is not necessary 
> for just one children misbehaving and being restarted, nor will it work 
> there. So what is that supposed to fix, and is it more likely than the case 
> of a process launching several unrelated children?
> 
>  If the children accounting is supposed to handle cases like forked children 
> of Apache, then I suggest it is adjusted only to count children that have 
> been forked from the parent but there has been no exec(). I'm afraid I don't 
> know how to detect that.
> 
> 
>  When running a kernel with these changes applied, I can safely do the 
> above-described case of running parallel doc generation in KDE. No clearly 
> innocent process is selected for killing, the first choice is always an 
> offender.
> 
>  Moreover, the remedy is almost instant, there is only a fraction of second of 
> when the machine is overloaded by the I/O of swapping pages in and out (I do 
> not use swap, but there is a large amount of memory used by read-only 
> mappings of binaries, libraries or various other files that is in the 
> original case rendering the machine unresponsive - I assume this is because 
> the kernel tries to kill an innocent process, but the offenders immediatelly 
> consume anything that is freed, requiring even memory used by code that is to 
> be executed to be swapped in from files again).
> 
>  I consider the patches to be definite improvements, so if they are ok, I will 
> format them as necessary. Now, what is the catch?

Personally, I think your use case represent to typical desktop and Linux
have to works fine on typical desktop use-case. /proc/pid/oom_adj never fit
desktop use-case. In past discussion, I'v agreed with much people. but I haven't
reach to agree with David Rientjes about this topic.

If you want to merge this patch, you need persuade him. I can't help you. sorry.




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] sysctl: clean up vm related variable declarations
  2010-02-03  2:36             ` [patch] sysctl: clean up vm related variable declarations David Rientjes
@ 2010-02-03  8:07               ` KOSAKI Motohiro
  2010-02-03  8:17               ` Balbir Singh
  1 sibling, 0 replies; 53+ messages in thread
From: KOSAKI Motohiro @ 2010-02-03  8:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh,
	minchan.kim, linux-kernel, linux-mm

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, there are many "extern" declaration in kernel/sysctl.c. "extern"
> declaration in *.c file is not appreciated in general.
> And Hmm...it seems there are a few redundant declarations.
> 
> Because most of sysctl variables are defined in its own header file,
> they should be declared in the same style, be done in its own *.h file.
> 
> This patch removes some VM(memory management) related sysctl's
> variable declaration from kernel/sysctl.c and move them to
> proper places.
> 
> [rientjes@google.com: #ifdef fixlet]
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [patch] sysctl: clean up vm related variable declarations
  2010-02-03  2:36             ` [patch] sysctl: clean up vm related variable declarations David Rientjes
  2010-02-03  8:07               ` KOSAKI Motohiro
@ 2010-02-03  8:17               ` Balbir Singh
  1 sibling, 0 replies; 53+ messages in thread
From: Balbir Singh @ 2010-02-03  8:17 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, minchan.kim, linux-kernel,
	linux-mm

* David Rientjes <rientjes@google.com> [2010-02-02 18:36:42]:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, there are many "extern" declaration in kernel/sysctl.c. "extern"
> declaration in *.c file is not appreciated in general.
> And Hmm...it seems there are a few redundant declarations.
> 
> Because most of sysctl variables are defined in its own header file,
> they should be declared in the same style, be done in its own *.h file.
> 
> This patch removes some VM(memory management) related sysctl's
> variable declaration from kernel/sysctl.c and move them to
> proper places.
> 
> [rientjes@google.com: #ifdef fixlet]
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  include/linux/mm.h     |    5 +++++
>  include/linux/mmzone.h |    1 +
>  include/linux/oom.h    |    5 +++++
>  kernel/sysctl.c        |   16 ++--------------
>  mm/mmap.c              |    5 +++++
>  5 files changed, 18 insertions(+), 14 deletions(-)
>

Looks good to me 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-01 22:02 Improving OOM killer Lubos Lunak
  2010-02-01 23:53 ` David Rientjes
  2010-02-03  7:50 ` KOSAKI Motohiro
@ 2010-02-03  8:57 ` Balbir Singh
  2010-02-03 12:10   ` Lubos Lunak
  2010-02-03 14:49 ` Rik van Riel
  3 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2010-02-03  8:57 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

* Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]:

> 
>  Hello,
> 
>  I'd like to suggest some changes to the OOM killer code that I believe 
> improve its behaviour - here it turns it from something completely useless and 
> harmful into something that works very well. I'm first posting changes for 
> review, I'll reformat the patch later as needed if approved. Given that it's 
> been like this for about a decade, I get the feeling there must be some 
> strange catch.
> 
>  My scenario is working in a KDE desktop session and accidentally running 
> parallel make in doc/ subdirectory of sources of a KDE module. As I use 
> distributed compiling, I run make with -j20 or more, but, as the tool used for 
> processing KDE documentation is quite memory-intensive, running this many 
> of them is more than enough to consume all the 2GB RAM in the machine. What 
> happens in that case is that the machine becomes virtually unresponsible, 
> where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd 
> actually redeem the situation. If I wait long enough for something to happen, 
> which can be even hours, the action that ends the situation is killing one of 
> the most vital KDE processes, rendering the whole session useless and making 
> me lose all unsaved data.
> 
>  The process tree looks roughly like this:
> 
> init
>   |- kdeinit
>   |  |- ksmserver
>   |  |  |- kwin
>   |  |- <other>
>   |- konsole
>      |- make
>         |- sh
>         |  |- meinproc4
>         |- sh
>         |  |- meinproc4
>         |- <etc>
> 
>  What happens is that OOM killer usually selects either ksmserver (KDE session 
> manager) or kdeinit (KDE master process that spawns most KDE processes). Note 
> that in either case OOM killer does not reach the point of killing the actual 
> offender - it will randomly kill in the tree under kdeinit until it decides 
> to kill ksmserver, which means terminating the desktop session. As konsole is 
> a KUniqueApplication, it forks into background and gets reparented to init, 
> thus getting away from the kdeinit subtree. Since the memory pressure is 
> distributed among several meinproc4 processes, the badness does not get 
> summed up in its make grandparent, as badness() does this only for direct 
> parents.
> 
>  Each meinproc4 process still uses a considerable amount of memory, so one 
> could assume that the situation would be solved by simply killing them one by 
> one, but it is not so because of using what I consider poor metric for 
> measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of 
> the address space taken by the process, which in practice does not say much 
> about how much memory the process actually uses. For example, /proc/*/status 
> for one selected KDE process:
> 
> VmPeak:   534676 kB
> VmSize:   528340 kB
> VmLck:         0 kB
> VmHWM:     73464 kB
> VmRSS:     73388 kB
> VmData:   142332 kB
> VmStk:        92 kB
> VmExe:        44 kB
> VmLib:     91232 kB
> VmPTE:       716 kB
> 
> And various excerpts from /proc/*/smaps for this process:
> ...
> 7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0
> Size:               8192 kB
> Rss:                  16 kB
> Referenced:           16 kB
> ...
> 7f7b40055000-7f7b44000000 ---p 00000000 00:00 0
> Size:              65196 kB
> Rss:                   0 kB
> Referenced:            0 kB
> ...
> 7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01 
> 790267                     /usr/lib64/kde4/libnsplugin.so
> Size:               2048 kB
> Rss:                   0 kB
> Referenced:            0 kB
> ...
> 7f7b48300000-7f7b4927d000 rw-s 00000000 08:01 
> 58690                      /var/tmp/kdecache-seli/kpc/kde-icon-cache.data
> Size:              15860 kB
> Rss:                  24 kB
> Referenced:           24 kB
> 
>  I assume the first one is stack, search me what the second and third ones are 
> (there appears to be one such mapping as the third one for each .so used), 
> the last one is a mapping of a large cache file that's nevertheless rarely 
> used extensively and even then it's backed by a file. In other words, none of 
> this actually uses much of real memory, yet right now it's the process that 
> would get killed for using about 70MB memory, even though it's not the 
> offender. The offender scores only about 1/3 of its badness, even though it 
> uses almost the double amount of memory:
> 
> VmPeak:   266508 kB
> VmSize:   266504 kB
> VmLck:         0 kB
> VmHWM:    118208 kB
> VmRSS:    118208 kB
> VmData:    98512 kB
> VmStk:        84 kB
> VmExe:        60 kB
> VmLib:     48944 kB
> VmPTE:       536 kB
> 
> And the offender is only 14th in the list of badness candidates. Speaking of 
> which, the following is quite useful for seeing all processes sorted by 
> badness:
> 
> ls /proc/*/oom_score | grep -v self | sed 's/\(.*\)\/\(.*\)/echo -n "\1 "; \
> echo -n "`cat \1\/\2 2>\/dev\/null` "; readlink \1\/exe || echo/'| sh | \
> sort -nr +1
> 
> 
>  Therefore, I suggest doing the following changes in mm/oom_kill.c :
> 
> =====
> --- linux-2.6.31/mm/oom_kill.c.sav      2010-02-01 22:00:41.614838540 +0100
> +++ linux-2.6.31/mm/oom_kill.c  2010-02-01 22:01:08.773757932 +0100
> @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
>         /*
>          * The memory size of the process is the basis for the badness.
>          */
> -       points = mm->total_vm;
> +       points = get_mm_rss(mm);
> 
>         /*
>          * After this unlock we can no longer dereference local variable `mm'
> @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
>                 return ULONG_MAX;
> 
>         /*
> -        * Processes which fork a lot of child processes are likely
> -        * a good choice. We add half the vmsize of the children if they
> -        * have an own mm. This prevents forking servers to flood the
> -        * machine with an endless amount of children. In case a single
> -        * child is eating the vast majority of memory, adding only half
> -        * to the parents will make the child our kill candidate of choice.
> -        */
> -       list_for_each_entry(child, &p->children, sibling) {
> -               task_lock(child);
> -               if (child->mm != mm && child->mm)
> -                       points += child->mm->total_vm/2 + 1;
> -               task_unlock(child);
> -       }
> -
> -       /*
>          * CPU time is in tens of seconds and run time is in thousands
>           * of seconds. There is no particular reason for this other than
>           * that it turned out to work very well in practice.
> =====
> 
>  In other words, use VmRSS for measuring memory usage instead of VmSize, and 
> remove child accumulating.

I am not sure of the impact of changing to RSS, although I've
personally believed that RSS based accounting is where we should go,
but we need to consider the following

1. Total VM provides data about potentially swapped pages, overcommit,
etc.
2. RSS alone is not sufficient, RSS does not account for shared pages,
so we ideally need something like PSS.

You could potentially use the memory resource controller for
isolation, but that is a way of working around and parititioning the
system, there are other workarounds like oom-adjust, etc.

I suspect the correct answer would depend on our answers to 1 and 2
and a lot of testing with any changes made.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03  7:50 ` KOSAKI Motohiro
@ 2010-02-03  9:40   ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-03  9:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, Balbir Singh,
	Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, KOSAKI Motohiro wrote:

> Personally, I think your use case represent to typical desktop and Linux
> have to works fine on typical desktop use-case. /proc/pid/oom_adj never fit
> desktop use-case. In past discussion, I'v agreed with much people. but I haven't
> reach to agree with David Rientjes about this topic.
> 

Which point don't you agree with?  I've agreed that heuristic needs to be 
changed and since Kame has decided to abandon his oom killer work, I said 
that I would find time to develop a solution that would be based on 
consensus.  I don't think that simply replacing the baseline with rss, 
rendering oom_adj practically useless for any other purpose other than 
polarizing priorities, and removing any penalty for tasks that fork an 
egregious amount of tasks is acceptable to all parties, though.

When a desktop system runs a vital task that, at all costs, cannot 
possibly be oom killed such as KDE from the user perspective, is it really 
that outrageous of a request to set it to OOM_DISABLE?  No, it's not.  
There are plenty of open source examples of applications that tune their 
own oom_adj values for that reason; userspace input into the oom killer's 
heuristic will always be an integral part of its function.

I believe that we can reach consensus without losing the existing 
functionality that oom_adj provides, namely defining vital system tasks 
and memory leakers, and without this all or nothing type attitude that 
insists we either go with rss as a baseline because "it doesn't select X 
first in my particular example" or you'll just take your ball and go home.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03  8:57 ` Balbir Singh
@ 2010-02-03 12:10   ` Lubos Lunak
  2010-02-03 12:25     ` Balbir Singh
  0 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-03 12:10 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 03 of February 2010, Balbir Singh wrote:
> * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]:
> >  In other words, use VmRSS for measuring memory usage instead of VmSize,
> > and remove child accumulating.
>
> I am not sure of the impact of changing to RSS, although I've
> personally believed that RSS based accounting is where we should go,
> but we need to consider the following
>
> 1. Total VM provides data about potentially swapped pages,

 Yes, I've already updated my proposal in another mail to switch from VmSize 
to VmRSS+InSwap. I don't know how to find out the second item in code, but at 
this point of discussion that's just details.

> overcommit, 

 I don't understand how this matters. Overcommit is memory for which address 
space has been allocated but not actual memory, right? Then that's exactly 
what I'm claiming is wrong and am trying to reverse. Currently OOM killer 
takes this into account because it uses VmSize, but IMO it shouldn't - if a 
process does malloc(400M) but then it uses only a tiny fraction of that, in 
the case of memory shortage killing that process does not solve anything in 
practice.

> etc.
> 2. RSS alone is not sufficient, RSS does not account for shared pages,
> so we ideally need something like PSS.

 Just to make sure I understand what you mean with "RSS does not account for 
shared pages" - you say that if a page is shared by 4 processes, then when 
calculating badness for them, only 1/4 of the page should be counted for 
each? Yes, I suppose so, that makes sense. That's more like fine-tunning at 
this point though, as long as there's no agreement that moving away from 
VmSize is an improvement.

> I suspect the correct answer would depend on our answers to 1 and 2
> and a lot of testing with any changes made.

 Testing - are there actually any tests for it, or do people just test random 
scenarios when they do changes? Also, I'm curious, what areas is the OOM 
killer actually generally known to work well in? I somehow get the feeling 
from the discussion here that people just tweak oom_adj until it works for 
them.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 12:10   ` Lubos Lunak
@ 2010-02-03 12:25     ` Balbir Singh
  2010-02-03 15:00       ` Minchan Kim
  2010-02-03 21:22       ` Lubos Lunak
  0 siblings, 2 replies; 53+ messages in thread
From: Balbir Singh @ 2010-02-03 12:25 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

* Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]:

> On Wednesday 03 of February 2010, Balbir Singh wrote:
> > * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]:
> > >  In other words, use VmRSS for measuring memory usage instead of VmSize,
> > > and remove child accumulating.
> >
> > I am not sure of the impact of changing to RSS, although I've
> > personally believed that RSS based accounting is where we should go,
> > but we need to consider the following
> >
> > 1. Total VM provides data about potentially swapped pages,
> 
>  Yes, I've already updated my proposal in another mail to switch from VmSize 
> to VmRSS+InSwap. I don't know how to find out the second item in code, but at 
> this point of discussion that's just details.
> 

I am yet to catch up with the rest of the thread. Thanks for heads up.

> > overcommit, 
> 
>  I don't understand how this matters. Overcommit is memory for which address 
> space has been allocated but not actual memory, right? Then that's exactly 
> what I'm claiming is wrong and am trying to reverse. Currently OOM killer 
> takes this into account because it uses VmSize, but IMO it shouldn't - if a 
> process does malloc(400M) but then it uses only a tiny fraction of that, in 
> the case of memory shortage killing that process does not solve anything in 
> practice.

We have a way of tracking commmitted address space, which is more
sensible than just allocating memory and is used for tracking
overcommit. I was suggesting that, that might be a better approach.

> 
> > etc.
> > 2. RSS alone is not sufficient, RSS does not account for shared pages,
> > so we ideally need something like PSS.
> 
>  Just to make sure I understand what you mean with "RSS does not account for 
> shared pages" - you say that if a page is shared by 4 processes, then when 
> calculating badness for them, only 1/4 of the page should be counted for 
> each? Yes, I suppose so, that makes sense.

Yes, that is what I am speaking of

> That's more like fine-tunning at 
> this point though, as long as there's no agreement that moving away from 
> VmSize is an improvement.
>

There is no easy way to calculate the Pss today without walking the
page tables, but some simplification there will make it a better and a
more accurate metric.
 
> > I suspect the correct answer would depend on our answers to 1 and 2
> > and a lot of testing with any changes made.
> 
>  Testing - are there actually any tests for it, or do people just test random 
> scenarios when they do changes? Also, I'm curious, what areas is the OOM 
> killer actually generally known to work well in? I somehow get the feeling 
> from the discussion here that people just tweak oom_adj until it works for 
> them.
>

I've mostly found OOM killer to work well for me, but looking at the
design and our discussions I know there need to be certain improvements. 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-01 22:02 Improving OOM killer Lubos Lunak
                   ` (2 preceding siblings ...)
  2010-02-03  8:57 ` Balbir Singh
@ 2010-02-03 14:49 ` Rik van Riel
  2010-02-03 17:01   ` Balbir Singh
  3 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2010-02-03 14:49 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina

On 02/01/2010 05:02 PM, Lubos Lunak wrote:

>   In other words, use VmRSS for measuring memory usage instead of VmSize, and
> remove child accumulating.

I agree with removing the child accumulating code.  That code can
do a lot of harm with databases like postgresql, or cause the
system's main service (eg. httpd) to be killed when a broken cgi
script used up too much memory.

IIRC the child accumulating code was introduced to deal with
malicious code (fork bombs), but it makes things worse for the
(much more common) situation of a system without malicious
code simply running out of memory due to being very busy.

I have no strong opinion on using RSS vs VmSize.

-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 12:25     ` Balbir Singh
@ 2010-02-03 15:00       ` Minchan Kim
  2010-02-03 16:06         ` Minchan Kim
  2010-02-03 21:22       ` Lubos Lunak
  1 sibling, 1 reply; 53+ messages in thread
From: Minchan Kim @ 2010-02-03 15:00 UTC (permalink / raw)
  To: balbir
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 2010-02-03 at 17:55 +0530, Balbir Singh wrote:
> * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]:
> 
> > On Wednesday 03 of February 2010, Balbir Singh wrote:
> > > * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]:
> > > >  In other words, use VmRSS for measuring memory usage instead of VmSize,
> > > > and remove child accumulating.
> > >
> > > I am not sure of the impact of changing to RSS, although I've
> > > personally believed that RSS based accounting is where we should go,
> > > but we need to consider the following
> > >
> > > 1. Total VM provides data about potentially swapped pages,
> > 
> >  Yes, I've already updated my proposal in another mail to switch from VmSize 
> > to VmRSS+InSwap. I don't know how to find out the second item in code, but at 
> > this point of discussion that's just details.
> > 

We have swap count with mm-count-swap-usage.patch by Kame in mmtom.

> I am yet to catch up with the rest of the thread. Thanks for heads up.
> 
> > > overcommit, 
> > 
> >  I don't understand how this matters. Overcommit is memory for which address 
> > space has been allocated but not actual memory, right? Then that's exactly 
> > what I'm claiming is wrong and am trying to reverse. Currently OOM killer 
> > takes this into account because it uses VmSize, but IMO it shouldn't - if a 
> > process does malloc(400M) but then it uses only a tiny fraction of that, in 
> > the case of memory shortage killing that process does not solve anything in 
> > practice.
> 
> We have a way of tracking commmitted address space, which is more
> sensible than just allocating memory and is used for tracking
> overcommit. I was suggesting that, that might be a better approach.

Yes. It does make sense. At least total_vm doesn't care about
MAP_NORESERVE case. But unfortunately, it's a per CPU not per Process.

> 
> > 
> > > etc.
> > > 2. RSS alone is not sufficient, RSS does not account for shared pages,
> > > so we ideally need something like PSS.
> > 
> >  Just to make sure I understand what you mean with "RSS does not account for 
> > shared pages" - you say that if a page is shared by 4 processes, then when 
> > calculating badness for them, only 1/4 of the page should be counted for 
> > each? Yes, I suppose so, that makes sense.
> 
> Yes, that is what I am speaking of

I agree. If we want to make RSS with base of badness, it's one of things
we have to solve.


-- 
Kind regards,
Minchan Kim


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 15:00       ` Minchan Kim
@ 2010-02-03 16:06         ` Minchan Kim
  0 siblings, 0 replies; 53+ messages in thread
From: Minchan Kim @ 2010-02-03 16:06 UTC (permalink / raw)
  To: balbir
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 2010-02-04 at 00:00 +0900, Minchan Kim wrote:
> On Wed, 2010-02-03 at 17:55 +0530, Balbir Singh wrote:
> > * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]:
> >> >  I don't understand how this matters. Overcommit is memory for which address 
> > > space has been allocated but not actual memory, right? Then that's exactly 
> > > what I'm claiming is wrong and am trying to reverse. Currently OOM killer 
> > > takes this into account because it uses VmSize, but IMO it shouldn't - if a 
> > > process does malloc(400M) but then it uses only a tiny fraction of that, in 
> > > the case of memory shortage killing that process does not solve anything in 
> > > practice.
> > 
> > We have a way of tracking commmitted address space, which is more
> > sensible than just allocating memory and is used for tracking
> > overcommit. I was suggesting that, that might be a better approach.
> 
> Yes. It does make sense. At least total_vm doesn't care about
> MAP_NORESERVE case. But unfortunately, it's a per CPU not per Process.

Sorry for confusing. It was opposite. I slept :)
The commited as doesn't care about MAP_NORESERVE case. 
But it definitely charges memory. so I think total_vm is better than
committed as if we really have to use vmsize heuristic continuously.

But I am not sure that i understand your point about overcommit policy.


-- 
Kind regards,
Minchan Kim


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 14:49 ` Rik van Riel
@ 2010-02-03 17:01   ` Balbir Singh
  2010-02-03 18:58     ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2010-02-03 17:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

* Rik van Riel <riel@redhat.com> [2010-02-03 09:49:18]:

> On 02/01/2010 05:02 PM, Lubos Lunak wrote:
> 
> >  In other words, use VmRSS for measuring memory usage instead of VmSize, and
> >remove child accumulating.
> 
> I agree with removing the child accumulating code.  That code can
> do a lot of harm with databases like postgresql, or cause the
> system's main service (eg. httpd) to be killed when a broken cgi
> script used up too much memory.
>
> IIRC the child accumulating code was introduced to deal with
> malicious code (fork bombs), but it makes things worse for the
> (much more common) situation of a system without malicious
> code simply running out of memory due to being very busy.
>

For fork bombs, we could do a number of children number test and have
a threshold before we consider a process and its children for
badness().

> I have no strong opinion on using RSS vs VmSize.
> 

David commented and feels strongly about RSS and prefers VmSize.

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 17:01   ` Balbir Singh
@ 2010-02-03 18:58     ` David Rientjes
  2010-02-03 19:29       ` Frans Pop
  2010-02-03 22:55       ` Lubos Lunak
  0 siblings, 2 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-03 18:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Rik van Riel, Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, Balbir Singh wrote:

> > IIRC the child accumulating code was introduced to deal with
> > malicious code (fork bombs), but it makes things worse for the
> > (much more common) situation of a system without malicious
> > code simply running out of memory due to being very busy.
> >
> 
> For fork bombs, we could do a number of children number test and have
> a threshold before we consider a process and its children for
> badness().
> 

Yes, we could look for the number of children with seperate mm's and then 
penalize those threads that have forked an egregious amount, say, 500 
tasks.  I think we should check for this threshold within the badness() 
heuristic to identify such forkbombs and not limit it only to certain 
applications.  

My rewrite for the badness() heuristic is centered on the idea that scores 
should range from 0 to 1000, 0 meaning "never kill this task" and 1000 
meaning "kill this task first."  The baseline for a thread, p, may be 
something like this:

	unsigned int badness(struct task_struct *p,
					unsigned long totalram)
	{
		struct task_struct *child;
		struct mm_struct *mm;
		int forkcount = 0;
		long points;

		task_lock(p);
		mm = p->mm;
		if (!mm) {
			task_unlock(p);
			return 0;
		}
		points = (get_mm_rss(mm) +
				get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
				totalram;
		task_unlock(p);

		list_for_each_entry(child, &p->children, sibling)
			/* No lock, child->mm won't be dereferenced */
			if (child->mm && child->mm != mm)
				forkcount++;

		/* Forkbombs get penalized 10% of available RAM */
		if (forkcount > 500)
			points += 100;

		...

		/*
		 * /proc/pid/oom_adj ranges from -1000 to +1000 to either
		 * completely disable oom killing or always prefer it.
		 */
		points += p->signal->oom_adj;

		if (points < 0)
			return 0;
		return (points <= 1000) ? points : 1000;
	}

	static struct task_struct *select_bad_process(...,
						nodemask_t *nodemask)
	{
		struct task_struct *p;
		unsigned long totalram = 0;
		int nid;

		for_each_node_mask(nid, nodemask)
			totalram += NODE_DATA(nid)->node_present_pages;

		for_each_process(p) {
			unsigned int points;

			...

			if (!nodes_intersects(p->mems_allowed, nodemasks))
				continue;

			...
			points = badness(p, totalram);
			...
		}
		...
	}

In this example, /proc/pid/oom_adj now ranges from -1000 to +1000, with 
OOM_DISABLE being -1000, to polarize tasks for oom killing or determine 
when a task is leaking memory because it is using far more memory than it 
should.  The nodemask passed from the page allocator should be intersected 
with current->mems_allowed within the oom killer; userspace is then fully 
aware of what value is an egregious amount of RAM for a task to consume, 
including information it knows about the task's cpuset or mempolicy.  For 
example, it would be very simple for a user to set an oom_adj of -500, 
which means "we discount 50% of the task's allowed memory from being 
considered in the heuristic" or +500, which means "we always allow all 
other system/cpuset/mempolicy tasks to use at least 50% more allowed 
memory than this one."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 18:58     ` David Rientjes
@ 2010-02-03 19:29       ` Frans Pop
  2010-02-03 19:52         ` David Rientjes
  2010-02-03 22:55       ` Lubos Lunak
  1 sibling, 1 reply; 53+ messages in thread
From: Frans Pop @ 2010-02-03 19:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: balbir, riel, l.lunak, linux-mm, linux-kernel, akpm,
	kosaki.motohiro, npiggin, jkosina

David Rientjes wrote:
> /*
> * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> * completely disable oom killing or always prefer it.
> */
> points += p->signal->oom_adj;
> 

Wouldn't that cause a rather huge compatibility issue given that the 
current oom_adj works in a totally different way:

! 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
! ------------------------------------------------------
! This file can be used to adjust the score used to select which processes
! should be killed in an  out-of-memory  situation.  Giving it a high score
! will increase the likelihood of this process being killed by the
! oom-killer.  Valid values are in the range -16 to +15, plus the special
! value -17, which disables oom-killing altogether for this process.

?

Cheers,
FJP

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 19:29       ` Frans Pop
@ 2010-02-03 19:52         ` David Rientjes
  2010-02-03 20:12           ` Frans Pop
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-03 19:52 UTC (permalink / raw)
  To: Frans Pop
  Cc: Balbir Singh, Rik van Riel, l.lunak, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, jkosina, linux-kernel, linux-mm

On Wed, 3 Feb 2010, Frans Pop wrote:

> > * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> > * completely disable oom killing or always prefer it.
> > */
> > points += p->signal->oom_adj;
> > 
> 
> Wouldn't that cause a rather huge compatibility issue given that the 
> current oom_adj works in a totally different way:
> 
> ! 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
> ! ------------------------------------------------------
> ! This file can be used to adjust the score used to select which processes
> ! should be killed in an  out-of-memory  situation.  Giving it a high score
> ! will increase the likelihood of this process being killed by the
> ! oom-killer.  Valid values are in the range -16 to +15, plus the special
> ! value -17, which disables oom-killing altogether for this process.
> 
> ?
> 

I thought about whether we'd need an additional, complementary tunable 
such as /proc/pid/oom_bias that would effect this new memory-charging bias 
in the heuristic.  It could be implemented so that writing to oom_adj 
would clear oom_bias and vice versa.

Although that would certainly be possible, I didn't propose it for a 
couple of reasons:

 - it would clutter the space to have two seperate tunables when the 
   metrics that /proc/pid/oom_adj uses has become obsolete by the new
   baseline as a fraction of total RAM, and

 - we have always exported OOM_DISABLE, OOM_ADJUST_MIN, and OOM_ADJUST_MAX
   via include/oom.h so that userspace should use them sanely.  Setting
   a particular oom_adj value for anything other than OOM_DISABLE means 
   the score will be relative to other system tasks, so its a value that 
   is typically calibrated at runtime rather than static, hardcoded 
   values.

We could reuse /proc/pid/oom_adj for the new heuristic by severely 
reducing its granularity than it otherwise would by doing
(oom_adj * 1000 / OOM_ADJUST_MAX), but that will eventually become 
annoying and much more difficult to document.

Given your citation, I don't think we've ever described /proc/pid/oom_adj 
outside of the implementation as a bitshift, either.  So its use right now 
for anything other than OOM_DISABLE is probably based on scalar thinking.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 19:52         ` David Rientjes
@ 2010-02-03 20:12           ` Frans Pop
  2010-02-03 20:26             ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Frans Pop @ 2010-02-03 20:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Rik van Riel, l.lunak, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, jkosina, linux-kernel, linux-mm

On Wednesday 03 February 2010, David Rientjes wrote:
>  - we have always exported OOM_DISABLE, OOM_ADJUST_MIN, and
> OOM_ADJUST_MAX via include/oom.h so that userspace should use them
> sanely.  Setting a particular oom_adj value for anything other than
> OOM_DISABLE means the score will be relative to other system tasks, so
> its a value that is typically calibrated at runtime rather than static,
> hardcoded values.

That doesn't take into account:
- applications where the oom_adj value is hardcoded to a specific value
  (for whatever reason)
- sysadmin scripts that set oom_adj from the console

I would think that oom_adj is a documented part of the userspace ABI and 
that the change you propose does not fit the normal backwards 
compatibility requirements for exposed tunables.

I think that at least any user who's currently setting oom_adj to -17 has a 
right to expect that to continue to mean "oom killer disabled". And for 
any other value they should get a similar impact to the current impact, 
and not one that's reduced by a factor 66.

> We could reuse /proc/pid/oom_adj for the new heuristic by severely
> reducing its granularity than it otherwise would by doing
> (oom_adj * 1000 / OOM_ADJUST_MAX), but that will eventually become
> annoying and much more difficult to document.

Probably quite true, but maybe unavoidable if one accepts the above.

But I'll readily admit I'm not the final authority on this.

Cheers,
FJP

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 20:12           ` Frans Pop
@ 2010-02-03 20:26             ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-03 20:26 UTC (permalink / raw)
  To: Frans Pop
  Cc: Balbir Singh, Rik van Riel, l.lunak, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, jkosina, linux-kernel, linux-mm

On Wed, 3 Feb 2010, Frans Pop wrote:

> That doesn't take into account:
> - applications where the oom_adj value is hardcoded to a specific value
>   (for whatever reason)
> - sysadmin scripts that set oom_adj from the console
> 

The fundamentals are the same: negative values mean the task is less 
likely to be preferred and positive values mean the task is more likely, 
only the scale is different.  That scale is exported by the kernel via 
OOM_ADJUST_MIN and OOM_ADJUST_MAX and has been since 2006.  I don't think 
we need to preserve legacy applications or scripts that use hardcoded 
values without importing linux/oom.h.

> I would think that oom_adj is a documented part of the userspace ABI and 
> that the change you propose does not fit the normal backwards 
> compatibility requirements for exposed tunables.
> 

The range is documented (but it should have been documented as being from 
OOM_ADJUST_MIN to OOM_ADJUST_MAX) but its implementation as a bitshift is 
not; it simply says that positive values mean the task is more preferred 
and negative values mean it is less preferred.  Those semantics are 
preserved.

> I think that at least any user who's currently setting oom_adj to -17 has a 
> right to expect that to continue to mean "oom killer disabled". And for 
> any other value they should get a similar impact to the current impact, 
> and not one that's reduced by a factor 66.
> 

If the baseline changes as we all agree it needs to such that oom_adj no 
longer represents the same thing it did in the first place (it would 
become a linear bias), I think this breakage is actually beneficial.  
Users will now be able to tune their oom_adj values based on a fraction of 
system memory to bias their applications either preferrably or otherwise.

I think we should look at Linux over the next couple of years and decide 
if we want to be married to the current semantics of oom_adj that are 
going to change (as it would require being a factor of 66, as you 
mentioned) when the implementation it was designed for has vanished.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 12:25     ` Balbir Singh
  2010-02-03 15:00       ` Minchan Kim
@ 2010-02-03 21:22       ` Lubos Lunak
  1 sibling, 0 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-03 21:22 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 03 of February 2010, Balbir Singh wrote:
> * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]:
> > On Wednesday 03 of February 2010, Balbir Singh wrote:
> > > 2. RSS alone is not sufficient, RSS does not account for shared pages,
> > > so we ideally need something like PSS.
> >
> >  Just to make sure I understand what you mean with "RSS does not account
> > for shared pages" - you say that if a page is shared by 4 processes, then
> > when calculating badness for them, only 1/4 of the page should be counted
> > for each? Yes, I suppose so, that makes sense.
>
> Yes, that is what I am speaking of
>
> > That's more like fine-tunning at
> > this point though, as long as there's no agreement that moving away from
> > VmSize is an improvement.
>
> There is no easy way to calculate the Pss today without walking the
> page tables, but some simplification there will make it a better and a
> more accurate metric.

 OOM should be a rare situation, so doing a little amount of counting 
shouldn't be a big deal. Especially if the machine is otherwise busy waiting 
for the HDD paging stuff out and in again and has plenty of CPU time to 
waste.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03  1:41     ` David Rientjes
  2010-02-03  1:52       ` KAMEZAWA Hiroyuki
@ 2010-02-03 22:54       ` Lubos Lunak
  2010-02-04  0:00         ` David Rientjes
  1 sibling, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-03 22:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro,
	Balbir Singh, Nick Piggin, Jiri Kosina, KAMEZAWA Hiroyuki


 Given that the badness() proposal I see in your another mail uses 
get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize 
vs VmRSS argument and considered that argument irrelevant now. I will comment 
only on the suggested use of oom_adj on the desktop here. And actually I hope 
that if something reasonably similar to your badness() proposal replaces the 
current one it will make any use of oom_adj not needed on the desktop in the 
usual case, so this may be irrelevant as well.

On Wednesday 03 of February 2010, David Rientjes wrote:
> On Tue, 2 Feb 2010, Lubos Lunak wrote:
> >  Not that it really matters - the net result is that OOM killer usually
> > decides to kill kdeinit or ksmserver, starts killing their children,
> > vital KDE processes, and since the offenders are not among them, it ends
> > up either terminating the whole session by killing ksmserver or killing
> > enough vital processes there to free enough memory for the offenders to
> > finish their work cleanly.
>
> The kernel cannot possibly know what you consider a "vital" process, for
> that understanding you need to tell it using the very powerful
> /proc/pid/oom_adj tunable.  I suspect if you were to product all of
> kdeinit's children by patching it to be OOM_DISABLE so that all threads it
> forks will inherit that value you'd actually see much improved behavior.

 No. Almost everything in KDE is spawned by kdeinit, so everything would get 
the adjustment, which means nothing would in practice get the adjustment.

> I'd also encourage you to talk to the KDE developers to ensure that proper
> precautions are taken to protect it in such conditions since people who
> use such desktop environments typically don't want them to be sacrificed
> for memory.

 I am a KDE developer, it's written in my signature. And I've already talked 
enough to the KDE developer who has done the oom_adj code that's already 
there, as that's also me. I don't know kernel internals, but that doesn't 
mean I'm completely clueless about the topic of the discussion I've started.

> >  Worse, it worked for about a year or two and now it has only shifted the
> > problem elsewhere and that's it. We now protect kdeinit, which means the
> > OOM killer's choice will very likely ksmserver then. Ok, so let's say now
> > we start protecting also ksmserver, that's some additional hassle setting
> > it up, but that's doable. Now there's a good chance the OOM killer's
> > choice will be kwin (as a compositing manager it can have quite large
> > mappings because of graphics drivers). So ok, we need to protect the
> > window manager, but since that's not a hardcoded component like
> > ksmserver, that's even more hassle.
>
> No, you don't need to protect every KDE process from the oom killer unless
> it is going to be a contender for selection.  You could certainly do so
> for completeness, but it shouldn't be required unless the nature of the
> thread demands it such that it forks many vital tasks (kdeinit) or its
> first-generation children's memory consumption can't be known either
> because it depends on how many children it can fork or their memory
> consumption is influenced by the user's work.

 1) I think you missed that I said that every KDE application with the current 
algorithm can be potentially a contender for selection, and I provided 
numbers to demonstrate that in a selected case. Just because such application 
is not vital does not mean it's good for it to get killed instead of an 
obvious offender.

 2) You probably do not realize the complexity involved in using oom_adj in a 
desktop. Even when doing that manually I would have some difficulty finding 
the right setup for my own desktop use. It'd be probably virtually impossible 
to write code that would do it at least somewhat right with all the widely 
differing various desktop setups that dynamically change.

 3) oom_adj is ultimately just a kludge to handle special cases where the 
heuristic doesn't get it right for whatever strange reason. But even you 
yourself in another mail presented a heuristic that I believe would make any 
use of oom_adj on the desktop unnecessary in the usual cases. The usual 
desktop is not a special case.

> The heuristics are always well debated in this forum and there's little
> chance that we'll ever settle on a single formula that works for all
> possible use cases.  That makes oom_adj even more vital to the overall
> efficiency of the oom killer, I really hope you start to use it to your
> advantage.

 I really hope your latest badness() heuristics proposal allows us to dump 
even the oom_adj use we already have.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 18:58     ` David Rientjes
  2010-02-03 19:29       ` Frans Pop
@ 2010-02-03 22:55       ` Lubos Lunak
  2010-02-04  0:05         ` David Rientjes
  1 sibling, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-03 22:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 03 of February 2010, David Rientjes wrote:
> My rewrite for the badness() heuristic is centered on the idea that scores
> should range from 0 to 1000, 0 meaning "never kill this task" and 1000
> meaning "kill this task first."  The baseline for a thread, p, may be
> something like this:
>
> 	unsigned int badness(struct task_struct *p,
> 					unsigned long totalram)
> 	{
> 		struct task_struct *child;
> 		struct mm_struct *mm;
> 		int forkcount = 0;
> 		long points;
>
> 		task_lock(p);
> 		mm = p->mm;
> 		if (!mm) {
> 			task_unlock(p);
> 			return 0;
> 		}
> 		points = (get_mm_rss(mm) +
> 				get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> 				totalram;
> 		task_unlock(p);
>
> 		list_for_each_entry(child, &p->children, sibling)
> 			/* No lock, child->mm won't be dereferenced */
> 			if (child->mm && child->mm != mm)
> 				forkcount++;
>
> 		/* Forkbombs get penalized 10% of available RAM */
> 		if (forkcount > 500)
> 			points += 100;

 As far as I'm concerned, this is a huge improvement over the current code 
(and, incidentally :), quite close to what I originally wanted). I'd be 
willing to test it in few real-world desktop cases if you provide a patch.

> 		/*
> 		 * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> 		 * completely disable oom killing or always prefer it.
> 		 */
> 		points += p->signal->oom_adj;

 This changes semantics of oom_adj, but given that I expect the above to make 
oom_adj unnecessary on the desktop for the normal cases, I don't really mind.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 22:54       ` Improving OOM killer Lubos Lunak
@ 2010-02-04  0:00         ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-04  0:00 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro,
	Balbir Singh, Nick Piggin, Jiri Kosina, KAMEZAWA Hiroyuki

On Wed, 3 Feb 2010, Lubos Lunak wrote:

> 
>  Given that the badness() proposal I see in your another mail uses 
> get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize 
> vs VmRSS argument and considered that argument irrelevant now.

The argument was never to never factor rss into the heuristic, the 
argument was to prevent the loss of functionality of oom_adj and being 
able to define memory leakers from userspace.  With my proposal, I believe 
the new semantics of oom_adj are even clearer than before and allow users 
to either discount or bias a task with a quantity that they are familiar 
with: memory.

My rough draft was written in a mail editor, so it's completely untested 
and even has a couple of flaws: we need to discount free hugetlb memory 
from allowed nodes, we need to intersect the passed nodemask with 
current's cpuset, etc.

> I will comment 
> only on the suggested use of oom_adj on the desktop here. And actually I hope 
> that if something reasonably similar to your badness() proposal replaces the 
> current one it will make any use of oom_adj not needed on the desktop in the 
> usual case, so this may be irrelevant as well.
> 

If you define "on the desktop" performance of the oom killer merely as 
protecting a windows environment, then it should be helpful.  I'd still 
recommend using OOM_DISABLE for those tasks, though, because I agree that 
for users in that environment, KDE getting oom killed is just not a viable 
solution.

> > The kernel cannot possibly know what you consider a "vital" process, for
> > that understanding you need to tell it using the very powerful
> > /proc/pid/oom_adj tunable.  I suspect if you were to product all of
> > kdeinit's children by patching it to be OOM_DISABLE so that all threads it
> > forks will inherit that value you'd actually see much improved behavior.
> 
>  No. Almost everything in KDE is spawned by kdeinit, so everything would get 
> the adjustment, which means nothing would in practice get the adjustment.
> 

It depends on whether you change the oom_adj of children that you no 
longer want to protect which have been forked from kdeinit.

> > I'd also encourage you to talk to the KDE developers to ensure that proper
> > precautions are taken to protect it in such conditions since people who
> > use such desktop environments typically don't want them to be sacrificed
> > for memory.
> 
>  I am a KDE developer, it's written in my signature. And I've already talked 
> enough to the KDE developer who has done the oom_adj code that's already 
> there, as that's also me. I don't know kernel internals, but that doesn't 
> mean I'm completely clueless about the topic of the discussion I've started.
> 

Then I'd recommend that you protect those tasks with OOM_DISABLE, 
otherwise they will always be candidates for oom kill; the only way to 
explicitly prevent that is by changing oom_adj or moving it to its own 
memory controller cgroup.  A kernel oom heursitic that is implemented for 
a wide variety of platforms, including desktops, servers, and embedded 
devices, will never identify KDE as a vital task that cannot possibly be 
killed unless you tell the kernel it has that priority.  Whether you 
choose to use that power or not is up to the KDE team.

>  1) I think you missed that I said that every KDE application with the current 
> algorithm can be potentially a contender for selection, and I provided 
> numbers to demonstrate that in a selected case. Just because such application 
> is not vital does not mean it's good for it to get killed instead of an 
> obvious offender.
> 

This is exaggerating the point quite a bit, I don't think every single KDE 
thread is going to have a badness() score that is higher than all other 
system tasks all the time.  I think that there are the likely candidates 
that you've identified (kdeinit, ksmserver, etc) that are much more prone 
to high badness() scores given their total_vm size and the number of 
children they fork, but I don't think this is representative of every KDE 
thread.

>  2) You probably do not realize the complexity involved in using oom_adj in a 
> desktop. Even when doing that manually I would have some difficulty finding 
> the right setup for my own desktop use. It'd be probably virtually impossible 
> to write code that would do it at least somewhat right with all the widely 
> differing various desktop setups that dynamically change.
> 

Used in combination with /proc/pid/oom_score, it gives you a pretty good 
snapshot of how oom killer priorities look at any moment in time.  In your 
particular use case, however, you seem to be arguing from a perspective of 
only protecting certain tasks that you've identified from being oom killed 
for desktop environments, namely KDE.  For that, there is no confusion to 
be had: use OOM_DISABLE.  For server environments that I'm also concerned 
about, the oom_adj range is much more important to define a killing 
priority when used in combination with cpusets.

>  3) oom_adj is ultimately just a kludge to handle special cases where the 
> heuristic doesn't get it right for whatever strange reason. But even you 
> yourself in another mail presented a heuristic that I believe would make any 
> use of oom_adj on the desktop unnecessary in the usual cases. The usual 
> desktop is not a special case.
> 

The kernel will _always_ need user input into which tasks it believes to 
be vital.  For you, that's KDE.  For me, that's one of our job schedulers.  

> > The heuristics are always well debated in this forum and there's little
> > chance that we'll ever settle on a single formula that works for all
> > possible use cases.  That makes oom_adj even more vital to the overall
> > efficiency of the oom killer, I really hope you start to use it to your
> > advantage.
> 
>  I really hope your latest badness() heuristics proposal allows us to dump 
> even the oom_adj use we already have.
> 

For your environment, I hope the same.  In production servers we'll still 
need the ability to tune /proc/pid/oom_adj to define memory leakers and 
tasks using far more memory than expected, so perhaps my rough draft can 
be a launching pad into a positive discussion about the future of the 
heuristic based on consensus and input from all impacted parties.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-03 22:55       ` Lubos Lunak
@ 2010-02-04  0:05         ` David Rientjes
  2010-02-04  0:18           ` Rik van Riel
                             ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-04  0:05 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, Lubos Lunak wrote:

> > 	unsigned int badness(struct task_struct *p,
> > 					unsigned long totalram)
> > 	{
> > 		struct task_struct *child;
> > 		struct mm_struct *mm;
> > 		int forkcount = 0;
> > 		long points;
> >
> > 		task_lock(p);
> > 		mm = p->mm;
> > 		if (!mm) {
> > 			task_unlock(p);
> > 			return 0;
> > 		}
> > 		points = (get_mm_rss(mm) +
> > 				get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> > 				totalram;
> > 		task_unlock(p);
> >
> > 		list_for_each_entry(child, &p->children, sibling)
> > 			/* No lock, child->mm won't be dereferenced */
> > 			if (child->mm && child->mm != mm)
> > 				forkcount++;
> >
> > 		/* Forkbombs get penalized 10% of available RAM */
> > 		if (forkcount > 500)
> > 			points += 100;
> 
>  As far as I'm concerned, this is a huge improvement over the current code 
> (and, incidentally :), quite close to what I originally wanted). I'd be 
> willing to test it in few real-world desktop cases if you provide a patch.
> 

There're some things that still need to be worked out, like discounting 
hugetlb pages on each allowed node, respecting current's cpuset mems, 
etc., but I think it gives us a good rough draft of where we might end up.  
I did use the get_mm_rss() that you suggested, but I think it's more 
helpful in the context of a fraction of total memory allowed so the other 
heursitics (forkbomb, root tasks, nice'd tasks, etc) are penalizing the 
points in a known quantity rather than a manipulation of that baseline.

Do you have any comments about the forkbomb detector or its threshold that 
I've put in my heuristic?  I think detecting these scenarios is still an 
important issue that we need to address instead of simply removing it from 
consideration entirely.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04  0:05         ` David Rientjes
@ 2010-02-04  0:18           ` Rik van Riel
  2010-02-04 21:48             ` David Rientjes
  2010-02-04  7:58           ` Lubos Lunak
  2010-02-04  9:50           ` Jiri Kosina
  2 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2010-02-04  0:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On 02/03/2010 07:05 PM, David Rientjes wrote:
> On Wed, 3 Feb 2010, Lubos Lunak wrote:

>>> 		/* Forkbombs get penalized 10% of available RAM */
>>> 		if (forkcount>  500)
>>> 			points += 100;

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic?  I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

I believe that malicious users are best addressed in person,
or preemptively through cgroups and rlimits.

Having a process with over 500 children is quite possible
with things like apache, Oracle, postgres and other forking
daemons.

Killing the parent process can result in the service
becoming unavailable, and in some cases even data
corruption.

-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04  0:05         ` David Rientjes
  2010-02-04  0:18           ` Rik van Riel
@ 2010-02-04  7:58           ` Lubos Lunak
  2010-02-04 21:34             ` David Rientjes
  2010-02-04  9:50           ` Jiri Kosina
  2 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-04  7:58 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thursday 04 of February 2010, David Rientjes wrote:
> On Wed, 3 Feb 2010, Lubos Lunak wrote:
> >  As far as I'm concerned, this is a huge improvement over the current
> > code (and, incidentally :), quite close to what I originally wanted). I'd
> > be willing to test it in few real-world desktop cases if you provide a
> > patch.
>
> There're some things that still need to be worked out,

 Ok. Just please do not let the perfect stand in the way of the good for way 
too long.

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic?  I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

 I think before finding out the answer it should be first figured out what the 
question is :). Besides the vague "forkbomb" description I still don't know 
what realistic scenarios this is supposed to handle. IMO trying to cover 
intentional abuse is a lost fight, so I think the purpose of this code should 
be just to handle cases when there's a mistake leading to relatively fast 
spawning of children of a specific parent that'll lead to OOM. The shape of 
the children subtree doesn't matter, it can be either a parent with many 
direct children, or children being created recursively, I think any case is 
possible here. A realistic example would be e.g. by mistake 
typing 'make -j22' instead of 'make -j2' and overloading the machine by too 
many g++ instances. That would be actually a non-trivial tree of children, 
with recursive make and sh processes in it.

 A good way to detect this would be checking in badness() if the process has 
any children with relatively low CPU and real time values (let's say 
something less than a minute). If yes, the badness score should also account 
for all these children, recursively. I'm not sure about the exact formula, 
just summing up the memory usage like it is done now does not fit your 0-1000 
score idea, and it's also wrong because it doesn't take sharing of memory 
into consideration (e.g. a KDE app with several kdelibs-based children could 
achieve a massive score here because of extensive sharing, even though the 
actual memory usage increase caused by them could be insignificant). I don't 
know kernel internals, so I don't know how feasible it would be, but one 
naive idea would be to simply count how big portion of the total memory all 
these considered processes occupy.

 This indeed would not handle the case when a tree of processes would slowly 
leak, for example there being a bug in Apache and all the forked children of 
the master process leaking memory equally, but none of the single children 
leaking enough to score more than a single unrelated innocent process. Here I 
question how realistic such scenario actually would be, and mainly the actual 
possibility of detecting such case. I do not see how code could distinguish 
this from the case of using Konsole or XTerm to launch a number of unrelated 
KDE/X applications each of which would occupy a considerable amount of 
memory. Here clearly killing the Konsole/XTerm and all the spawned 
applications with it is incorrect, so with no obvious offender the OOM killer 
would simply have to pick something. And since you now probably feel the urge 
to point out oom_adj again, I want to point out again that it's not a very 
good solution for the desktop and that Konsole/XTerm should not have such 
protection, unless the user explicitly does it themselves - e.g. Konsole can 
be set to infinite scrollback, so when accidentally running something that 
produces a huge amount of output Konsole actually could be the only right 
process to kill. So I think the case of slowly leaking group of children 
cannot be reasonably solved in code.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04  0:05         ` David Rientjes
  2010-02-04  0:18           ` Rik van Riel
  2010-02-04  7:58           ` Lubos Lunak
@ 2010-02-04  9:50           ` Jiri Kosina
  2010-02-04 21:39             ` David Rientjes
  2 siblings, 1 reply; 53+ messages in thread
From: Jiri Kosina @ 2010-02-04  9:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Wed, 3 Feb 2010, David Rientjes wrote:

> > > 		/* Forkbombs get penalized 10% of available RAM */
> > > 		if (forkcount > 500)
> > > 			points += 100;
> > 
> >  As far as I'm concerned, this is a huge improvement over the current code 
> > (and, incidentally :), quite close to what I originally wanted). I'd be 
> > willing to test it in few real-world desktop cases if you provide a patch.
[ ... ]
> Do you have any comments about the forkbomb detector or its threshold that 
> I've put in my heuristic?  I think detecting these scenarios is still an 
> important issue that we need to address instead of simply removing it from 
> consideration entirely.

Why does OOM killer care about forkbombs *at all*?

If we really want kernel to detect forkbombs (*), we'd have to establish 
completely separate infrastructure for that (with its own knobs for tuning 
and possibilities of disabling it completely).

The task of OOM killer is to find the process that caused the system 
to run out of memory, and wipe it, it's as simple as that.

The connection to forkbombs seems to be so loose that I don't see it.

(*) How is forkbomb even defined? Where does the magic constant in 
'forkcount > 500' come from? If your aim is to penalize server processes 
on very loaded web/database servers, then this is probably correct 
aproach. Otherwise, I don't seem to see the point.

Thanks,

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04  7:58           ` Lubos Lunak
@ 2010-02-04 21:34             ` David Rientjes
  2010-02-10 20:54               ` Lubos Lunak
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-04 21:34 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 4 Feb 2010, Lubos Lunak wrote:

> > There're some things that still need to be worked out,
> 
>  Ok. Just please do not let the perfect stand in the way of the good for way 
> too long.
> 

The changes to the heuristic will be a part of a larger patchset that 
basically rewrites the oom killer since there have actually been many 
other suggestions and asides that have been mentioned during the course of 
the discussion: mempolicy targeted oom killing, killing a child only with 
the highest badness score first, always ensuring the killed task shares a 
set of allowed nodes with current, preempting the oom killer for GFP_DMA 
allocations, etc.  I'll write that patchset within the next couple of 
days, but we're still talking about 2.6.35 material at the earliest.  If 
you could test it with your particular usecase we can make sure to work 
any heuristic problems out in the early stages.

> > Do you have any comments about the forkbomb detector or its threshold that
> > I've put in my heuristic?  I think detecting these scenarios is still an
> > important issue that we need to address instead of simply removing it from
> > consideration entirely.
> 
>  I think before finding out the answer it should be first figured out what the 
> question is :). Besides the vague "forkbomb" description I still don't know 
> what realistic scenarios this is supposed to handle. IMO trying to cover 
> intentional abuse is a lost fight, so I think the purpose of this code should 
> be just to handle cases when there's a mistake leading to relatively fast 
> spawning of children of a specific parent that'll lead to OOM.

Yes, forkbombs are not always malicious, they can be the result of buggy 
code and there's no other kernel mechanism that will hold them off so that 
the machine is still usable.  If a task forks and execve's thousands of 
threads on your 2GB desktop machine either because its malicious, its a 
bug, or a the user made a mistake, that's going to be detrimental 
depending on the nature of what was executed especially to your 
interactivity :)  Keep in mind that the forking parent such as a job 
scheduler or terminal and all of its individual children may have very 
small rss and swap statistics, even though cumulatively its a problem.  
In that scenario, KDE is once again going to become the target on your 
machine when in reality we should target the forkbomb.

How we target the forkbomb could be another topic for discussion.  Recall 
that the oom killer does not necessarily always kill the task with the 
highest badness() score; it will always attempt to kill one of its 
children with a seperate mm first.  That doesn't help us as much as it 
could in the forkbomb scenario since the parent could likely continue to 
fork; the end result you'll see is difficulty in getting to a command line 
where you can find and kill the parent yourself.  Thus, we need to preempt 
the preference to kill the child first when we've detected a forkbomb task 
and it is selected for oom kill (detecting a forkbomb is only a 
penalization in the heuristic, it doesn't mean automatic killing, so 
another task consuming much more memory could be chosen instead).

> The shape of 
> the children subtree doesn't matter, it can be either a parent with many 
> direct children, or children being created recursively, I think any case is 
> possible here. A realistic example would be e.g. by mistake 
> typing 'make -j22' instead of 'make -j2' and overloading the machine by too 
> many g++ instances. That would be actually a non-trivial tree of children, 
> with recursive make and sh processes in it.
> 

We can't address recursive forkbombing in the oom killer with any 
efficiency, but luckily those cases aren't very common.

>  A good way to detect this would be checking in badness() if the process has 
> any children with relatively low CPU and real time values (let's say 
> something less than a minute). If yes, the badness score should also account 
> for all these children, recursively. I'm not sure about the exact formula, 
> just summing up the memory usage like it is done now does not fit your 0-1000 
> score idea, and it's also wrong because it doesn't take sharing of memory 
> into consideration (e.g. a KDE app with several kdelibs-based children could 
> achieve a massive score here because of extensive sharing, even though the 
> actual memory usage increase caused by them could be insignificant). I don't 
> know kernel internals, so I don't know how feasible it would be, but one 
> naive idea would be to simply count how big portion of the total memory all 
> these considered processes occupy.
> 

badness() can't be called recursively on children, so we need to look for 
the simple metrics like you mentioned: cpu time, for example.  I think we 
should also look at uid

>  This indeed would not handle the case when a tree of processes would slowly 
> leak, for example there being a bug in Apache and all the forked children of 
> the master process leaking memory equally, but none of the single children 
> leaking enough to score more than a single unrelated innocent process. Here I 
> question how realistic such scenario actually would be, and mainly the actual 
> possibility of detecting such case. I do not see how code could distinguish 
> this from the case of using Konsole or XTerm to launch a number of unrelated 
> KDE/X applications each of which would occupy a considerable amount of 
> memory. Here clearly killing the Konsole/XTerm and all the spawned 
> applications with it is incorrect, so with no obvious offender the OOM killer 
> would simply have to pick something.

The memory consumption of these children were not considered in my rough 
draft, it was simply a counter of how many first-generation children each 
task has.  When Konsole forks a very large number of tasks, is it 
unreasonable to bias it with a penalty of perhaps 10% of RAM?  Or should 
we take the lowest rss size of those children, multiply it by the number 
of children, and penalize it after it reaches a certain threshold (500? 
1000?) as the "cost of running the parent"?  This heavily biases parents 
that have forked a very large number of small threads that would be 
directly killed whereas the large, memory-hog children are killed 
themselves based on their own badness() heuristic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04  9:50           ` Jiri Kosina
@ 2010-02-04 21:39             ` David Rientjes
  2010-02-05  7:35               ` Oliver Neukum
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-04 21:39 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Thu, 4 Feb 2010, Jiri Kosina wrote:

> Why does OOM killer care about forkbombs *at all*?
> 

Because the cumulative effects of a forkbomb are detrimental to the 
system and the badness() heursitic favors large memory consumers very 
heavily.  Thus, the forkbomb is never really a strong candidate for oom 
kill since the parent may consume very little memory itself and meanwhile 
KDE or another large memory consumer will get innocently killed instead as 
a result.

> If we really want kernel to detect forkbombs (*), we'd have to establish 
> completely separate infrastructure for that (with its own knobs for tuning 
> and possibilities of disabling it completely).
> 

That's what we're trying to do, we can look at the shear number of 
children that the parent has forked and check for it to be over a certain 
"forkbombing threshold" (which, yes, can be tuned from userspace), the 
uptime of those children, their resident set size, etc., to attempt to 
find a sane heuristic that penalizes them.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04  0:18           ` Rik van Riel
@ 2010-02-04 21:48             ` David Rientjes
  2010-02-04 22:06               ` Rik van Riel
  2010-02-04 22:31               ` Frans Pop
  0 siblings, 2 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-04 21:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, Rik van Riel wrote:

> > Do you have any comments about the forkbomb detector or its threshold that
> > I've put in my heuristic?  I think detecting these scenarios is still an
> > important issue that we need to address instead of simply removing it from
> > consideration entirely.
> 
> I believe that malicious users are best addressed in person,
> or preemptively through cgroups and rlimits.
> 

Forkbombs need not be the result of malicious users.

> Having a process with over 500 children is quite possible
> with things like apache, Oracle, postgres and other forking
> daemons.
> 

It's clear that the forkbomb threshold would need to be definable from 
userspace and probably default to something high such as 1000.

Keep in mind that we're in the oom killer here, though.  So we're out of 
memory and we need to kill something; should Apache, Oracle, and postgres 
not be penalized for their cost of running by factoring in something like 
this?

	(lowest rss size of children) * (# of first-generation children) / 
			(forkbomb threshold)

> Killing the parent process can result in the service
> becoming unavailable, and in some cases even data
> corruption.
> 

There's only one possible rememdy for that, which is OOM_DISABLE; the oom 
killer cannot possibly predict data corruption as the result of killing a 
process and this is no different.  Everything besides init, kthreads, 
OOM_DISABLE threads, and threads that do not share the same cpuset, memcg, 
or set of allowed mempolicy nodes are candidates for oom kill.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 21:48             ` David Rientjes
@ 2010-02-04 22:06               ` Rik van Riel
  2010-02-04 22:14                 ` David Rientjes
  2010-02-04 22:31               ` Frans Pop
  1 sibling, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2010-02-04 22:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On 02/04/2010 04:48 PM, David Rientjes wrote:

> Keep in mind that we're in the oom killer here, though.  So we're out of
> memory and we need to kill something; should Apache, Oracle, and postgres
> not be penalized for their cost of running by factoring in something like
> this?

No, they should not.

The goal of the OOM killer is to kill some process, so the
system can continue running and automatically become available
again for whatever workload the system was running.

Killing the parent process of one of the system daemons does
not achieve that goal, because you now caused a service to no
longer be available.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 22:06               ` Rik van Riel
@ 2010-02-04 22:14                 ` David Rientjes
  2010-02-10 20:54                   ` Lubos Lunak
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-04 22:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 4 Feb 2010, Rik van Riel wrote:

> > Keep in mind that we're in the oom killer here, though.  So we're out of
> > memory and we need to kill something; should Apache, Oracle, and postgres
> > not be penalized for their cost of running by factoring in something like
> > this?
> 
> No, they should not.
> 
> The goal of the OOM killer is to kill some process, so the
> system can continue running and automatically become available
> again for whatever workload the system was running.
> 
> Killing the parent process of one of the system daemons does
> not achieve that goal, because you now caused a service to no
> longer be available.
> 

The system daemon wouldn't be killed, though.  You're right that this 
heuristic would prefer the system daemon slightly more as a result of the 
forkbomb penalty, but the oom killer always attempts to sacrifice a child 
with a seperate mm before killing the selected task.  Since the forkbomb 
heuristic only adds up those children with seperate mms, we're guaranteed 
to not kill the daemon itself.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 21:48             ` David Rientjes
  2010-02-04 22:06               ` Rik van Riel
@ 2010-02-04 22:31               ` Frans Pop
  2010-02-04 22:53                 ` David Rientjes
  1 sibling, 1 reply; 53+ messages in thread
From: Frans Pop @ 2010-02-04 22:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: riel, l.lunak, balbir, linux-mm, linux-kernel, akpm,
	kosaki.motohiro, npiggin, jkosina

David Rientjes wrote:
> It's clear that the forkbomb threshold would need to be definable from
> userspace and probably default to something high such as 1000.
> 
> Keep in mind that we're in the oom killer here, though.  So we're out of
> memory and we need to kill something; should Apache, Oracle, and postgres
> not be penalized for their cost of running by factoring in something like
> this?
> 
> (lowest rss size of children) * (# of first-generation children) /
>                    (forkbomb threshold)

Shouldn't fork bomb detection take into account the age of children?
After all, long running processes with a lot of long running children are 
rather unlikely to be runaway fork _bombs_.

Children for desktop environments are more likely to be long running than 
e.g. a server process that's being DOSed.
The goal of the OOM killer is IIUC trying to identify the process thats 
causing the immediate problem so in this example it should prefer latter.

Cheers,
FJP

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 22:31               ` Frans Pop
@ 2010-02-04 22:53                 ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-04 22:53 UTC (permalink / raw)
  To: Frans Pop
  Cc: Rik van Riel, l.lunak, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, jkosina

On Thu, 4 Feb 2010, Frans Pop wrote:

> Shouldn't fork bomb detection take into account the age of children?
> After all, long running processes with a lot of long running children are 
> rather unlikely to be runaway fork _bombs_.
> 

Yeah, Lubos mentioned using cpu time as a requirement, in addition to the 
already existing child->mm != parent->mm, as a prerequisite to be added 
into the tally to check the forkbomb threshold.  I think something like 
this would be appropriate:

	struct task_cputime task_time;
	int forkcount = 0;
	int child_rss = 0;

	...

	list_for_each_entry(child, &p->children, sibling) {
		unsigned long runtime;

		task_lock(child);
		if (!child->mm || child->mm == p->mm) {
			task_unlock(child);
			continue;
		}
		thread_group_cputime(child, &task_time);
		runtime = cputime_to_jiffies(task_time.utime) +
				cputime_to_jiffies(task_time.stime);

		/*
		 * Only threads that have run for less than a second are
		 * considered toward the forkbomb, these threads rarely
		 * get to execute at all in such cases anyway.
		 */
		if (runtime < HZ) {
			task_unlock(child);
			continue;
		}
		child_rss += get_mm_rss(child->mm);
		forkcount++;
	}

	if (forkcount > sysctl_oom_forkbomb_thres) {
		/*
		 * Penalize forkbombs by considering the average rss and
		 * how many factors we are over the threshold.
		 */
		points += child_rss / sysctl_oom_forkbomb_thres;
	}

I changed the calculation from lowest child rss to average child rss, so 
this is functionally equivalent to

(average rss size of children) * (# of first-generated execve children) /
			sysctl_oom_forkbomb_thres

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 21:39             ` David Rientjes
@ 2010-02-05  7:35               ` Oliver Neukum
  2010-02-10  3:10                 ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Oliver Neukum @ 2010-02-05  7:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Jiri Kosina, Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm,
	linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin

Am Donnerstag, 4. Februar 2010 22:39:08 schrieb David Rientjes:
> > If we really want kernel to detect forkbombs (*), we'd have to establish 
> > completely separate infrastructure for that (with its own knobs for tuning 
> > and possibilities of disabling it completely).
> > 
> 
> That's what we're trying to do, we can look at the shear number of 
> children that the parent has forked and check for it to be over a certain 
> "forkbombing threshold" (which, yes, can be tuned from userspace), the 
> uptime of those children, their resident set size, etc., to attempt to 
> find a sane heuristic that penalizes them.

Wouldn't it be saner to have a selection by user, so that users that
are over the overcommit limit are targeted?

	Regards
		Oliver

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-05  7:35               ` Oliver Neukum
@ 2010-02-10  3:10                 ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-10  3:10 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Jiri Kosina, Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm,
	linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Fri, 5 Feb 2010, Oliver Neukum wrote:

> > That's what we're trying to do, we can look at the shear number of 
> > children that the parent has forked and check for it to be over a certain 
> > "forkbombing threshold" (which, yes, can be tuned from userspace), the 
> > uptime of those children, their resident set size, etc., to attempt to 
> > find a sane heuristic that penalizes them.
> 
> Wouldn't it be saner to have a selection by user, so that users that
> are over the overcommit limit are targeted?
> 

It's rather unnecessary for the forkbomb case because then it would 
unfairly penalize any user that runs lots of tasks.  The forkbomb handling 
code is really to prevent either user error, bugs in the application, or 
maliciousness.  The goal isn't necessarily to kill anything that forks an 
egregious amount of tasks (which is why we always prefer a child with a 
seperate address space than a parent, anyway) but rather to try to make 
sure the system is still usable and recoverable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 21:34             ` David Rientjes
@ 2010-02-10 20:54               ` Lubos Lunak
  2010-02-10 21:09                 ` Rik van Riel
  2010-02-10 22:25                 ` David Rientjes
  0 siblings, 2 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 20:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thursday 04 of February 2010, David Rientjes wrote:
> On Thu, 4 Feb 2010, Lubos Lunak wrote:
> >  I think before finding out the answer it should be first figured out
> > what the question is :). Besides the vague "forkbomb" description I still
> > don't know what realistic scenarios this is supposed to handle. IMO
> > trying to cover intentional abuse is a lost fight, so I think the purpose
> > of this code should be just to handle cases when there's a mistake
> > leading to relatively fast spawning of children of a specific parent
> > that'll lead to OOM.
>
> Yes, forkbombs are not always malicious, they can be the result of buggy
> code and there's no other kernel mechanism that will hold them off so that
> the machine is still usable.  If a task forks and execve's thousands of
> threads on your 2GB desktop machine either because its malicious, its a
> bug, or a the user made a mistake, that's going to be detrimental
> depending on the nature of what was executed especially to your
> interactivity :)  Keep in mind that the forking parent such as a job
> scheduler or terminal and all of its individual children may have very
> small rss and swap statistics, even though cumulatively its a problem.

 Which is why I suggested summing up the memory of the parent and its 
children.

> > The shape of
> > the children subtree doesn't matter, it can be either a parent with many
> > direct children, or children being created recursively, I think any case
> > is possible here. A realistic example would be e.g. by mistake
> > typing 'make -j22' instead of 'make -j2' and overloading the machine by
> > too many g++ instances. That would be actually a non-trivial tree of
> > children, with recursive make and sh processes in it.
>
> We can't address recursive forkbombing in the oom killer with any
> efficiency, but luckily those cases aren't very common.

 Right, I've never run a recursive make that brought my machine to its knees. 
Oh, wait.

 And why exactly is iterating over 1st level children efficient enough and 
doing that recursively is not? I don't find it significantly more expensive 
and badness() is hardly a bottleneck anyway.

> >  A good way to detect this would be checking in badness() if the process
> > has any children with relatively low CPU and real time values (let's say
> > something less than a minute). If yes, the badness score should also
> > account for all these children, recursively. I'm not sure about the exact
> > formula, just summing up the memory usage like it is done now does not
> > fit your 0-1000 score idea, and it's also wrong because it doesn't take
> > sharing of memory into consideration (e.g. a KDE app with several
> > kdelibs-based children could achieve a massive score here because of
> > extensive sharing, even though the actual memory usage increase caused by
> > them could be insignificant). I don't know kernel internals, so I don't
> > know how feasible it would be, but one naive idea would be to simply
> > count how big portion of the total memory all these considered processes
> > occupy.
>
> badness() can't be called recursively on children,
> the simple metrics like you mentioned: cpu time, for example.

 I didn't mean calling badness() recursively, just computing total memory 
usage of the parent and all the children together (and trying not to count 
shared parts more than once). If doing that accurately is too expensive, 
summing rss+swap for the parent and using only unshared memory for the 
children seems reasonably close to it and pretty cheap (surely if the code 
can find out rss+swap for each child it can equally easily find out how much 
of it is unshared?). With the assumption that the shared memory will be 
hopefully reasonably accounted in the parent's memory usage and thus only 
unshared portions for children are needed, this appears to be much more 
precise than trying to sum up rss for all like your proposal does.

> >  This indeed would not handle the case when a tree of processes would
> > slowly leak, for example there being a bug in Apache and all the forked
> > children of the master process leaking memory equally, but none of the
> > single children leaking enough to score more than a single unrelated
> > innocent process. Here I question how realistic such scenario actually
> > would be, and mainly the actual possibility of detecting such case. I do
> > not see how code could distinguish this from the case of using Konsole or
> > XTerm to launch a number of unrelated KDE/X applications each of which
> > would occupy a considerable amount of memory. Here clearly killing the
> > Konsole/XTerm and all the spawned applications with it is incorrect, so
> > with no obvious offender the OOM killer would simply have to pick
> > something.
>
> The memory consumption of these children were not considered in my rough
> draft, it was simply a counter of how many first-generation children each
> task has.

 Why exactly do you think only 1st generation children matter? Look again at 
the process tree posted by me and you'll see it solves nothing there. I still 
fail to see why counting also all other generations should be considered 
anything more than a negligible penalty for something that's not a bottleneck 
at all.

> When Konsole forks a very large number of tasks, is it 
> unreasonable to bias it with a penalty of perhaps 10% of RAM?

 It appears unresonable to penalize it with a random magic number.

> Or should 
> we take the lowest rss size of those children, multiply it by the number
> of children, and penalize it after it reaches a certain threshold (500?
> 1000?) as the "cost of running the parent"?

 And another magic number. And again it doesn't solve my initial problem, 
where the number of processes taking part in the OOM situation is counted 
only in tens.

 Simply computing the cost of the whole children subtree (or a reasonable 
approximation) avoids the need for any magic numbers and gives a much better 
representation of how costly the subtree is, since, well, it is the cost 
itself.

> This heavily biases parents 
> that have forked a very large number of small threads that would be
> directly killed whereas the large, memory-hog children are killed
> themselves based on their own badness() heuristic.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-04 22:14                 ` David Rientjes
@ 2010-02-10 20:54                   ` Lubos Lunak
  2010-02-10 21:10                     ` Rik van Riel
  0 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 20:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thursday 04 of February 2010, David Rientjes wrote:
> On Thu, 4 Feb 2010, Rik van Riel wrote:
> > The goal of the OOM killer is to kill some process, so the
> > system can continue running and automatically become available
> > again for whatever workload the system was running.
> >
> > Killing the parent process of one of the system daemons does
> > not achieve that goal, because you now caused a service to no
> > longer be available.
>
> The system daemon wouldn't be killed, though.  You're right that this
> heuristic would prefer the system daemon slightly more as a result of the
> forkbomb penalty, but the oom killer always attempts to sacrifice a child
> with a seperate mm before killing the selected task.  Since the forkbomb
> heuristic only adds up those children with seperate mms, we're guaranteed
> to not kill the daemon itself.

 Which however can mean that not killing this system daemon will be traded for 
DoS-ing the whole system, if the daemon keeps spawning new children as soon 
as the OOM killer frees up resources for them.

 This looks like wrong solution to me, it's like trying to save a target by 
shooting all incoming bombs instead of shooting the bomber. If the OOM 
situation is caused by one or a limited number of its children, or if the 
system daemon is not reponsible for the forkbomb (e.g. it's only a subtree of 
its children), then it won't be selected for killing anyway. If it is 
responsible for the forkbomb, the OOM killer can trying killing the bombs 
forever to no avail.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 20:54               ` Lubos Lunak
@ 2010-02-10 21:09                 ` Rik van Riel
  2010-02-10 21:34                   ` Lubos Lunak
  2010-02-10 22:25                 ` David Rientjes
  1 sibling, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2010-02-10 21:09 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On 02/10/2010 03:54 PM, Lubos Lunak wrote:

>   Simply computing the cost of the whole children subtree (or a reasonable
> approximation) avoids the need for any magic numbers and gives a much better
> representation of how costly the subtree is, since, well, it is the cost
> itself.

That assumes you want to kill off that entire tree.

You will not want to do that when a web server or
database server runs out of memory, because the
goal of the OOM killer is to allow the system to
continue to run and be useful. This means keeping
the services available...

-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 20:54                   ` Lubos Lunak
@ 2010-02-10 21:10                     ` Rik van Riel
  2010-02-10 21:29                       ` Lubos Lunak
  2010-02-10 22:18                       ` Alan Cox
  0 siblings, 2 replies; 53+ messages in thread
From: Rik van Riel @ 2010-02-10 21:10 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On 02/10/2010 03:54 PM, Lubos Lunak wrote:

>   Which however can mean that not killing this system daemon will be traded for
> DoS-ing the whole system, if the daemon keeps spawning new children as soon
> as the OOM killer frees up resources for them.

Killing the system daemon *is* a DoS.

It would stop eg. the database or the web server, which is
generally the main task of systems that run a database or
a web server.

-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 21:10                     ` Rik van Riel
@ 2010-02-10 21:29                       ` Lubos Lunak
  2010-02-10 22:18                       ` Alan Cox
  1 sibling, 0 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 21:29 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 10 of February 2010, Rik van Riel wrote:
> On 02/10/2010 03:54 PM, Lubos Lunak wrote:
> >   Which however can mean that not killing this system daemon will be
> > traded for DoS-ing the whole system, if the daemon keeps spawning new
> > children as soon as the OOM killer frees up resources for them.
>
> Killing the system daemon *is* a DoS.

 Maybe, but if there are two such system daemons on the machine, it's only 
half of the other DoS. And since that system daemon has already been 
identified as a forkbomb, it's probably already useless anyway and killing 
the children won't save anything. In which realistic case a system daemon has 
children that together cause OOM, yet can still be considered working after 
you kill one or a limited number of those children?

> It would stop eg. the database or the web server, which is
> generally the main task of systems that run a database or
> a web server.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 21:09                 ` Rik van Riel
@ 2010-02-10 21:34                   ` Lubos Lunak
  0 siblings, 0 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 21:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 10 of February 2010, Rik van Riel wrote:
> On 02/10/2010 03:54 PM, Lubos Lunak wrote:
> >   Simply computing the cost of the whole children subtree (or a
> > reasonable approximation) avoids the need for any magic numbers and gives
> > a much better representation of how costly the subtree is, since, well,
> > it is the cost itself.
>
> That assumes you want to kill off that entire tree.

 As said in another mail, I think I actually do, since the entire tree is 
indentified as the problem. But regardless of that, surely computing the cost 
of a forkbomb by computing something that is close to the actual cost of it 
is better than trying magic numbers?

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 21:10                     ` Rik van Riel
  2010-02-10 21:29                       ` Lubos Lunak
@ 2010-02-10 22:18                       ` Alan Cox
  2010-02-10 22:31                         ` David Rientjes
  2010-02-11  9:50                         ` Lubos Lunak
  1 sibling, 2 replies; 53+ messages in thread
From: Alan Cox @ 2010-02-10 22:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lubos Lunak, David Rientjes, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

> Killing the system daemon *is* a DoS.
> 
> It would stop eg. the database or the web server, which is
> generally the main task of systems that run a database or
> a web server.

One of the problems with picking on tasks that fork a lot is that
describes apache perfectly. So a high loaded apache will get shot over a
rapid memory eating cgi script.

Any heuristic is going to be iffy - but that isn't IMHO a good one to
work from. If anything "who allocated lots of RAM recently" may be a
better guide but we don't keep stats for that.

Alan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 20:54               ` Lubos Lunak
  2010-02-10 21:09                 ` Rik van Riel
@ 2010-02-10 22:25                 ` David Rientjes
  2010-02-11 10:16                   ` Lubos Lunak
  1 sibling, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-10 22:25 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 10 Feb 2010, Lubos Lunak wrote:

> > Yes, forkbombs are not always malicious, they can be the result of buggy
> > code and there's no other kernel mechanism that will hold them off so that
> > the machine is still usable.  If a task forks and execve's thousands of
> > threads on your 2GB desktop machine either because its malicious, its a
> > bug, or a the user made a mistake, that's going to be detrimental
> > depending on the nature of what was executed especially to your
> > interactivity :)  Keep in mind that the forking parent such as a job
> > scheduler or terminal and all of its individual children may have very
> > small rss and swap statistics, even though cumulatively its a problem.
> 
>  Which is why I suggested summing up the memory of the parent and its 
> children.
> 

That's almost identical to the current heuristic where we sum half the 
size of the children's VM size, unfortunately it's not a good indicator of 
forkbombs since in your particular example it would be detrimental to 
kdeinit.  My heursitic considers runtime of the children as an indicator 
of a forkbombing parent since such tasks don't typically get to run 
anyway.  The rss or swap usage of a child with a seperate address space 
simply isn't relevant to the badness score of the parent, it unfairly 
penalizes medium/large server jobs.

> > We can't address recursive forkbombing in the oom killer with any
> > efficiency, but luckily those cases aren't very common.
> 
>  Right, I've never run a recursive make that brought my machine to its knees. 
> Oh, wait.
> 

That's completely outside the scope of the oom killer, though: it is _not_ 
the oom killer's responsibility for enforcing a kernel-wide forkbomb 
policy, which would be much better handled at execve() time.

It's a very small part of my badness heuristic, depending on the average 
size of the children's rss and swap usage, because we want to slightly 
penalize tasks that fork an extremely large number of tasks that have no 
substantial runtime; memory is being consumed but very little work is 
getting done by those thousand children.  This would most often than not 
be used only to break ties when two parents have similar memory 
consumption themselves but one is obviously oversubscribing the system.

>  And why exactly is iterating over 1st level children efficient enough and 
> doing that recursively is not? I don't find it significantly more expensive 
> and badness() is hardly a bottleneck anyway.
> 

If we look at children's memory usage recursively, then we'll always end 
up selecting init_task.

> > The memory consumption of these children were not considered in my rough
> > draft, it was simply a counter of how many first-generation children each
> > task has.
> 
>  Why exactly do you think only 1st generation children matter? Look again at 
> the process tree posted by me and you'll see it solves nothing there. I still 
> fail to see why counting also all other generations should be considered 
> anything more than a negligible penalty for something that's not a bottleneck 
> at all.
> 

You're specifying a problem that is outside the scope of the oom killer, 
sorry.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 22:18                       ` Alan Cox
@ 2010-02-10 22:31                         ` David Rientjes
  2010-02-11  9:50                         ` Lubos Lunak
  1 sibling, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-10 22:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Rik van Riel, Lubos Lunak, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 10 Feb 2010, Alan Cox wrote:

> One of the problems with picking on tasks that fork a lot is that
> describes apache perfectly. So a high loaded apache will get shot over a
> rapid memory eating cgi script.
> 

With my rewrite, the oom killer would not select apache but rather the 
child with a seperate address space that is consuming the most amount of 
allowed memory and only when a configurable number of such children (1000 
by default) have not had any runtime.  My heuristic is only meant to 
slightly penalize such tasks so that they can be distinguished from oom 
kill from other parents with comparable memory usage.  Enforcing a strict 
forkbomb policy is out of the scope of the oom killer, though, so no 
attempt was made.

> Any heuristic is going to be iffy - but that isn't IMHO a good one to
> work from. If anything "who allocated lots of RAM recently" may be a
> better guide but we don't keep stats for that.
> 

That's what my heuristic basically does, if a parent is identified as a 
forkbomb, then it is only penalized by averaging the memory consumption of 
those children and then multiplying it by the same number of times the 
configurable forkbomb threshold was reached.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 22:18                       ` Alan Cox
  2010-02-10 22:31                         ` David Rientjes
@ 2010-02-11  9:50                         ` Lubos Lunak
  1 sibling, 0 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-11  9:50 UTC (permalink / raw)
  To: Alan Cox
  Cc: Rik van Riel, David Rientjes, Balbir Singh, linux-mm,
	linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin,
	Jiri Kosina

On Wednesday 10 of February 2010, Alan Cox wrote:
> > Killing the system daemon *is* a DoS.
> >
> > It would stop eg. the database or the web server, which is
> > generally the main task of systems that run a database or
> > a web server.
>
> One of the problems with picking on tasks that fork a lot is that
> describes apache perfectly. So a high loaded apache will get shot over a
> rapid memory eating cgi script.

 It will not. If it's only a single cgi script, that that child should be 
selected by badness(), not the parent.

 I personally consider the logic of trying to find the offender using 
badness() and then killing its child instead to be flawed. Already badness() 
itself should select what to kill and that should be killed. If it's a single 
process that is the offender, it should be killed. If badness() decides it is 
a whole subtree responsible for the situation, then the top of it needs to be 
killed, otherwise the reason for the problem will remain.

 I expect the current logic of trying to kill children first is based on the 
system daemon logic, but if e.g. Apache master process itself causes OOM, 
then the kernel itself has to way to find out if it's an important process 
that should be protected or if it's some random process causing a forkbomb. 
From the kernel point's of view, if the Apache master process caused the 
problem, the the problem should be solved there. If the reason for the 
problem was actually e.g. a temporary high load on the server, then Apache is 
probably misconfigured, and if it really should stay running no matter what, 
then I guess that's the case to use oom_adj. But otherwise, from OOM killer's 
point of view, that is where the problem was.

 Of course, the algorithm used in badness() should be careful not to propagate 
the excessive memory usage in that case to the innocent parent. This problem 
existed in the current code until it was fixed by the "/2" recently, and at 
least my current proposal actually suffers from it too. But I envision 
something like this could handle it nicely (pseudocode):

int oom_children_memory_usage(task)
    {
    // Memory shared with the parent should not be counted again.
    // Since it's expensive to find that out exactly, just assume
    // that the amount of shared memory that is not shared with the parent
    // is insignificant.
    total = unshared_rss(task)+unshared_swap(task);
    foreach_child(child,task)
        total += oom_children_memory_usage(child);
    return total;
    }
int badness(task)
    {
    int total_memory = 0;
    ...
    int max_child_memory = 0; // memory used by that child
    int max_child_memory_2 = 0; // the 2nd most memory used by a child
    foreach_child(child,task)
        {
        if(sharing_the_same_memory(child,task))
            continue;
        if( real_time(child) > 1minute )
            continue; // running long, not a forkbomb
        int memory = oom_children_memory_usage(task);
        total_memory += memory;
        if( memory > max_child_memory )
            {
            max_child_memory_2 = max_child_memory;
            max_child_memory = memory;
            }
        else if( memory > max_child_memory_2 )
            max_child_memory_2 = memory;
        }
    if( max_child_memory_2 != 0 ) // there were at least two children
        {
        if( max_child_memory > max_child_memory_2 / 2 )
            {
// There is only a single child that contributes the majority of memory
// used by all children. Do not add it to the total, so that if that process
// is the biggest offender, the killer picks it instead of this parent.
            total_memory -= max_child_memory;
            }
        }
    ...
    }

 The logic is simply that a process is responsible for its children only if 
their cost is similar. If one of them stands out, it is responsible for 
itself and the parent is not. This is intentionally not done recursively in 
oom_children_memory_usage() to cover also the case when e.g. parallel make 
runs too many processes wrapped by shell, in that case making any of those 
shell instances responsible for its child doesn't help anything, but making 
make responsible for all of them helps.

 Alternatively, if somebody has a good use case where first going after a 
child may make sense, then it perhaps would help to 
add 'oom_recently_killed_children' to each task, and increasing it whenever a 
child is killed instead of the responsible parent. As soon as the value 
within a reasonably short time is higher than let's say 5, then apparently 
killing children does not help and the mastermind has to go.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-10 22:25                 ` David Rientjes
@ 2010-02-11 10:16                   ` Lubos Lunak
  2010-02-11 21:17                     ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-11 10:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 10 of February 2010, David Rientjes wrote:
> On Wed, 10 Feb 2010, Lubos Lunak wrote:
> >  Which is why I suggested summing up the memory of the parent and its
> > children.
>
> That's almost identical to the current heuristic where we sum half the
> size of the children's VM size, unfortunately it's not a good indicator of
> forkbombs since in your particular example it would be detrimental to
> kdeinit.

 I believe that with the algorithm no longer using VmSize and being careful 
not to count shared memory more than once this would not be an issue and 
kdeinit would be reasonably safe. KDE does not use _that_ much memory to 
score higher than something that caused OOM :).

> My heursitic considers runtime of the children as an indicator 
> of a forkbombing parent since such tasks don't typically get to run
> anyway.  The rss or swap usage of a child with a seperate address space
> simply isn't relevant to the badness score of the parent, it unfairly
> penalizes medium/large server jobs.

 Our definitions of 'forkbomb' then perhaps differ a bit. I 
consider 'make -j100' a kind of a forkbomb too, it will very likely overload 
the machine too as soon as the gcc instances use up all the memory. For that 
reason also using CPU time <1second will not work here, while using real time 
<1minute would.

 That long timeout would have the weakness that when running at the same time 
reasonable 'make -j4' and Firefox that'd immediatelly go crazy, then maybe 
the make job could be targeted instead if its total cost would go higher. 
However, here I again believe that the fixed metrics for computing memory 
usage would work well enough to let that happen only when the total cost of 
the make job would be actually higher than that of the offender and in that 
case it is kind of an offender too.

 Your protection seems to cover only "for(;;) if(fork() == 0) break;" , while 
I believe mine could handle also "make -j100" or the bash forkbomb ":()
{ :|:& };:" (i.e. "for(;;) fork();").

> > > We can't address recursive forkbombing in the oom killer with any
> > > efficiency, but luckily those cases aren't very common.
> >
> >  Right, I've never run a recursive make that brought my machine to its
> > knees. Oh, wait.
>
> That's completely outside the scope of the oom killer, though: it is _not_
> the oom killer's responsibility for enforcing a kernel-wide forkbomb
> policy

 Why? It repeatedly causes OOM here (and in fact it is the only common OOM or 
forkbomb I ever encounter). If OOM killer is the right place to protect 
against a forkbomb that spawns a large number of 1st level children, then I 
don't see how this is different.

> >  And why exactly is iterating over 1st level children efficient enough
> > and doing that recursively is not? I don't find it significantly more
> > expensive and badness() is hardly a bottleneck anyway.
>
> If we look at children's memory usage recursively, then we'll always end
> up selecting init_task.

 Not if the algorithm does not propagate the top of the problematic subtree 
higher, see my reply to Alan Cox.

> >  Why exactly do you think only 1st generation children matter? Look again
> > at the process tree posted by me and you'll see it solves nothing there.
> > I still fail to see why counting also all other generations should be
> > considered anything more than a negligible penalty for something that's
> > not a bottleneck at all.
>
> You're specifying a problem that is outside the scope of the oom killer,
> sorry.

 But it could be inside of the scope, since it causes OOM, and I don't think 
it's an unrealistic or rare use case. I don't consider it less likely than 
spawning a large number of direct children. If you want to cover only 
certified reasons for causing OOM, it can be as well said that userspace is 
not allowed to cause OOM at all.

-- 
 Lubos Lunak
 openSUSE Boosters team, KDE developer
 l.lunak@suse.cz , l.lunak@kde.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-11 10:16                   ` Lubos Lunak
@ 2010-02-11 21:17                     ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-11 21:17 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 11 Feb 2010, Lubos Lunak wrote:

>  I believe that with the algorithm no longer using VmSize and being careful 
> not to count shared memory more than once this would not be an issue and 
> kdeinit would be reasonably safe. KDE does not use _that_ much memory to 
> score higher than something that caused OOM :).
> 

Your suggestion of summing up the memory of the parent and its children 
would clearly bias kdeinit if it forks most of kde's threads as you 
mentioned earlier in the thread.  Imagine it, or another server 
application that Rik mentioned, if all children are first generation: then 
it would always be selected if that it is the only task operating on the 
system.  For a web server, for instance, where each query is handled by a 
seperate thread, we'd obviously prefer to kill a child thread instead of 
making the entire server unresponsive.  That type of algorithm in the oom 
killer and to kill the parent instead is just a non-starter.

>  Our definitions of 'forkbomb' then perhaps differ a bit. I 
> consider 'make -j100' a kind of a forkbomb too, it will very likely overload 
> the machine too as soon as the gcc instances use up all the memory. For that 
> reason also using CPU time <1second will not work here, while using real time 
> <1minute would.
> 

1 minute?  Unless you've got one of SGI's 4K cpu machines where these 1000 
threads would actually get any runtime _at_all_ in such circumstances, 
that threshold is unreasonable.

A valid point that wasn't raised is although we can't always detect out of 
control forking applications, we certainly should do some due diligence in 
making sure other applications aren't unfairly penalized when you do 
make -j100, for example.  That's not the job of the forkbomb detector in 
my heuristic, however, it's the job of the baseline itself.  In such 
scenarios (and when we can't allocate or free any memory), the baseline is 
responsible for identifying these tasks and killing them itself because 
they are using an excessive amount of memory.

>  Your protection seems to cover only "for(;;) if(fork() == 0) break;" , while 
> I believe mine could handle also "make -j100" or the bash forkbomb ":()
> { :|:& };:" (i.e. "for(;;) fork();").
> 

Again, it's not protection against forkbombs: the oom killer is not the 
place where you want to enforce any policy that prohibits that.  

>  Why? It repeatedly causes OOM here (and in fact it is the only common OOM or 
> forkbomb I ever encounter). If OOM killer is the right place to protect 
> against a forkbomb that spawns a large number of 1st level children, then I 
> don't see how this is different.
> 

We're not protecting against a large number of first-generation children, 
we're simply penalizing them because the oom killer chooses to kill a 
large memory-hogging task instead of the parent first.  This shouldn't be 
described as "forkbomb detection" because thats outside the scope of the 
oom killer or VM, for that matter.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2010-02-11 21:17 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-02-01 22:02 Improving OOM killer Lubos Lunak
2010-02-01 23:53 ` David Rientjes
2010-02-02 21:10   ` Lubos Lunak
2010-02-03  1:41     ` David Rientjes
2010-02-03  1:52       ` KAMEZAWA Hiroyuki
2010-02-03  2:12         ` David Rientjes
2010-02-03  2:12           ` KAMEZAWA Hiroyuki
2010-02-03  2:36             ` [patch] sysctl: clean up vm related variable declarations David Rientjes
2010-02-03  8:07               ` KOSAKI Motohiro
2010-02-03  8:17               ` Balbir Singh
2010-02-03 22:54       ` Improving OOM killer Lubos Lunak
2010-02-04  0:00         ` David Rientjes
2010-02-03  7:50 ` KOSAKI Motohiro
2010-02-03  9:40   ` David Rientjes
2010-02-03  8:57 ` Balbir Singh
2010-02-03 12:10   ` Lubos Lunak
2010-02-03 12:25     ` Balbir Singh
2010-02-03 15:00       ` Minchan Kim
2010-02-03 16:06         ` Minchan Kim
2010-02-03 21:22       ` Lubos Lunak
2010-02-03 14:49 ` Rik van Riel
2010-02-03 17:01   ` Balbir Singh
2010-02-03 18:58     ` David Rientjes
2010-02-03 19:29       ` Frans Pop
2010-02-03 19:52         ` David Rientjes
2010-02-03 20:12           ` Frans Pop
2010-02-03 20:26             ` David Rientjes
2010-02-03 22:55       ` Lubos Lunak
2010-02-04  0:05         ` David Rientjes
2010-02-04  0:18           ` Rik van Riel
2010-02-04 21:48             ` David Rientjes
2010-02-04 22:06               ` Rik van Riel
2010-02-04 22:14                 ` David Rientjes
2010-02-10 20:54                   ` Lubos Lunak
2010-02-10 21:10                     ` Rik van Riel
2010-02-10 21:29                       ` Lubos Lunak
2010-02-10 22:18                       ` Alan Cox
2010-02-10 22:31                         ` David Rientjes
2010-02-11  9:50                         ` Lubos Lunak
2010-02-04 22:31               ` Frans Pop
2010-02-04 22:53                 ` David Rientjes
2010-02-04  7:58           ` Lubos Lunak
2010-02-04 21:34             ` David Rientjes
2010-02-10 20:54               ` Lubos Lunak
2010-02-10 21:09                 ` Rik van Riel
2010-02-10 21:34                   ` Lubos Lunak
2010-02-10 22:25                 ` David Rientjes
2010-02-11 10:16                   ` Lubos Lunak
2010-02-11 21:17                     ` David Rientjes
2010-02-04  9:50           ` Jiri Kosina
2010-02-04 21:39             ` David Rientjes
2010-02-05  7:35               ` Oliver Neukum
2010-02-10  3:10                 ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).