* Improving OOM killer
@ 2010-02-01 22:02 Lubos Lunak
2010-02-01 23:53 ` David Rientjes
` (3 more replies)
0 siblings, 4 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-01 22:02 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, Balbir Singh,
Nick Piggin, Jiri Kosina
Hello,
I'd like to suggest some changes to the OOM killer code that I believe
improve its behaviour - here it turns it from something completely useless and
harmful into something that works very well. I'm first posting changes for
review, I'll reformat the patch later as needed if approved. Given that it's
been like this for about a decade, I get the feeling there must be some
strange catch.
My scenario is working in a KDE desktop session and accidentally running
parallel make in doc/ subdirectory of sources of a KDE module. As I use
distributed compiling, I run make with -j20 or more, but, as the tool used for
processing KDE documentation is quite memory-intensive, running this many
of them is more than enough to consume all the 2GB RAM in the machine. What
happens in that case is that the machine becomes virtually unresponsive,
where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd
actually redeem the situation. If I wait long enough for something to happen,
which can be even hours, the action that ends the situation is killing one of
the most vital KDE processes, rendering the whole session useless and making
me lose all unsaved data.
The process tree looks roughly like this:
init
|- kdeinit
| |- ksmserver
| | |- kwin
| |- <other>
|- konsole
|- make
|- sh
| |- meinproc4
|- sh
| |- meinproc4
|- <etc>
What happens is that OOM killer usually selects either ksmserver (KDE session
manager) or kdeinit (KDE master process that spawns most KDE processes). Note
that in either case OOM killer does not reach the point of killing the actual
offender - it will randomly kill in the tree under kdeinit until it decides
to kill ksmserver, which means terminating the desktop session. As konsole is
a KUniqueApplication, it forks into background and gets reparented to init,
thus getting away from the kdeinit subtree. Since the memory pressure is
distributed among several meinproc4 processes, the badness does not get
summed up in its make grandparent, as badness() does this only for direct
parents.
Each meinproc4 process still uses a considerable amount of memory, so one
could assume that the situation would be solved by simply killing them one by
one, but it is not, because of what I consider a poor metric for
measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of
the address space taken by the process, which in practice does not say much
about how much memory the process actually uses. For example, /proc/*/status
for one selected KDE process:
VmPeak: 534676 kB
VmSize: 528340 kB
VmLck: 0 kB
VmHWM: 73464 kB
VmRSS: 73388 kB
VmData: 142332 kB
VmStk: 92 kB
VmExe: 44 kB
VmLib: 91232 kB
VmPTE: 716 kB
And various excerpts from /proc/*/smaps for this process:
...
7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0
Size: 8192 kB
Rss: 16 kB
Referenced: 16 kB
...
7f7b40055000-7f7b44000000 ---p 00000000 00:00 0
Size: 65196 kB
Rss: 0 kB
Referenced: 0 kB
...
7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01
790267 /usr/lib64/kde4/libnsplugin.so
Size: 2048 kB
Rss: 0 kB
Referenced: 0 kB
...
7f7b48300000-7f7b4927d000 rw-s 00000000 08:01
58690 /var/tmp/kdecache-seli/kpc/kde-icon-cache.data
Size: 15860 kB
Rss: 24 kB
Referenced: 24 kB
I assume the first one is stack, search me what the second and third ones are
(there appears to be one such mapping as the third one for each .so used),
the last one is a mapping of a large cache file that's nevertheless rarely
used extensively and even then it's backed by a file. In other words, none of
this actually uses much of real memory, yet right now it's the process that
would get killed for using about 70MB of memory, even though it's not the
offender. The offender scores only about 1/3 of that badness, even though it
uses almost double the amount of memory:
VmPeak: 266508 kB
VmSize: 266504 kB
VmLck: 0 kB
VmHWM: 118208 kB
VmRSS: 118208 kB
VmData: 98512 kB
VmStk: 84 kB
VmExe: 60 kB
VmLib: 48944 kB
VmPTE: 536 kB
And the offender is only 14th in the list of badness candidates. Speaking of
which, the following is quite useful for seeing all processes sorted by
badness:
ls /proc/*/oom_score | grep -v self | sed 's/\(.*\)\/\(.*\)/echo -n "\1 "; \
echo -n "`cat \1\/\2 2>\/dev\/null` "; readlink \1\/exe || echo/'| sh | \
sort -nr +1
Therefore, I suggest doing the following changes in mm/oom_kill.c :
=====
--- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100
+++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100
@@ -69,7 +69,7 @@ unsigned long badness(struct task_struct
/*
* The memory size of the process is the basis for the badness.
*/
- points = mm->total_vm;
+ points = get_mm_rss(mm);
/*
* After this unlock we can no longer dereference local variable `mm'
@@ -83,21 +83,6 @@ unsigned long badness(struct task_struct
return ULONG_MAX;
/*
- * Processes which fork a lot of child processes are likely
- * a good choice. We add half the vmsize of the children if they
- * have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of children. In case a single
- * child is eating the vast majority of memory, adding only half
- * to the parents will make the child our kill candidate of choice.
- */
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
- }
-
- /*
* CPU time is in tens of seconds and run time is in thousands
* of seconds. There is no particular reason for this other than
* that it turned out to work very well in practice.
=====
In other words, use VmRSS for measuring memory usage instead of VmSize, and
remove child accumulating.
I hope the above is good enough reason for the first change. VmSize includes
things like read-only mappings, memory mappings that are actually unused,
mappings backed by a file, mappings from video drivers, and so on. VmRSS is
actual real memory used, which is what mostly matters here. While it may not
be perfect, it is certainly an improvement.
The second change should be done on the basis that it does more harm than
good. In this specific case, it does not help to identify the source of the
problem, and it incorrectly identifies kdeinit as the problem solely on the
basis that it spawned many other processes. I think the fact that you had to
add special protection for init already hints that this is a problem - any
session manager, process launcher or even xterm used for launching apps is
yet another init.
I also have problems finding a case where the child accounting would actually
help in practice. I mean, I can certainly come up with something in theory,
but this looks to me like a solution to a very synthetic problem. In which
realistic case will one process launch a limited number of children, where
all of them will consume memory, but just killing the children one by one
won't avoid the problem reasonably? This is unlikely to avoid a forkbomb, as
in that case the number of children will be the problem. It is not necessary
for the case of just one child misbehaving and being restarted, nor will it
work there. So what is that supposed to fix, and is it more likely than the case
of a process launching several unrelated children?
If the children accounting is supposed to handle cases like forked children
of Apache, then I suggest it is adjusted only to count children that have
been forked from the parent but there has been no exec(). I'm afraid I don't
know how to detect that.
When running a kernel with these changes applied, I can safely do the
above-described case of running parallel doc generation in KDE. No clearly
innocent process is selected for killing, the first choice is always an
offender.
Moreover, the remedy is almost instant; there is only a fraction of a second
during which the machine is overloaded by the I/O of paging in and out (I do
not use swap, but there is a large amount of memory used by read-only
mappings of binaries, libraries and various other files that is in the
original case rendering the machine unresponsive - I assume this is because
the kernel tries to kill an innocent process, but the offenders immediately
consume anything that is freed, requiring even memory used by code that is to
be executed to be paged in from files again).
I consider the patches to be definite improvements, so if they are ok, I will
format them as necessary. Now, what is the catch?
--
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org
^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Improving OOM killer
  2010-02-01 22:02 Improving OOM killer Lubos Lunak
@ 2010-02-01 23:53 ` David Rientjes
  2010-02-02 21:10   ` Lubos Lunak
  2010-02-03  7:50 ` KOSAKI Motohiro
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-01 23:53 UTC (permalink / raw)
To: Lubos Lunak
Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro,
	Balbir Singh, Nick Piggin, Jiri Kosina

On Mon, 1 Feb 2010, Lubos Lunak wrote:

> Hello,
>
> I'd like to suggest some changes to the OOM killer code that I believe
> improve its behaviour - here it turns it from something completely useless
> and harmful into something that works very well. I'm first posting changes
> for review, I'll reformat the patch later as needed if approved. Given that
> it's been like this for about a decade, I get the feeling there must be
> some strange catch.
>

Thanks for your detailed explanation! It will be very helpful to have
your perspective and enthusiasm as we address oom killer issues.

I don't quite understand how you can say the oom killer is "completely
useless and harmful." It certainly fulfills its purpose, which is to kill
a memory hogging task so that a page allocation may succeed when reclaim
has failed to free any memory. It's useful in that regard, although I
don't think you'll find any argument that killing user tasks could always
be described as "harmful" in some way, at least to somebody :)

> My scenario is working in a KDE desktop session and accidentally running
> parallel make in the doc/ subdirectory of sources of a KDE module. As I use
> distributed compiling, I run make with -j20 or more, but, as the tool used
> for processing KDE documentation is quite memory-intensive, running this
> many of them is more than enough to consume all the 2GB RAM in the machine.
> What happens in that case is that the machine becomes virtually
> unresponsive, where even Ctrl+Alt+F1 can take minutes, not to mention some
> action that'd actually redeem the situation. If I wait long enough for
> something to happen, which can be even hours, the action that ends the
> situation is killing one of the most vital KDE processes, rendering the
> whole session useless and making me lose all unsaved data.
>

That's the point at which direct reclaim can no longer free any memory and
we must invoke the oom killer as opposed to potentially deadlocking the
system.

> The process tree looks roughly like this:
>
> init
> |- kdeinit
> | |- ksmserver
> | | |- kwin
> | |- <other>
> |- konsole
> |- make
> |- sh
> | |- meinproc4
> |- sh
> | |- meinproc4
> |- <etc>
>
> What happens is that OOM killer usually selects either ksmserver (KDE
> session manager) or kdeinit (KDE master process that spawns most KDE
> processes). Note that in either case OOM killer does not reach the point of
> killing the actual offender - it will randomly kill in the tree under
> kdeinit until it decides to kill ksmserver, which means terminating the
> desktop session. As konsole is a KUniqueApplication, it forks into
> background and gets reparented to init, thus getting away from the kdeinit
> subtree. Since the memory pressure is distributed among several meinproc4
> processes, the badness does not get summed up in its make grandparent, as
> badness() does this only for direct parents.
>

There's no randomness involved in selecting a task to kill; the oom killer
does, however, try to kill a child with separate memory first instead of
the parent, and that selection depends on the ordering of the selected
task's child list. As I mentioned very early this morning, that could
certainly be improved to kill the child with the highest badness() score
first.
Something like this untested patch could do it:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -439,6 +439,9 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			    const char *message)
 {
 	struct task_struct *c;
+	struct task_struct *victim = NULL;
+	unsigned long victim_points = 0;
+	struct timespec uptime;
 
 	if (printk_ratelimit())
 		dump_header(p, gfp_mask, order, mem);
@@ -455,14 +458,20 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	/* Try to kill a child first */
+	/* Try to kill the most memory-hogging child first */
+	do_posix_clock_monotonic_gettime(&uptime);
 	list_for_each_entry(c, &p->children, sibling) {
+		unsigned long cpoints;
+
 		if (c->mm == p->mm)
 			continue;
-		if (!oom_kill_task(c))
-			return 0;
+		cpoints = badness(c, uptime.tv_sec);
+		if (cpoints > victim_points) {
+			victim = c;
+			victim_points = cpoints;
+		}
 	}
-	return oom_kill_task(p);
+	return oom_kill_task(victim ? victim : p);
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR

This works because the conditions on which oom_kill_task(victim) fails are
when !victim->mm or victim->signal->oom_adj == OOM_DISABLE, which we're
guarded against since badness(victim) returns 0 for both.

The process tree that you posted shows a textbook case for using
/proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
protected from getting selected for oom kill. In your own words, this
"spawns most KDE processes," so it's an ideal place to set an oom_adj value
of OOM_DISABLE since that value is inheritable to children and, thus, all
children are implicitly protected as well.
> Each meinproc4 process still uses a considerable amount of memory, so one > could assume that the situation would be solved by simply killing them one by > one, but it is not so because of using what I consider poor metric for > measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of > the address space taken by the process, which in practice does not say much > about how much memory the process actually uses. For example, /proc/*/status > for one selected KDE process: > > VmPeak: 534676 kB > VmSize: 528340 kB > VmLck: 0 kB > VmHWM: 73464 kB > VmRSS: 73388 kB > VmData: 142332 kB > VmStk: 92 kB > VmExe: 44 kB > VmLib: 91232 kB > VmPTE: 716 kB > > And various excerpts from /proc/*/smaps for this process: > ... > 7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0 > Size: 8192 kB > Rss: 16 kB > Referenced: 16 kB > ... > 7f7b40055000-7f7b44000000 ---p 00000000 00:00 0 > Size: 65196 kB > Rss: 0 kB > Referenced: 0 kB > ... > 7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01 > 790267 /usr/lib64/kde4/libnsplugin.so > Size: 2048 kB > Rss: 0 kB > Referenced: 0 kB > ... > 7f7b48300000-7f7b4927d000 rw-s 00000000 08:01 > 58690 /var/tmp/kdecache-seli/kpc/kde-icon-cache.data > Size: 15860 kB > Rss: 24 kB > Referenced: 24 kB > > I assume the first one is stack, search me what the second and third ones are > (there appears to be one such mapping as the third one for each .so used), > the last one is a mapping of a large cache file that's nevertheless rarely > used extensively and even then it's backed by a file. In other words, none of > this actually uses much of real memory, yet right now it's the process that > would get killed for using about 70MB memory, even though it's not the > offender. 
The offender scores only about 1/3 of its badness, even though it > uses almost the double amount of memory: > > VmPeak: 266508 kB > VmSize: 266504 kB > VmLck: 0 kB > VmHWM: 118208 kB > VmRSS: 118208 kB > VmData: 98512 kB > VmStk: 84 kB > VmExe: 60 kB > VmLib: 48944 kB > VmPTE: 536 kB > Using VmSize, however, allows us to define the most important task to kill for the oom killer: memory leakers. Memory leakers are the single most important tasks to identify with the oom killer and aren't obvious when using rss because leaked memory does not stay resident in RAM. I understand your system may not have such a leaker and it is simply overcommitted on a 2GB machine, but using rss loses that ability. It also makes tuning oom killer priorities with /proc/pid/oom_adj almost impossible since a task's rss is highly dynamic and we cannot speculate on the state of the VM at the time of oom. > Therefore, I suggest doing the following changes in mm/oom_kill.c : > > ===== > --- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100 > +++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100 > @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct > /* > * The memory size of the process is the basis for the badness. > */ > - points = mm->total_vm; > + points = get_mm_rss(mm); > > /* > * After this unlock we can no longer dereference local variable `mm' > @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct > return ULONG_MAX; > > /* > - * Processes which fork a lot of child processes are likely > - * a good choice. We add half the vmsize of the children if they > - * have an own mm. This prevents forking servers to flood the > - * machine with an endless amount of children. In case a single > - * child is eating the vast majority of memory, adding only half > - * to the parents will make the child our kill candidate of choice. 
> - */ > - list_for_each_entry(child, &p->children, sibling) { > - task_lock(child); > - if (child->mm != mm && child->mm) > - points += child->mm->total_vm/2 + 1; > - task_unlock(child); > - } > - > - /* > * CPU time is in tens of seconds and run time is in thousands > * of seconds. There is no particular reason for this other than > * that it turned out to work very well in practice. > ===== > > In other words, use VmRSS for measuring memory usage instead of VmSize, and > remove child accumulating. > > I hope the above is good enough reason for the first change. VmSize includes > things like read-only mappings, memory mappings that is actually unused, > mappings backed by a file, mappings from video drivers, and so on. VmRSS is > actual real memory used, which is what mostly matters here. While it may not > be perfect, it is certainly an improvement. > It's not for a large number of users, the consumer of the largest amount of rss is not necessarily the task we always want to kill. Just because an order-0 page allocation fails does not mean we want to kill the task that would free the largest amount of RAM. I understand that KDE is extremely important to your work environment and if you lose it, it seems like a failure of Linux and the VM. However, the kernel cannot possibly know what applications you believe to be the most important. For that reason, userspace is able to tune the badness() score by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit. You have the ability to protect KDE from getting oom killed, you just need to use it. The natural response to this is: well, you can't possibly expect all users to tune /proc/pid/oom_adj just in case we are oom. I can expect KDE to protect itself, though, since it is of sufficient importance to its users that it should protect itself, as I expect the same of the other examples that have arisen in recent months concerning Xorg. 
There's a second caveat, however, to just protecting applications from the oom killer using OOM_DISABLE. And that goes back to my original point of being able to define memory leakers. /proc/pid/oom_adj actually works quite well to define when a task is using far more memory than expected. You can say the VmSize doesn't accurately reflect the amount of memory an application is using, but it is both static enough and well understood from a pure arithemtic standpoint for what a sane range is for a particular application. It is then possible to use oom_adj to define when that memory usage is egregious and considered "rogue" but the oom killer so that it warrants being killed. > The second change should be done on the basis that it does more harm than > good. In this specific case, it does not help to identify the source of the > problem, and it incorrectly identifies kdeinit as the problem solely on the > basis that it spawned many other processes. I think it's already quite hinted > that this is a problem by the fact that you had to add a special protection > for init - any session manager, process launcher or even xterm used for > launching apps is yet another init. > The kernel can't be expected to know that, you must tell it via /proc/pid/oom_adj. > I also have problems finding a case where the child accounting would actually > help. I mean, in practice, I can certainly come up with something in theory, > and this looks to me like a solution to a very synthesized problem. In which > realistic case will one process launch a limited number of children, where > all of them will consume memory, but just killing the children one by one > won't avoid the problem reasonably? This is unlikely to avoid a forkbomb, as > in that case the number of children will be the problem. It is not necessary > for just one children misbehaving and being restarted, nor will it work > there. 
So what is that supposed to fix, and is it more likely than the case > of a process launching several unrelated children? > Right, I believe Kame is working on a forkbomb detector that would replace this logic. > Moreover, the remedy is almost instant, there is only a fraction of second of > when the machine is overloaded by the I/O of swapping pages in and out (I do > not use swap, but there is a large amount of memory used by read-only > mappings of binaries, libraries or various other files that is in the > original case rendering the machine unresponsive - I assume this is because > the kernel tries to kill an innocent process, but the offenders immediatelly > consume anything that is freed, requiring even memory used by code that is to > be executed to be swapped in from files again). > We may need to look at block congestion timeouts before invoking the oom killer in the page allocator. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-01 23:53 ` David Rientjes @ 2010-02-02 21:10 ` Lubos Lunak 2010-02-03 1:41 ` David Rientjes 0 siblings, 1 reply; 53+ messages in thread From: Lubos Lunak @ 2010-02-02 21:10 UTC (permalink / raw) To: David Rientjes Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina On Tuesday 02 of February 2010, David Rientjes wrote: > On Mon, 1 Feb 2010, Lubos Lunak wrote: > > Hello, > I don't quite understand how you can say the oom killer is "completely > useless and harmful." It certainly fulfills its purpose, which is to kill > a memory hogging task so that a page allocation may succeed when reclaim > has failed to free any memory. I started the sentence with "here". And if you compare my description of what happens with the 5 goals listed in oom_kill.c, you can see that it fails all of them except for 2, and even in that case it is much better to simply reboot the computer. So that is why OOM killer _here_ is completely useless and harmful, as even panic_on_oom does a better job. I'm not saying it's so everywhere, presumably it works somewhere when somebody has written it this way, but since some of the design decisions appear to be rather poor for desktop systems, "here" is probably not really limited only to my computer either. > > The process tree looks roughly like this: > > > > init > > |- kdeinit > > | |- ksmserver > > | | |- kwin > > | |- <other> > > |- konsole > > |- make > > |- sh > > | |- meinproc4 > > |- sh > > | |- meinproc4 > > |- <etc> > > > > What happens is that OOM killer usually selects either ksmserver (KDE > > session manager) or kdeinit (KDE master process that spawns most KDE > > processes). Note that in either case OOM killer does not reach the point > > of killing the actual offender - it will randomly kill in the tree under > > kdeinit until it decides to kill ksmserver, which means terminating the > > desktop session. 
As konsole is a KUniqueApplication, it forks into > > background and gets reparented to init, thus getting away from the > > kdeinit subtree. Since the memory pressure is distributed among several > > meinproc4 processes, the badness does not get summed up in its make > > grandparent, as badness() does this only for direct parents. > > There's no randomness involved in selecting a task to kill; That was rather a figure of speech, but even if you want to take it literally, then from the user's point of view it is random. Badness of kdeinit depends on the number of children it has spawned, badness of ksmserver depends for example on the number and size of windows open (as its child kwin is a window and compositing manager). Not that it really matters - the net result is that OOM killer usually decides to kill kdeinit or ksmserver, starts killing their children, vital KDE processes, and since the offenders are not among them, it ends up either terminating the whole session by killing ksmserver or killing enough vital processes there to free enough memory for the offenders to finish their work cleanly. > The process tree that you posted shows a textbook case for using > /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is > protected from getting selected for oom kill. In your own words, this > "spawns most KDE processes," so it's an ideal place to set an oom_adj > value of OOM_DISABLE since that value is inheritable to children and, > thus, all children are implicitly protected as well. Yes, it's a textbook case, sadly textbook cases are theory and not practice. I didn't mention it in my first mail to keep it shorter, but we have actually tried it. First of all, it's rather cumbersome - as it requires root priviledges, there is one wrapped needed for setuid and another one to avoid setuid side-effects, moreover the setuid root process needs to stay running and unset the protection on all children, or it'd be useless again. 
Worse, it worked for about a year or two and now it has only shifted the problem elsewhere and that's it. We now protect kdeinit, which means the OOM killer's choice will very likely ksmserver then. Ok, so let's say now we start protecting also ksmserver, that's some additional hassle setting it up, but that's doable. Now there's a good chance the OOM killer's choice will be kwin (as a compositing manager it can have quite large mappings because of graphics drivers). So ok, we need to protect the window manager, but since that's not a hardcoded component like ksmserver, that's even more hassle. And, after all that, OOM killer will simply detect yet another innocent process. I didn't mention that, but the memory statistics I presented for one selected KDE process in my original mail were actually for an ordinary KDE application - Konqueror showing a web page. Yet, as you can read in my original mail, even though it used only about half memory of what the offender used, it still scored almost tripple of its badness score. So unless you would suggest I implement my own dynamic badness handling in userspace, which I hope we can all agree is nonsense, then oom_adj is a cumbersome non-solution here. It may work in some setups, but it doesn't for the desktop. > Using VmSize, however, allows us to define the most important task to kill > for the oom killer: memory leakers. Memory leakers are the single most > important tasks to identify with the oom killer and aren't obvious when > using rss because leaked memory does not stay resident in RAM. I > understand your system may not have such a leaker and it is simply > overcommitted on a 2GB machine, but using rss loses that ability. Interesting point. Am I getting it right that you're saying that VmRSS is unsuitable because badness should take into account not only the RAM used by the process but also the swap space used by the process? 
If yes, then this rather brings up the question why doesn't the badness calculation then do it and uses VmSize instead? I mean, as already demonstrated in the original mail, VmSize clearly can be very wrong as a representation of memory used. I would actually argue that VmRSS is still better, as the leaker would eventually fill the swap and start taking up RAM, but either way, how about this then? - points = mm->total_vm; + points = get_mm_rss(mm) + get_mm_space_used_in_swap_but_not_in_other_places_like_file_backing(mm); (I don't know if there's a function doing the latter or how to count it. Probably not exactly trivial given that I have experience with /proc/*/stat*-using tools like top reporting rather wrong numbers for swap usage of processes.) > It also makes tuning oom killer priorities with /proc/pid/oom_adj almost > impossible since a task's rss is highly dynamic and we cannot speculate on > the state of the VM at the time of oom. I see. However using VmRSS and swap space together avoids this. > > In other words, use VmRSS for measuring memory usage instead of VmSize, > > and remove child accumulating. > > > > I hope the above is good enough reason for the first change. VmSize > > includes things like read-only mappings, memory mappings that is actually > > unused, mappings backed by a file, mappings from video drivers, and so > > on. VmRSS is actual real memory used, which is what mostly matters here. > > While it may not be perfect, it is certainly an improvement. > > It's not for a large number of users, You mean, besides all desktop users? In my experience desktop machines start getting rather useless when their swap usage starts nearing the total RAM amount, so swap should not be that significant. Moreover, again, it's still better than VmSize, which can be wildly inaccurate. On my desktop system it definitely is. Hmm, maybe you're thinking server setup and that's different, I don't know. Does the kernel have any "desktop mode"? 
I wouldn't mind if VmSize was used on servers if you insist it is better, but on desktop VmSize is just plain wrong. And, again, I think VmRSS+InSwap is better then either. > the consumer of the largest amount > of rss is not necessarily the task we always want to kill. Just because > an order-0 page allocation fails does not mean we want to kill the task > that would free the largest amount of RAM. It's still much better than killing the task that would free the largest amount of address space. And I cannot think of any better metric than VmRSS+InSwap. Can you? > I understand that KDE is extremely important to your work environment and > if you lose it, it seems like a failure of Linux and the VM. However, the > kernel cannot possibly know what applications you believe to be the most > important. For that reason, userspace is able to tune the badness() score > by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit. You > have the ability to protect KDE from getting oom killed, you just need to > use it. As already explained, I can't. Besides, I'm not expecting a miracle, I simply expect the kernel to kill the process that takes up the most memory, and the kernel can possibly know that, it just doesn't do it. What other evidence do you want to be shown that badness calculated for two processes on their actual memory usage differs by a multiple of 5 or more? [snipped description of how oom_adj should help when it in fact wouldn't] > > I also have problems finding a case where the child accounting would > > actually help. I mean, in practice, I can certainly come up with > > something in theory, and this looks to me like a solution to a very > > synthesized problem. In which realistic case will one process launch a > > limited number of children, where all of them will consume memory, but > > just killing the children one by one won't avoid the problem reasonably? 
> > This is unlikely to avoid a forkbomb, as in that case the number of
> > children will be the problem. It is not necessary for just one child
> > misbehaving and being restarted, nor will it work there. So what is that
> > supposed to fix, and is it more likely than the case of a process
> > launching several unrelated children?
>
> Right, I believe Kame is working on a forkbomb detector that would replace
> this logic.

Until then, can we dump the current code? Because I have provided one case where it makes things worse, and nobody has provided any case where it makes things better, or any other justification for its existence. There's no point in keeping code for which nobody knows how it improves things (in reality, not in some textbook case).

And, in case the justification for it is something like "Apache", can we fast-forward to my improved suggestion to limit this only to children that are forked but not exec()-ed?

-- 
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-02 21:10 ` Lubos Lunak @ 2010-02-03 1:41 ` David Rientjes 2010-02-03 1:52 ` KAMEZAWA Hiroyuki 2010-02-03 22:54 ` Improving OOM killer Lubos Lunak 0 siblings, 2 replies; 53+ messages in thread From: David Rientjes @ 2010-02-03 1:41 UTC (permalink / raw) To: Lubos Lunak Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina, KAMEZAWA Hiroyuki On Tue, 2 Feb 2010, Lubos Lunak wrote: > > > init > > > |- kdeinit > > > | |- ksmserver > > > | | |- kwin > > > | |- <other> > > > |- konsole > > > |- make > > > |- sh > > > | |- meinproc4 > > > |- sh > > > | |- meinproc4 > > > |- <etc> > > > > > > What happens is that OOM killer usually selects either ksmserver (KDE > > > session manager) or kdeinit (KDE master process that spawns most KDE > > > processes). Note that in either case OOM killer does not reach the point > > > of killing the actual offender - it will randomly kill in the tree under > > > kdeinit until it decides to kill ksmserver, which means terminating the > > > desktop session. As konsole is a KUniqueApplication, it forks into > > > background and gets reparented to init, thus getting away from the > > > kdeinit subtree. Since the memory pressure is distributed among several > > > meinproc4 processes, the badness does not get summed up in its make > > > grandparent, as badness() does this only for direct parents. > > > > There's no randomness involved in selecting a task to kill; > > That was rather a figure of speech, but even if you want to take it > literally, then from the user's point of view it is random. Badness of > kdeinit depends on the number of children it has spawned, badness of > ksmserver depends for example on the number and size of windows open (as its > child kwin is a window and compositing manager). 
>

As I've mentioned, I believe Kame (now cc'd) is working on replacing the heuristic that adds the VM size for children into the parent task's badness score with a forkbomb detector. That should certainly help to reduce oom killing of parents whose children consume a lot of memory, especially in your scenario where kdeinit is responsible for forking most KDE processes.

> Not that it really matters - the net result is that OOM killer usually
> decides to kill kdeinit or ksmserver, starts killing their children, vital
> KDE processes, and since the offenders are not among them, it ends up either
> terminating the whole session by killing ksmserver or killing enough vital
> processes there to free enough memory for the offenders to finish their work
> cleanly.
>

The kernel cannot possibly know what you consider a "vital" process; for that understanding you need to tell it using the very powerful /proc/pid/oom_adj tunable. I suspect that if you were to protect all of kdeinit's children by patching it to be OOM_DISABLE, so that all threads it forks inherit that value, you'd actually see much improved behavior.

I'd also encourage you to talk to the KDE developers to ensure that proper precautions are taken to protect it in such conditions, since people who use such desktop environments typically don't want them to be sacrificed for memory. The kernel, however, can only provide a mechanism for users to define what they believe to be critical, since that will vary widely depending on how Linux is used for desktops, servers, and embedded devices. Our servers, for example, have vital threads running on each machine and they need to be protected in the same way; they set themselves to be OOM_DISABLE.

On a side note, it's only possible to lower an oom_adj value for a thread if you have CAP_SYS_RESOURCE, so that would be a prerequisite to setting OOM_DISABLE for any thread.
If that's not feasible, there's a workaround: user tasks can always increase their own oom_adj value so that they are preferred in oom conditions. This protects the vital tasks by preempting them from getting oom killed, without any special capability.

> > The process tree that you posted shows a textbook case for using
> > /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is
> > protected from getting selected for oom kill. In your own words, this
> > "spawns most KDE processes," so it's an ideal place to set an oom_adj
> > value of OOM_DISABLE since that value is inheritable to children and,
> > thus, all children are implicitly protected as well.
>
> Yes, it's a textbook case; sadly, textbook cases are theory and not practice.
> I didn't mention it in my first mail to keep it shorter, but we have actually
> tried it. First of all, it's rather cumbersome - as it requires root
> privileges, there is one wrapper needed for setuid and another one to avoid
> setuid side-effects; moreover, the setuid root process needs to stay running
> and unset the protection on all children, or it'd be useless again.
>

It only requires CAP_SYS_RESOURCE, actually, and your apparent difficulty in writing to a kernel tunable is outside the scope of this discussion, unfortunately. The bottom line is that the kernel provides a very well defined interface for tuning the oom killer's badness heuristic so that userspace can define what is "vital," whether that is kdeinit, ksmserver, or any other thread. The choice needs to be made to use it, however.

> Worse, it worked for about a year or two and now it has only shifted the
> problem elsewhere and that's it. We now protect kdeinit, which means the OOM
> killer's choice will very likely be ksmserver then. Ok, so let's say now we
> start protecting also ksmserver; that's some additional hassle setting it up,
> but that's doable.
> Now there's a good chance the OOM killer's choice will be
> kwin (as a compositing manager it can have quite large mappings because of
> graphics drivers). So ok, we need to protect the window manager, but since
> that's not a hardcoded component like ksmserver, that's even more hassle.
>

No, you don't need to protect every KDE process from the oom killer unless it is going to be a contender for selection. You could certainly do so for completeness, but it shouldn't be required unless the nature of the thread demands it: either because it forks many vital tasks (kdeinit), or because its first-generation children's memory consumption can't be known, whether that is because it depends on how many children it can fork or because their memory consumption is influenced by the user's work.

> > Using VmSize, however, allows us to define the most important task to kill
> > for the oom killer: memory leakers. Memory leakers are the single most
> > important tasks to identify with the oom killer and aren't obvious when
> > using rss because leaked memory does not stay resident in RAM. I
> > understand your system may not have such a leaker and it is simply
> > overcommitted on a 2GB machine, but using rss loses that ability.
>
> Interesting point. Am I getting it right that you're saying that VmRSS is
> unsuitable because badness should take into account not only the RAM used by
> the process but also the swap space used by the process? If yes, then this
> rather brings up the question of why the badness calculation doesn't do
> that, instead of using VmSize?
>

We don't want to discount VM_RESERVED because the memory it represents cannot be swapped, which is also an important indicator of the overall memory usage of any particular application.
I understand that your particular use case may benefit from waiting on block congestion during the page allocation that triggered the oom killer without making any progress in direct reclaim, but discounting VM_RESERVED because of its (sometimes erroneous) use with VM_IO isn't an appropriate indicator of the thread's memory use, since we're not considering the amount mapped for I/O.

> I mean, as already demonstrated in the original mail, VmSize clearly can be
> very wrong as a representation of memory used. I would actually argue that
> VmRSS is still better, as the leaker would eventually fill the swap and start
> taking up RAM, but either way, how about this then?
>

We simply can't afford to wait for a memory leaker to fill up all of swap before the heuristic identifies it; that would be extremely slow depending on the speed of I/O, and it would actually increase the probability of needlessly killing the wrong task, because other tasks would get killed for their large rss while the memory leaker fills up swap. That heuristic works against finding the memory leaker, since the leaker will always keep its rss low: leaked memory is never touched again once it is leaked.

> Hmm, maybe you're thinking of a server setup and that's different, I don't
> know. Does the kernel have any "desktop mode"? I wouldn't mind if VmSize
> was used on servers if you insist it is better, but on desktop VmSize is
> just plain wrong. And, again, I think VmRSS+InSwap is better than either.
>

The heuristics are always well debated in this forum and there's little chance that we'll ever settle on a single formula that works for all possible use cases. That makes oom_adj even more vital to the overall efficiency of the oom killer; I really hope you start to use it to your advantage.
There are a couple of things that we'll always agree on: it's needless to kill a task whose set of allowed nodes differs from that of the triggering page allocation, for example, since the triggering task cannot use that memory even if freed, and there should be some penalty for tasks with a shorter uptime, to break ties. I agree, however, that most of the penalties as currently implemented aren't scientific and don't appropriately adjust the score in those cases. Dividing the entire VM size by 2, for example, because the task has a certain trait isn't an ideal formula.

There's also at least one thing that we'd both like to remove: the factoring of each child's VM size into the badness score of the parent as a way of detecting a forkbomb. Kame has been working on the forkbomb detector specifically to address this issue, so we should stay tuned.

The one thing that I must stress, however, is the need for us to be able to define what a "vital" task is and what a "memory leaker" is. Both of those are currently possible with oom_adj, so we cannot lose this functionality. Changing the baseline to rss would be highly dynamic and would not allow us to accurately set an oom_adj value to define when a task is rogue on a shared system.

That said, I don't think a complete rewrite of the badness function would be a non-starter: we could easily make all tasks scale to a badness score ranging from 0 (never kill) to 1000 (always kill), for example. We just need to propose a sane set of heuristics so that we don't lose the configurability from userspace.

I've also disagreed before with always killing the most memory-consuming task whenever we have an oom. I agree with you that sometimes that task may be the most vital to the system, and setting it to be OOM_DISABLE should not always be required for a simple order-0 allocation that fails somewhere, especially if it's ~__GFP_NOFAIL.
What I've recommended to you would work with current mainline, and at least with kernels released within the past few years, but I think there are many more changes we can look to make in the future.

> > the consumer of the largest amount
> > of rss is not necessarily the task we always want to kill. Just because
> > an order-0 page allocation fails does not mean we want to kill the task
> > that would free the largest amount of RAM.
>
> It's still much better than killing the task that would free the largest
> amount of address space. And I cannot think of any better metric than
> VmRSS+InSwap. Can you?
>

Yes: baselining on total_vm, so that we understand the total amount of memory that will no longer be mapped to that particular process if it is killed, regardless of whether it's a shared library or not, and so that we can define vital tasks and memory leakers from userspace.

> > I understand that KDE is extremely important to your work environment and
> > if you lose it, it seems like a failure of Linux and the VM. However, the
> > kernel cannot possibly know what applications you believe to be the most
> > important. For that reason, userspace is able to tune the badness() score
> > by writing to /proc/pid/oom_adj as I've suggested you do for kdeinit. You
> > have the ability to protect KDE from getting oom killed, you just need to
> > use it.
>
> As already explained, I can't. Besides, I'm not expecting a miracle, I simply
> expect the kernel to kill the process that takes up the most memory, and the
> kernel can possibly know that, it just doesn't do it. What other evidence do
> you want to be shown that badness calculated for two processes on their
> actual memory usage differs by a multiple of 5 or more?
>

This is very likely related to the child VM sizes being accumulated in the parent's badness score; we'll have to see how your results vary once the forkbomb detector is merged.
I disagree that we always need to kill the application consuming the most memory, though; we need to determine when it's better to simply fail a ~__GFP_NOFAIL allocation, and when killing a smaller, lower-priority task may be more beneficial to the user's work.

> > Right, I believe Kame is working on a forkbomb detector that would replace
> > this logic.
>
> Until then, can we dump the current code? Because I have provided one case
> where it makes things worse and nobody has provided any case where it makes
> things better or any other justification for its existence. There's no point
> in keeping code for which nobody knows how it improves things (in reality,
> not some textbook case).
>

First of all, I don't think the forkbomb detector is that far in the future, since a preliminary version of it was also posted, but I think we also need a way to address those particular cases in the heuristic. There are real-life cases where out-of-memory conditions occur specifically because of forkbombs, so leaving the issue unaddressed in the heuristic is not appropriate.

> And, in case the justification for it is something like "Apache", can we
> fast-forward to my improved suggestion to limit this only to children that
> are forked but not exec()-ed?
>

The amount of memory you'd free in simply killing such tasks would be minimal; I don't think that's an appropriate heuristic if they all share their memory with the parent.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 1:41 ` David Rientjes @ 2010-02-03 1:52 ` KAMEZAWA Hiroyuki 2010-02-03 2:12 ` David Rientjes 2010-02-03 22:54 ` Improving OOM killer Lubos Lunak 1 sibling, 1 reply; 53+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-02-03 1:52 UTC (permalink / raw) To: David Rientjes Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina On Tue, 2 Feb 2010 17:41:41 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Tue, 2 Feb 2010, Lubos Lunak wrote: > > > > > init > > > > |- kdeinit > > > > | |- ksmserver > > > > | | |- kwin > > > > | |- <other> > > > > |- konsole > > > > |- make > > > > |- sh > > > > | |- meinproc4 > > > > |- sh > > > > | |- meinproc4 > > > > |- <etc> > > > > > > > > What happens is that OOM killer usually selects either ksmserver (KDE > > > > session manager) or kdeinit (KDE master process that spawns most KDE > > > > processes). Note that in either case OOM killer does not reach the point > > > > of killing the actual offender - it will randomly kill in the tree under > > > > kdeinit until it decides to kill ksmserver, which means terminating the > > > > desktop session. As konsole is a KUniqueApplication, it forks into > > > > background and gets reparented to init, thus getting away from the > > > > kdeinit subtree. Since the memory pressure is distributed among several > > > > meinproc4 processes, the badness does not get summed up in its make > > > > grandparent, as badness() does this only for direct parents. > > > > > > There's no randomness involved in selecting a task to kill; > > > > That was rather a figure of speech, but even if you want to take it > > literally, then from the user's point of view it is random. Badness of > > kdeinit depends on the number of children it has spawned, badness of > > ksmserver depends for example on the number and size of windows open (as its > > child kwin is a window and compositing manager). 
> >
>
> As I've mentioned, I believe Kame (now cc'd) is working on replacing the
> heuristic that adds the VM size for children into the parent task's
> badness score with a forkbomb detector.

I stopped that, as I mentioned. I'm heavily disappointed with myself and would like not to touch oom-killer things for a while.

I'd like to concentrate on memcg for a while, which I've starved for these 3 months.

Then, you don't need to CC me.

Bye,
-Kame

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 1:52 ` KAMEZAWA Hiroyuki @ 2010-02-03 2:12 ` David Rientjes 2010-02-03 2:12 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-03 2:12 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, KAMEZAWA Hiroyuki wrote:

> I stopped that as I mentioned. I'm heavily disappointed with myself and
> would like not to touch oom-killer things for a while.
>
> I'd like to concentrate on memcg for a while, which I've starved for these 3 months.
>
> Then, you don't need to CC me.
>

I'm disappointed to hear that; it would be helpful to get some consensus on the points that we can all agree on from different perspectives. I'll try to find some time to develop a solution that people will see as a move in the positive direction.

On a separate topic, do you have time to repost your sysctl cleanup for Andrew from http://marc.info/?l=linux-kernel&m=126457416729672 with the #ifndef CONFIG_MMU fix I mentioned?

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 2:12 ` David Rientjes @ 2010-02-03 2:12 ` KAMEZAWA Hiroyuki 2010-02-03 2:36 ` [patch] sysctl: clean up vm related variable declarations David Rientjes 0 siblings, 1 reply; 53+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-02-03 2:12 UTC (permalink / raw) To: David Rientjes Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina On Tue, 2 Feb 2010 18:12:41 -0800 (PST) David Rientjes <rientjes@google.com> wrote: > On Wed, 3 Feb 2010, KAMEZAWA Hiroyuki wrote: > > > I stopped that as I mentioned. I'm heavily disappointed with myself and > > would like not to touch oom-killer things for a while. > > > > I'd like to conentrate on memcg for a while, which I've starved for these 3 months. > > > > Then, you don't need to CC me. > > > > I'm disappointed to hear that, it would be helpful to get some consensus > on the points that we can all agree on from different perspectives. I'll > try to find some time to develop a solution that people will see as a move > in the positive direction. > > On a seperate topic, do you have time to repost your sysctl cleanup for > Andrew from http://marc.info/?l=linux-kernel&m=126457416729672 with the > #ifndef CONFIG_MMU fix I mentioned? > I'll not. Feel free to reuse fragments I posted. Thanks, -Kame ^ permalink raw reply [flat|nested] 53+ messages in thread
* [patch] sysctl: clean up vm related variable declarations 2010-02-03 2:12 ` KAMEZAWA Hiroyuki @ 2010-02-03 2:36 ` David Rientjes 2010-02-03 8:07 ` KOSAKI Motohiro 2010-02-03 8:17 ` Balbir Singh 0 siblings, 2 replies; 53+ messages in thread From: David Rientjes @ 2010-02-03 2:36 UTC (permalink / raw) To: Andrew Morton Cc: KAMEZAWA Hiroyuki, Balbir Singh, minchan.kim, linux-kernel, linux-mm From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Now, there are many "extern" declaration in kernel/sysctl.c. "extern" declaration in *.c file is not appreciated in general. And Hmm...it seems there are a few redundant declarations. Because most of sysctl variables are defined in its own header file, they should be declared in the same style, be done in its own *.h file. This patch removes some VM(memory management) related sysctl's variable declaration from kernel/sysctl.c and move them to proper places. [rientjes@google.com: #ifdef fixlet] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/mm.h | 5 +++++ include/linux/mmzone.h | 1 + include/linux/oom.h | 5 +++++ kernel/sysctl.c | 16 ++-------------- mm/mmap.c | 5 +++++ 5 files changed, 18 insertions(+), 14 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1305,6 +1305,7 @@ int in_gate_area_no_task(unsigned long addr); #define in_gate_area(task, addr) ({(void)task; in_gate_area_no_task(addr);}) #endif /* __HAVE_ARCH_GATE_AREA */ +extern int sysctl_drop_caches; int drop_caches_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, @@ -1347,5 +1348,9 @@ extern void shake_page(struct page *p, int access); extern atomic_long_t mce_bad_pages; extern int soft_offline_page(struct page *page, int flags); +#ifndef CONFIG_MMU +extern int sysctl_nr_trim_pages; +#endif + #endif /* __KERNEL__ */ 
#endif /* _LINUX_MM_H */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -760,6 +760,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int, extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1]; int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +extern int percpu_pagelist_fraction; /* for sysctl */ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, diff --git a/include/linux/oom.h b/include/linux/oom.h --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -43,5 +43,10 @@ static inline void oom_killer_enable(void) { oom_killer_disabled = false; } +/* for sysctl */ +extern int sysctl_panic_on_oom; +extern int sysctl_oom_kill_allocating_task; +extern int sysctl_oom_dump_tasks; + #endif /* __KERNEL__*/ #endif /* _INCLUDE_LINUX_OOM_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -50,6 +50,8 @@ #include <linux/ftrace.h> #include <linux/slow-work.h> #include <linux/perf_event.h> +#include <linux/mman.h> +#include <linux/oom.h> #include <asm/uaccess.h> #include <asm/processor.h> @@ -66,11 +68,6 @@ /* External variables not in a header file. 
*/ extern int C_A_D; extern int print_fatal_signals; -extern int sysctl_overcommit_memory; -extern int sysctl_overcommit_ratio; -extern int sysctl_panic_on_oom; -extern int sysctl_oom_kill_allocating_task; -extern int sysctl_oom_dump_tasks; extern int max_threads; extern int core_uses_pid; extern int suid_dumpable; @@ -79,14 +76,9 @@ extern unsigned int core_pipe_limit; extern int pid_max; extern int min_free_kbytes; extern int pid_max_min, pid_max_max; -extern int sysctl_drop_caches; -extern int percpu_pagelist_fraction; extern int compat_log; extern int latencytop_enabled; extern int sysctl_nr_open_min, sysctl_nr_open_max; -#ifndef CONFIG_MMU -extern int sysctl_nr_trim_pages; -#endif #ifdef CONFIG_RCU_TORTURE_TEST extern int rcutorture_runnable; #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */ @@ -197,10 +189,6 @@ extern struct ctl_table inotify_table[]; extern struct ctl_table epoll_table[]; #endif -#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT -int sysctl_legacy_va_layout; -#endif - extern int prove_locking; extern int lock_stat; diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -87,6 +87,11 @@ int sysctl_overcommit_ratio = 50; /* default is 50% */ int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT; struct percpu_counter vm_committed_as; +#ifdef HAVE_ARCH_PICK_MMAP_LAYOUT +/* Used by each architecture's private code and sysctl. */ +int sysctl_legacy_va_layout; +#endif + /* * Check that a process has enough memory to allocate a new virtual * mapping. 0 means there is enough memory for the allocation to ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch] sysctl: clean up vm related variable declarations 2010-02-03 2:36 ` [patch] sysctl: clean up vm related variable declarations David Rientjes @ 2010-02-03 8:07 ` KOSAKI Motohiro 2010-02-03 8:17 ` Balbir Singh 1 sibling, 0 replies; 53+ messages in thread From: KOSAKI Motohiro @ 2010-02-03 8:07 UTC (permalink / raw) To: David Rientjes Cc: kosaki.motohiro, Andrew Morton, KAMEZAWA Hiroyuki, Balbir Singh, minchan.kim, linux-kernel, linux-mm > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > Now, there are many "extern" declaration in kernel/sysctl.c. "extern" > declaration in *.c file is not appreciated in general. > And Hmm...it seems there are a few redundant declarations. > > Because most of sysctl variables are defined in its own header file, > they should be declared in the same style, be done in its own *.h file. > > This patch removes some VM(memory management) related sysctl's > variable declaration from kernel/sysctl.c and move them to > proper places. > > [rientjes@google.com: #ifdef fixlet] > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Signed-off-by: David Rientjes <rientjes@google.com> Looks good to me. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [patch] sysctl: clean up vm related variable declarations 2010-02-03 2:36 ` [patch] sysctl: clean up vm related variable declarations David Rientjes 2010-02-03 8:07 ` KOSAKI Motohiro @ 2010-02-03 8:17 ` Balbir Singh 1 sibling, 0 replies; 53+ messages in thread From: Balbir Singh @ 2010-02-03 8:17 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, KAMEZAWA Hiroyuki, minchan.kim, linux-kernel, linux-mm * David Rientjes <rientjes@google.com> [2010-02-02 18:36:42]: > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > Now, there are many "extern" declaration in kernel/sysctl.c. "extern" > declaration in *.c file is not appreciated in general. > And Hmm...it seems there are a few redundant declarations. > > Because most of sysctl variables are defined in its own header file, > they should be declared in the same style, be done in its own *.h file. > > This patch removes some VM(memory management) related sysctl's > variable declaration from kernel/sysctl.c and move them to > proper places. > > [rientjes@google.com: #ifdef fixlet] > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Signed-off-by: David Rientjes <rientjes@google.com> > --- > include/linux/mm.h | 5 +++++ > include/linux/mmzone.h | 1 + > include/linux/oom.h | 5 +++++ > kernel/sysctl.c | 16 ++-------------- > mm/mmap.c | 5 +++++ > 5 files changed, 18 insertions(+), 14 deletions(-) > Looks good to me -- Balbir ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 1:41 ` David Rientjes 2010-02-03 1:52 ` KAMEZAWA Hiroyuki @ 2010-02-03 22:54 ` Lubos Lunak 2010-02-04 0:00 ` David Rientjes 1 sibling, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-03 22:54 UTC (permalink / raw)
To: David Rientjes
Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina, KAMEZAWA Hiroyuki

Given that the badness() proposal I see in another of your mails uses get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize vs VmRSS argument and consider that argument irrelevant now. I will comment only on the suggested use of oom_adj on the desktop here. And I actually hope that if something reasonably similar to your badness() proposal replaces the current one, it will make any use of oom_adj unnecessary on the desktop in the usual case, so this may be irrelevant as well.

On Wednesday 03 of February 2010, David Rientjes wrote:
> On Tue, 2 Feb 2010, Lubos Lunak wrote:
> > Not that it really matters - the net result is that OOM killer usually
> > decides to kill kdeinit or ksmserver, starts killing their children,
> > vital KDE processes, and since the offenders are not among them, it ends
> > up either terminating the whole session by killing ksmserver or killing
> > enough vital processes there to free enough memory for the offenders to
> > finish their work cleanly.
>
> The kernel cannot possibly know what you consider a "vital" process, for
> that understanding you need to tell it using the very powerful
> /proc/pid/oom_adj tunable. I suspect if you were to protect all of
> kdeinit's children by patching it to be OOM_DISABLE so that all threads it
> forks will inherit that value you'd actually see much improved behavior.

No. Almost everything in KDE is spawned by kdeinit, so everything would get the adjustment, which means nothing would in practice get the adjustment.
> I'd also encourage you to talk to the KDE developers to ensure that proper
> precautions are taken to protect it in such conditions since people who
> use such desktop environments typically don't want them to be sacrificed
> for memory.

I am a KDE developer; it's written in my signature. And I've already talked enough to the KDE developer who wrote the oom_adj code that's already there, as that's also me. I don't know kernel internals, but that doesn't mean I'm completely clueless about the topic of the discussion I've started.

> > Worse, it worked for about a year or two and now it has only shifted the
> > problem elsewhere and that's it. We now protect kdeinit, which means the
> > OOM killer's choice will very likely be ksmserver then. Ok, so let's say
> > now we start protecting also ksmserver, that's some additional hassle
> > setting it up, but that's doable. Now there's a good chance the OOM
> > killer's choice will be kwin (as a compositing manager it can have quite
> > large mappings because of graphics drivers). So ok, we need to protect
> > the window manager, but since that's not a hardcoded component like
> > ksmserver, that's even more hassle.
>
> No, you don't need to protect every KDE process from the oom killer unless
> it is going to be a contender for selection. You could certainly do so
> for completeness, but it shouldn't be required unless the nature of the
> thread demands it such that it forks many vital tasks (kdeinit) or its
> first-generation children's memory consumption can't be known either
> because it depends on how many children it can fork or their memory
> consumption is influenced by the user's work.

1) I think you missed that I said that with the current algorithm every KDE application can potentially be a contender for selection, and I provided numbers to demonstrate that in a selected case. Just because such an application is not vital does not mean it's good for it to get killed instead of an obvious offender.
2) You probably do not realize the complexity involved in using oom_adj in a desktop. Even when doing that manually I would have some difficulty finding the right setup for my own desktop use. It would probably be virtually impossible to write code that would do it at least somewhat right with all the various, widely differing desktop setups that change dynamically. 3) oom_adj is ultimately just a kludge to handle special cases where the heuristic doesn't get it right for whatever strange reason. But even you yourself in another mail presented a heuristic that I believe would make any use of oom_adj on the desktop unnecessary in the usual cases. The usual desktop is not a special case. > The heuristics are always well debated in this forum and there's little > chance that we'll ever settle on a single formula that works for all > possible use cases. That makes oom_adj even more vital to the overall > efficiency of the oom killer, I really hope you start to use it to your > advantage. I really hope your latest badness() heuristic proposal allows us to dump even the oom_adj use we already have. -- Lubos Lunak openSUSE Boosters team, KDE developer l.lunak@suse.cz , l.lunak@kde.org ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 22:54 ` Improving OOM killer Lubos Lunak @ 2010-02-04 0:00 ` David Rientjes 0 siblings, 0 replies; 53+ messages in thread From: David Rientjes @ 2010-02-04 0:00 UTC (permalink / raw) To: Lubos Lunak Cc: linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina, KAMEZAWA Hiroyuki On Wed, 3 Feb 2010, Lubos Lunak wrote: > > Given that the badness() proposal I see in your other mail uses > get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize > vs VmRSS argument and considered that argument irrelevant now. The argument was never that rss should not be factored into the heuristic; the argument was to prevent the loss of functionality of oom_adj and of being able to define memory leakers from userspace. With my proposal, I believe the new semantics of oom_adj are even clearer than before and allow users to either discount or bias a task with a quantity that they are familiar with: memory. My rough draft was written in a mail editor, so it's completely untested and even has a couple of flaws: we need to discount free hugetlb memory from allowed nodes, we need to intersect the passed nodemask with current's cpuset, etc. > I will comment > only on the suggested use of oom_adj on the desktop here. And actually I hope > that if something reasonably similar to your badness() proposal replaces the > current one it will make any use of oom_adj not needed on the desktop in the > usual case, so this may be irrelevant as well. > If you define "on the desktop" performance of the oom killer merely as protecting a windowing environment, then it should be helpful. I'd still recommend using OOM_DISABLE for those tasks, though, because I agree that for users in that environment, KDE getting oom killed is just not a viable solution. > > The kernel cannot possibly know what you consider a "vital" process, for > that understanding you need to tell it using the very powerful > /proc/pid/oom_adj tunable.
I suspect if you were to protect all of > > kdeinit's children by patching it to be OOM_DISABLE so that all threads it > > forks will inherit that value you'd actually see much improved behavior. > > No. Almost everything in KDE is spawned by kdeinit, so everything would get > the adjustment, which means nothing would in practice get the adjustment. > It depends on whether you change the oom_adj of children that you no longer want to protect which have been forked from kdeinit. > > I'd also encourage you to talk to the KDE developers to ensure that proper > > precautions are taken to protect it in such conditions since people who > > use such desktop environments typically don't want them to be sacrificed > > for memory. > > I am a KDE developer, it's written in my signature. And I've already talked > enough to the KDE developer who has done the oom_adj code that's already > there, as that's also me. I don't know kernel internals, but that doesn't > mean I'm completely clueless about the topic of the discussion I've started. > Then I'd recommend that you protect those tasks with OOM_DISABLE, otherwise they will always be candidates for oom kill; the only way to explicitly prevent that is by changing oom_adj or moving it to its own memory controller cgroup. A kernel oom heuristic that is implemented for a wide variety of platforms, including desktops, servers, and embedded devices, will never identify KDE as a vital task that cannot possibly be killed unless you tell the kernel it has that priority. Whether you choose to use that power or not is up to the KDE team. > 1) I think you missed that I said that every KDE application with the current > algorithm can be potentially a contender for selection, and I provided > numbers to demonstrate that in a selected case. Just because such application > is not vital does not mean it's good for it to get killed instead of an > obvious offender.
> This is exaggerating the point quite a bit, I don't think every single KDE thread is going to have a badness() score that is higher than all other system tasks all the time. I think that there are the likely candidates that you've identified (kdeinit, ksmserver, etc) that are much more prone to high badness() scores given their total_vm size and the number of children they fork, but I don't think this is representative of every KDE thread. > 2) You probably do not realize the complexity involved in using oom_adj in a > desktop. Even when doing that manually I would have some difficulty finding > the right setup for my own desktop use. It'd be probably virtually impossible > to write code that would do it at least somewhat right with all the widely > differing various desktop setups that dynamically change. > Used in combination with /proc/pid/oom_score, it gives you a pretty good snapshot of how oom killer priorities look at any moment in time. In your particular use case, however, you seem to be arguing from a perspective of only protecting certain tasks that you've identified from being oom killed for desktop environments, namely KDE. For that, there is no confusion to be had: use OOM_DISABLE. For server environments that I'm also concerned about, the oom_adj range is much more important to define a killing priority when used in combination with cpusets. > 3) oom_adj is ultimately just a kludge to handle special cases where the > heuristic doesn't get it right for whatever strange reason. But even you > yourself in another mail presented a heuristic that I believe would make any > use of oom_adj on the desktop unnecessary in the usual cases. The usual > desktop is not a special case. > The kernel will _always_ need user input into which tasks it believes to be vital. For you, that's KDE. For me, that's one of our job schedulers. 
> > The heuristics are always well debated in this forum and there's little > > chance that we'll ever settle on a single formula that works for all > > possible use cases. That makes oom_adj even more vital to the overall > > efficiency of the oom killer, I really hope you start to use it to your > > advantage. > > I really hope your latest badness() heuristics proposal allows us to dump > even the oom_adj use we already have. > For your environment, I hope the same. In production servers we'll still need the ability to tune /proc/pid/oom_adj to define memory leakers and tasks using far more memory than expected, so perhaps my rough draft can be a launching pad into a positive discussion about the future of the heuristic based on consensus and input from all impacted parties. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-01 22:02 Improving OOM killer Lubos Lunak 2010-02-01 23:53 ` David Rientjes @ 2010-02-03 7:50 ` KOSAKI Motohiro 2010-02-03 9:40 ` David Rientjes 2010-02-03 8:57 ` Balbir Singh 2010-02-03 14:49 ` Rik van Riel 3 siblings, 1 reply; 53+ messages in thread From: KOSAKI Motohiro @ 2010-02-03 7:50 UTC (permalink / raw) To: Lubos Lunak Cc: kosaki.motohiro, linux-mm, linux-kernel, Andrew Morton, David Rientjes, Balbir Singh, Nick Piggin, Jiri Kosina > ===== > --- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100 > +++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100 > @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct > /* > * The memory size of the process is the basis for the badness. > */ > - points = mm->total_vm; > + points = get_mm_rss(mm); > > /* > * After this unlock we can no longer dereference local variable `mm' > @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct > return ULONG_MAX; > > /* > - * Processes which fork a lot of child processes are likely > - * a good choice. We add half the vmsize of the children if they > - * have an own mm. This prevents forking servers to flood the > - * machine with an endless amount of children. In case a single > - * child is eating the vast majority of memory, adding only half > - * to the parents will make the child our kill candidate of choice. > - */ > - list_for_each_entry(child, &p->children, sibling) { > - task_lock(child); > - if (child->mm != mm && child->mm) > - points += child->mm->total_vm/2 + 1; > - task_unlock(child); > - } > - > - /* > * CPU time is in tens of seconds and run time is in thousands > * of seconds. There is no particular reason for this other than > * that it turned out to work very well in practice. > ===== > > In other words, use VmRSS for measuring memory usage instead of VmSize, and > remove child accumulating. > > I hope the above is good enough reason for the first change. 
VmSize includes > things like read-only mappings, memory mappings that is actually unused, > mappings backed by a file, mappings from video drivers, and so on. VmRSS is > actual real memory used, which is what mostly matters here. While it may not > be perfect, it is certainly an improvement. > > The second change should be done on the basis that it does more harm than > good. In this specific case, it does not help to identify the source of the > problem, and it incorrectly identifies kdeinit as the problem solely on the > basis that it spawned many other processes. I think it's already quite hinted > that this is a problem by the fact that you had to add a special protection > for init - any session manager, process launcher or even xterm used for > launching apps is yet another init. > > I also have problems finding a case where the child accounting would actually > help. I mean, in practice, I can certainly come up with something in theory, > and this looks to me like a solution to a very synthesized problem. In which > realistic case will one process launch a limited number of children, where > all of them will consume memory, but just killing the children one by one > won't avoid the problem reasonably? This is unlikely to avoid a forkbomb, as > in that case the number of children will be the problem. It is not necessary > for just one children misbehaving and being restarted, nor will it work > there. So what is that supposed to fix, and is it more likely than the case > of a process launching several unrelated children? > > If the children accounting is supposed to handle cases like forked children > of Apache, then I suggest it is adjusted only to count children that have > been forked from the parent but there has been no exec(). I'm afraid I don't > know how to detect that. > > > When running a kernel with these changes applied, I can safely do the > above-described case of running parallel doc generation in KDE. 
No clearly > innocent process is selected for killing, the first choice is always an > offender. > > Moreover, the remedy is almost instant, there is only a fraction of a second > when the machine is overloaded by the I/O of swapping pages in and out (I do > not use swap, but there is a large amount of memory used by read-only > mappings of binaries, libraries or various other files that is in the > original case rendering the machine unresponsive - I assume this is because > the kernel tries to kill an innocent process, but the offenders immediately > consume anything that is freed, requiring even memory used by code that is to > be executed to be swapped in from files again). > > I consider the patches to be definite improvements, so if they are ok, I will > format them as necessary. Now, what is the catch? Personally, I think your use case represents the typical desktop, and Linux has to work fine in the typical desktop use case. /proc/pid/oom_adj never fits the desktop use case. In past discussions, I've agreed with many people, but I haven't reached agreement with David Rientjes on this topic. If you want to merge this patch, you need to persuade him. I can't help you, sorry. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 7:50 ` KOSAKI Motohiro @ 2010-02-03 9:40 ` David Rientjes 0 siblings, 0 replies; 53+ messages in thread From: David Rientjes @ 2010-02-03 9:40 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, Balbir Singh, Nick Piggin, Jiri Kosina On Wed, 3 Feb 2010, KOSAKI Motohiro wrote: > Personally, I think your use case represents the typical desktop and Linux > has to work fine in the typical desktop use case. /proc/pid/oom_adj never fits > the desktop use case. In past discussions, I've agreed with many people, but I haven't > reached agreement with David Rientjes on this topic. > Which point don't you agree with? I've agreed that the heuristic needs to be changed and since Kame has decided to abandon his oom killer work, I said that I would find time to develop a solution that would be based on consensus. I don't think that simply replacing the baseline with rss, rendering oom_adj practically useless for any purpose other than polarizing priorities, and removing any penalty for tasks that fork an egregious number of tasks is acceptable to all parties, though. When a desktop system runs a vital task that, at all costs, cannot possibly be oom killed such as KDE from the user perspective, is it really that outrageous of a request to set it to OOM_DISABLE? No, it's not. There are plenty of open source examples of applications that tune their own oom_adj values for that reason; userspace input into the oom killer's heuristic will always be an integral part of its function. I believe that we can reach consensus without losing the existing functionality that oom_adj provides, namely defining vital system tasks and memory leakers, and without this all-or-nothing attitude that insists we either go with rss as a baseline because "it doesn't select X first in my particular example" or you'll just take your ball and go home. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-01 22:02 Improving OOM killer Lubos Lunak 2010-02-01 23:53 ` David Rientjes 2010-02-03 7:50 ` KOSAKI Motohiro @ 2010-02-03 8:57 ` Balbir Singh 2010-02-03 12:10 ` Lubos Lunak 2010-02-03 14:49 ` Rik van Riel 3 siblings, 1 reply; 53+ messages in thread From: Balbir Singh @ 2010-02-03 8:57 UTC (permalink / raw) To: Lubos Lunak Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]: > > Hello, > > I'd like to suggest some changes to the OOM killer code that I believe > improve its behaviour - here it turns it from something completely useless and > harmful into something that works very well. I'm first posting changes for > review, I'll reformat the patch later as needed if approved. Given that it's > been like this for about a decade, I get the feeling there must be some > strange catch. > > My scenario is working in a KDE desktop session and accidentally running > parallel make in doc/ subdirectory of sources of a KDE module. As I use > distributed compiling, I run make with -j20 or more, but, as the tool used for > processing KDE documentation is quite memory-intensive, running this many > of them is more than enough to consume all the 2GB RAM in the machine. What > happens in that case is that the machine becomes virtually unresponsible, > where even Ctrl+Alt+F1 can take minutes, not to mention some action that'd > actually redeem the situation. If I wait long enough for something to happen, > which can be even hours, the action that ends the situation is killing one of > the most vital KDE processes, rendering the whole session useless and making > me lose all unsaved data. 
> > The process tree looks roughly like this: > > init > |- kdeinit > | |- ksmserver > | | |- kwin > | |- <other> > |- konsole > |- make > |- sh > | |- meinproc4 > |- sh > | |- meinproc4 > |- <etc> > > What happens is that OOM killer usually selects either ksmserver (KDE session > manager) or kdeinit (KDE master process that spawns most KDE processes). Note > that in either case OOM killer does not reach the point of killing the actual > offender - it will randomly kill in the tree under kdeinit until it decides > to kill ksmserver, which means terminating the desktop session. As konsole is > a KUniqueApplication, it forks into background and gets reparented to init, > thus getting away from the kdeinit subtree. Since the memory pressure is > distributed among several meinproc4 processes, the badness does not get > summed up in its make grandparent, as badness() does this only for direct > parents. > > Each meinproc4 process still uses a considerable amount of memory, so one > could assume that the situation would be solved by simply killing them one by > one, but it is not so because of using what I consider poor metric for > measuring memory usage - VmSize. VmSize, if I'm not mistaken, is the size of > the address space taken by the process, which in practice does not say much > about how much memory the process actually uses. For example, /proc/*/status > for one selected KDE process: > > VmPeak: 534676 kB > VmSize: 528340 kB > VmLck: 0 kB > VmHWM: 73464 kB > VmRSS: 73388 kB > VmData: 142332 kB > VmStk: 92 kB > VmExe: 44 kB > VmLib: 91232 kB > VmPTE: 716 kB > > And various excerpts from /proc/*/smaps for this process: > ... > 7f7b3f800000-7f7b40000000 rwxp 00000000 00:00 0 > Size: 8192 kB > Rss: 16 kB > Referenced: 16 kB > ... > 7f7b40055000-7f7b44000000 ---p 00000000 00:00 0 > Size: 65196 kB > Rss: 0 kB > Referenced: 0 kB > ... 
> 7f7b443cd000-7f7b445cd000 ---p 0001c000 08:01 > 790267 /usr/lib64/kde4/libnsplugin.so > Size: 2048 kB > Rss: 0 kB > Referenced: 0 kB > ... > 7f7b48300000-7f7b4927d000 rw-s 00000000 08:01 > 58690 /var/tmp/kdecache-seli/kpc/kde-icon-cache.data > Size: 15860 kB > Rss: 24 kB > Referenced: 24 kB > > I assume the first one is stack, search me what the second and third ones are > (there appears to be one such mapping as the third one for each .so used), > the last one is a mapping of a large cache file that's nevertheless rarely > used extensively and even then it's backed by a file. In other words, none of > this actually uses much of real memory, yet right now it's the process that > would get killed for using about 70MB memory, even though it's not the > offender. The offender scores only about 1/3 of its badness, even though it > uses almost the double amount of memory: > > VmPeak: 266508 kB > VmSize: 266504 kB > VmLck: 0 kB > VmHWM: 118208 kB > VmRSS: 118208 kB > VmData: 98512 kB > VmStk: 84 kB > VmExe: 60 kB > VmLib: 48944 kB > VmPTE: 536 kB > > And the offender is only 14th in the list of badness candidates. Speaking of > which, the following is quite useful for seeing all processes sorted by > badness: > > ls /proc/*/oom_score | grep -v self | sed 's/\(.*\)\/\(.*\)/echo -n "\1 "; \ > echo -n "`cat \1\/\2 2>\/dev\/null` "; readlink \1\/exe || echo/'| sh | \ > sort -nr +1 > > > Therefore, I suggest doing the following changes in mm/oom_kill.c : > > ===== > --- linux-2.6.31/mm/oom_kill.c.sav 2010-02-01 22:00:41.614838540 +0100 > +++ linux-2.6.31/mm/oom_kill.c 2010-02-01 22:01:08.773757932 +0100 > @@ -69,7 +69,7 @@ unsigned long badness(struct task_struct > /* > * The memory size of the process is the basis for the badness. 
> */ > - points = mm->total_vm; > + points = get_mm_rss(mm); > > /* > * After this unlock we can no longer dereference local variable `mm' > @@ -83,21 +83,6 @@ unsigned long badness(struct task_struct > return ULONG_MAX; > > /* > - * Processes which fork a lot of child processes are likely > - * a good choice. We add half the vmsize of the children if they > - * have an own mm. This prevents forking servers to flood the > - * machine with an endless amount of children. In case a single > - * child is eating the vast majority of memory, adding only half > - * to the parents will make the child our kill candidate of choice. > - */ > - list_for_each_entry(child, &p->children, sibling) { > - task_lock(child); > - if (child->mm != mm && child->mm) > - points += child->mm->total_vm/2 + 1; > - task_unlock(child); > - } > - > - /* > * CPU time is in tens of seconds and run time is in thousands > * of seconds. There is no particular reason for this other than > * that it turned out to work very well in practice. > ===== > > In other words, use VmRSS for measuring memory usage instead of VmSize, and > remove child accumulating. I am not sure of the impact of changing to RSS, although I've personally believed that RSS-based accounting is where we should go, but we need to consider the following 1. Total VM provides data about potentially swapped pages, overcommit, etc. 2. RSS alone is not sufficient, RSS does not account for shared pages, so we ideally need something like PSS. You could potentially use the memory resource controller for isolation, but that is a way of working around the problem by partitioning the system; there are other workarounds like oom-adjust, etc. I suspect the correct answer would depend on our answers to 1 and 2 and a lot of testing with any changes made. -- Balbir ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 8:57 ` Balbir Singh @ 2010-02-03 12:10 ` Lubos Lunak 2010-02-03 12:25 ` Balbir Singh 0 siblings, 1 reply; 53+ messages in thread From: Lubos Lunak @ 2010-02-03 12:10 UTC (permalink / raw) To: balbir Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wednesday 03 of February 2010, Balbir Singh wrote: > * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]: > > In other words, use VmRSS for measuring memory usage instead of VmSize, > > and remove child accumulating. > > I am not sure of the impact of changing to RSS, although I've > personally believed that RSS based accounting is where we should go, > but we need to consider the following > > 1. Total VM provides data about potentially swapped pages, Yes, I've already updated my proposal in another mail to switch from VmSize to VmRSS+InSwap. I don't know how to find out the second item in code, but at this point of the discussion that's just a detail. > overcommit, I don't understand how this matters. Overcommit is memory for which address space has been allocated but not actual memory, right? Then that's exactly what I'm claiming is wrong and am trying to reverse. Currently OOM killer takes this into account because it uses VmSize, but IMO it shouldn't - if a process does malloc(400M) but then it uses only a tiny fraction of that, in the case of memory shortage killing that process does not solve anything in practice. > etc. > 2. RSS alone is not sufficient, RSS does not account for shared pages, > so we ideally need something like PSS. Just to make sure I understand what you mean by "RSS does not account for shared pages" - you say that if a page is shared by 4 processes, then when calculating badness for them, only 1/4 of the page should be counted for each? Yes, I suppose so, that makes sense. That's more like fine-tuning at this point though, as long as there's no agreement that moving away from VmSize is an improvement.
> I suspect the correct answer would depend on our answers to 1 and 2 > and a lot of testing with any changes made. Testing - are there actually any tests for it, or do people just test random scenarios when they do changes? Also, I'm curious, what areas is the OOM killer actually generally known to work well in? I somehow get the feeling from the discussion here that people just tweak oom_adj until it works for them. -- Lubos Lunak openSUSE Boosters team, KDE developer l.lunak@suse.cz , l.lunak@kde.org ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 12:10 ` Lubos Lunak @ 2010-02-03 12:25 ` Balbir Singh 2010-02-03 15:00 ` Minchan Kim 2010-02-03 21:22 ` Lubos Lunak 0 siblings, 2 replies; 53+ messages in thread From: Balbir Singh @ 2010-02-03 12:25 UTC (permalink / raw) To: Lubos Lunak Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]: > On Wednesday 03 of February 2010, Balbir Singh wrote: > > * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]: > > > In other words, use VmRSS for measuring memory usage instead of VmSize, > > > and remove child accumulating. > > > > I am not sure of the impact of changing to RSS, although I've > > personally believed that RSS based accounting is where we should go, > > but we need to consider the following > > > > 1. Total VM provides data about potentially swapped pages, > > Yes, I've already updated my proposal in another mail to switch from VmSize > to VmRSS+InSwap. I don't know how to find out the second item in code, but at > this point of discussion that's just details. > I am yet to catch up with the rest of the thread. Thanks for the heads up. > > overcommit, > > I don't understand how this matters. Overcommit is memory for which address > space has been allocated but not actual memory, right? Then that's exactly > what I'm claiming is wrong and am trying to reverse. Currently OOM killer > takes this into account because it uses VmSize, but IMO it shouldn't - if a > process does malloc(400M) but then it uses only a tiny fraction of that, in > the case of memory shortage killing that process does not solve anything in > practice. We have a way of tracking committed address space, which is more sensible than just allocated memory and is used for tracking overcommit. I was suggesting that that might be a better approach. > > > etc. > > 2.
RSS alone is not sufficient, RSS does not account for shared pages, > > so we ideally need something like PSS. > > Just to make sure I understand what you mean by "RSS does not account for > shared pages" - you say that if a page is shared by 4 processes, then when > calculating badness for them, only 1/4 of the page should be counted for > each? Yes, I suppose so, that makes sense. Yes, that is what I am speaking of. > That's more like fine-tuning at > this point though, as long as there's no agreement that moving away from > VmSize is an improvement. > There is no easy way to calculate PSS today without walking the page tables, but some simplification there will make it a better and a more accurate metric. > > I suspect the correct answer would depend on our answers to 1 and 2 > > and a lot of testing with any changes made. > > Testing - are there actually any tests for it, or do people just test random > scenarios when they do changes? Also, I'm curious, what areas is the OOM > killer actually generally known to work well in? I somehow get the feeling > from the discussion here that people just tweak oom_adj until it works for > them. > I've mostly found the OOM killer to work well for me, but looking at the design and our discussions I know there need to be certain improvements. -- Balbir ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 12:25 ` Balbir Singh @ 2010-02-03 15:00 ` Minchan Kim 2010-02-03 16:06 ` Minchan Kim 2010-02-03 21:22 ` Lubos Lunak 1 sibling, 1 reply; 53+ messages in thread From: Minchan Kim @ 2010-02-03 15:00 UTC (permalink / raw) To: balbir Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wed, 2010-02-03 at 17:55 +0530, Balbir Singh wrote: > * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]: > > > On Wednesday 03 of February 2010, Balbir Singh wrote: > > > * Lubos Lunak <l.lunak@suse.cz> [2010-02-01 23:02:37]: > > > > In other words, use VmRSS for measuring memory usage instead of VmSize, > > > > and remove child accumulating. > > > > > > I am not sure of the impact of changing to RSS, although I've > > > personally believed that RSS based accounting is where we should go, > > > but we need to consider the following > > > > > > 1. Total VM provides data about potentially swapped pages, > > > > Yes, I've already updated my proposal in another mail to switch from VmSize > > to VmRSS+InSwap. I don't know how to find out the second item in code, but at > > this point of discussion that's just details. > > We have a swap count with Kame's mm-count-swap-usage.patch in mmotm. > I am yet to catch up with the rest of the thread. Thanks for the heads up. > > > > overcommit, > > > > I don't understand how this matters. Overcommit is memory for which address > > > space has been allocated but not actual memory, right? Then that's exactly > > > what I'm claiming is wrong and am trying to reverse. Currently OOM killer > > > takes this into account because it uses VmSize, but IMO it shouldn't - if a > > > process does malloc(400M) but then it uses only a tiny fraction of that, in > > > the case of memory shortage killing that process does not solve anything in > > > practice.
> > We have a way of tracking committed address space, which is more > sensible than just allocated memory and is used for tracking > overcommit. I was suggesting that that might be a better approach. Yes. It does make sense. At least total_vm doesn't account for the MAP_NORESERVE case. But unfortunately, it's tracked per-CPU, not per-process. > > > > > > etc. > > > 2. RSS alone is not sufficient, RSS does not account for shared pages, > > > so we ideally need something like PSS. > > > > Just to make sure I understand what you mean with "RSS does not account for > > shared pages" - you say that if a page is shared by 4 processes, then when > > calculating badness for them, only 1/4 of the page should be counted for > > each? Yes, I suppose so, that makes sense. > > Yes, that is what I am speaking of I agree. If we want to make RSS the basis of badness, it's one of the things we have to solve. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 15:00 ` Minchan Kim @ 2010-02-03 16:06 ` Minchan Kim 0 siblings, 0 replies; 53+ messages in thread From: Minchan Kim @ 2010-02-03 16:06 UTC (permalink / raw) To: balbir Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Thu, 2010-02-04 at 00:00 +0900, Minchan Kim wrote: > On Wed, 2010-02-03 at 17:55 +0530, Balbir Singh wrote: > > * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]: > >> > I don't understand how this matters. Overcommit is memory for which address > > > space has been allocated but not actual memory, right? Then that's exactly > > > what I'm claiming is wrong and am trying to reverse. Currently OOM killer > > > takes this into account because it uses VmSize, but IMO it shouldn't - if a > > > process does malloc(400M) but then it uses only a tiny fraction of that, in > > > the case of memory shortage killing that process does not solve anything in > > > practice. > > > > We have a way of tracking committed address space, which is more > > sensible than just allocated memory and is used for tracking > > overcommit. I was suggesting that that might be a better approach. > > Yes. It does make sense. At least total_vm doesn't account for the > MAP_NORESERVE case. But unfortunately, it's tracked per-CPU, not per-process. Sorry for the confusion. It was the opposite; I was half asleep :) The committed AS doesn't care about the MAP_NORESERVE case, but it definitely charges memory. So I think total_vm is better than committed AS if we really have to keep using the vmsize heuristic. But I am not sure that I understand your point about the overcommit policy. -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 12:25 ` Balbir Singh 2010-02-03 15:00 ` Minchan Kim @ 2010-02-03 21:22 ` Lubos Lunak 1 sibling, 0 replies; 53+ messages in thread From: Lubos Lunak @ 2010-02-03 21:22 UTC (permalink / raw) To: balbir Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wednesday 03 of February 2010, Balbir Singh wrote: > * Lubos Lunak <l.lunak@suse.cz> [2010-02-03 13:10:27]: > > On Wednesday 03 of February 2010, Balbir Singh wrote: > > > 2. RSS alone is not sufficient, RSS does not account for shared pages, > > > so we ideally need something like PSS. > > > > Just to make sure I understand what you mean with "RSS does not account > > for shared pages" - you say that if a page is shared by 4 processes, then > > when calculating badness for them, only 1/4 of the page should be counted > > for each? Yes, I suppose so, that makes sense. > > Yes, that is what I am speaking of > > > That's more like fine-tunning at > > this point though, as long as there's no agreement that moving away from > > VmSize is an improvement. > > There is no easy way to calculate the Pss today without walking the > page tables, but some simplification there will make it a better and a > more accurate metric. OOM should be a rare situation, so doing a little amount of counting shouldn't be a big deal. Especially if the machine is otherwise busy waiting for the HDD paging stuff out and in again and has plenty of CPU time to waste. -- Lubos Lunak openSUSE Boosters team, KDE developer l.lunak@suse.cz , l.lunak@kde.org ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-01 22:02 Improving OOM killer Lubos Lunak ` (2 preceding siblings ...) 2010-02-03 8:57 ` Balbir Singh @ 2010-02-03 14:49 ` Rik van Riel 2010-02-03 17:01 ` Balbir Singh 3 siblings, 1 reply; 53+ messages in thread From: Rik van Riel @ 2010-02-03 14:49 UTC (permalink / raw) To: Lubos Lunak Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes, KOSAKI Motohiro, Balbir Singh, Nick Piggin, Jiri Kosina On 02/01/2010 05:02 PM, Lubos Lunak wrote: > In other words, use VmRSS for measuring memory usage instead of VmSize, and > remove child accumulating. I agree with removing the child accumulating code. That code can do a lot of harm with databases like postgresql, or cause the system's main service (eg. httpd) to be killed when a broken cgi script used up too much memory. IIRC the child accumulating code was introduced to deal with malicious code (fork bombs), but it makes things worse for the (much more common) situation of a system without malicious code simply running out of memory due to being very busy. I have no strong opinion on using RSS vs VmSize. -- All rights reversed. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-03 14:49 ` Rik van Riel
@ 2010-02-03 17:01   ` Balbir Singh
  2010-02-03 18:58     ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Balbir Singh @ 2010-02-03 17:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	David Rientjes, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

* Rik van Riel <riel@redhat.com> [2010-02-03 09:49:18]:

> On 02/01/2010 05:02 PM, Lubos Lunak wrote:
>
> > In other words, use VmRSS for measuring memory usage instead of VmSize, and
> > remove child accumulating.
>
> I agree with removing the child accumulating code. That code can
> do a lot of harm with databases like postgresql, or cause the
> system's main service (eg. httpd) to be killed when a broken cgi
> script used up too much memory.
>
> IIRC the child accumulating code was introduced to deal with
> malicious code (fork bombs), but it makes things worse for the
> (much more common) situation of a system without malicious
> code simply running out of memory due to being very busy.

For fork bombs, we could count the number of children and require a
threshold to be crossed before we consider a process together with its
children in badness().

> I have no strong opinion on using RSS vs VmSize.

David commented and feels strongly about RSS and prefers VmSize.

--
	Balbir

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-03 17:01 ` Balbir Singh
@ 2010-02-03 18:58   ` David Rientjes
  2010-02-03 19:29     ` Frans Pop
  2010-02-03 22:55     ` Lubos Lunak
  0 siblings, 2 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-03 18:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Rik van Riel, Lubos Lunak, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, Balbir Singh wrote:

> > IIRC the child accumulating code was introduced to deal with
> > malicious code (fork bombs), but it makes things worse for the
> > (much more common) situation of a system without malicious
> > code simply running out of memory due to being very busy.
>
> For fork bombs, we could do a number of children number test and have
> a threshold before we consider a process and its children for
> badness().

Yes, we could look for the number of children with separate mm's and
then penalize those threads that have forked an egregious amount, say,
500 tasks. I think we should check for this threshold within the
badness() heuristic to identify such forkbombs and not limit it only to
certain applications.

My rewrite for the badness() heuristic is centered on the idea that
scores should range from 0 to 1000, 0 meaning "never kill this task" and
1000 meaning "kill this task first." The baseline for a thread, p, may
be something like this:

	unsigned int badness(struct task_struct *p,
				unsigned long totalram)
	{
		struct task_struct *child;
		struct mm_struct *mm;
		int forkcount = 0;
		long points;

		task_lock(p);
		mm = p->mm;
		if (!mm) {
			task_unlock(p);
			return 0;
		}
		points = (get_mm_rss(mm) +
			  get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
			  totalram;
		task_unlock(p);

		list_for_each_entry(child, &p->children, sibling)
			/* No lock, child->mm won't be dereferenced */
			if (child->mm && child->mm != mm)
				forkcount++;

		/* Forkbombs get penalized 10% of available RAM */
		if (forkcount > 500)
			points += 100;
		...
		/*
		 * /proc/pid/oom_adj ranges from -1000 to +1000 to either
		 * completely disable oom killing or always prefer it.
		 */
		points += p->signal->oom_adj;

		if (points < 0)
			return 0;
		return (points <= 1000) ? points : 1000;
	}

	static struct task_struct *select_bad_process(...,
						nodemask_t *nodemask)
	{
		struct task_struct *p;
		unsigned long totalram = 0;
		int nid;

		for_each_node_mask(nid, nodemask)
			totalram += NODE_DATA(nid)->node_present_pages;

		for_each_process(p) {
			unsigned int points;
			...
			if (!nodes_intersects(p->mems_allowed, *nodemask))
				continue;
			...
			points = badness(p, totalram);
			...
		}
		...
	}

In this example, /proc/pid/oom_adj now ranges from -1000 to +1000, with
OOM_DISABLE being -1000, to polarize tasks for oom killing or determine
when a task is leaking memory because it is using far more memory than
it should. The nodemask passed from the page allocator should be
intersected with current->mems_allowed within the oom killer; userspace
is then fully aware of what value is an egregious amount of RAM for a
task to consume, including information it knows about the task's cpuset
or mempolicy.

For example, it would be very simple for a user to set an oom_adj of
-500, which means "we discount 50% of the task's allowed memory from
being considered in the heuristic" or +500, which means "we always allow
all other system/cpuset/mempolicy tasks to use at least 50% more allowed
memory than this one."

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-03 18:58 ` David Rientjes
@ 2010-02-03 19:29   ` Frans Pop
  2010-02-03 19:52     ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread
From: Frans Pop @ 2010-02-03 19:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: balbir, riel, l.lunak, linux-mm, linux-kernel, akpm,
	kosaki.motohiro, npiggin, jkosina

David Rientjes wrote:
> 	/*
> 	 * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> 	 * completely disable oom killing or always prefer it.
> 	 */
> 	points += p->signal->oom_adj;

Wouldn't that cause a rather huge compatibility issue, given that the
current oom_adj works in a totally different way?

! 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
! ------------------------------------------------------
! This file can be used to adjust the score used to select which processes
! should be killed in an out-of-memory situation. Giving it a high score
! will increase the likelihood of this process being killed by the
! oom-killer. Valid values are in the range -16 to +15, plus the special
! value -17, which disables oom-killing altogether for this process.

Cheers,
FJP

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-03 19:29 ` Frans Pop
@ 2010-02-03 19:52   ` David Rientjes
  2010-02-03 20:12     ` Frans Pop
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-03 19:52 UTC (permalink / raw)
  To: Frans Pop
  Cc: Balbir Singh, Rik van Riel, l.lunak, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, jkosina, linux-kernel, linux-mm

On Wed, 3 Feb 2010, Frans Pop wrote:

> > 	 * /proc/pid/oom_adj ranges from -1000 to +1000 to either
> > 	 * completely disable oom killing or always prefer it.
> > 	 */
> > 	points += p->signal->oom_adj;
>
> Wouldn't that cause a rather huge compatibility issue given that the
> current oom_adj works in a totally different way:
>
> ! 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
> ! ------------------------------------------------------
> ! This file can be used to adjust the score used to select which processes
> ! should be killed in an out-of-memory situation. Giving it a high score
> ! will increase the likelihood of this process being killed by the
> ! oom-killer. Valid values are in the range -16 to +15, plus the special
> ! value -17, which disables oom-killing altogether for this process.
>
> ?

I thought about whether we'd need an additional, complementary tunable
such as /proc/pid/oom_bias that would effect this new memory-charging
bias in the heuristic. It could be implemented so that writing to
oom_adj would clear oom_bias and vice versa. Although that would
certainly be possible, I didn't propose it for a couple of reasons:

 - it would clutter the space to have two separate tunables when the
   metric that /proc/pid/oom_adj uses has become obsolete by the new
   baseline as a fraction of total RAM, and

 - we have always exported OOM_DISABLE, OOM_ADJUST_MIN, and
   OOM_ADJUST_MAX via include/oom.h so that userspace should use them
   sanely.
   Setting a particular oom_adj value for anything other than
   OOM_DISABLE means the score will be relative to other system tasks,
   so it's a value that is typically calibrated at runtime rather than a
   static, hardcoded value.

We could reuse /proc/pid/oom_adj for the new heuristic by severely
reducing its granularity, rescaling it as
(oom_adj * 1000 / OOM_ADJUST_MAX), but that will eventually become
annoying and much more difficult to document.

Given your citation, I don't think we've ever described
/proc/pid/oom_adj outside of the implementation as a bitshift, either,
so its use right now for anything other than OOM_DISABLE is probably
based on scalar thinking.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 19:52 ` David Rientjes @ 2010-02-03 20:12 ` Frans Pop 2010-02-03 20:26 ` David Rientjes 0 siblings, 1 reply; 53+ messages in thread From: Frans Pop @ 2010-02-03 20:12 UTC (permalink / raw) To: David Rientjes Cc: Balbir Singh, Rik van Riel, l.lunak, Andrew Morton, KOSAKI Motohiro, Nick Piggin, jkosina, linux-kernel, linux-mm On Wednesday 03 February 2010, David Rientjes wrote: > - we have always exported OOM_DISABLE, OOM_ADJUST_MIN, and > OOM_ADJUST_MAX via include/oom.h so that userspace should use them > sanely. Setting a particular oom_adj value for anything other than > OOM_DISABLE means the score will be relative to other system tasks, so > its a value that is typically calibrated at runtime rather than static, > hardcoded values. That doesn't take into account: - applications where the oom_adj value is hardcoded to a specific value (for whatever reason) - sysadmin scripts that set oom_adj from the console I would think that oom_adj is a documented part of the userspace ABI and that the change you propose does not fit the normal backwards compatibility requirements for exposed tunables. I think that at least any user who's currently setting oom_adj to -17 has a right to expect that to continue to mean "oom killer disabled". And for any other value they should get a similar impact to the current impact, and not one that's reduced by a factor 66. > We could reuse /proc/pid/oom_adj for the new heuristic by severely > reducing its granularity than it otherwise would by doing > (oom_adj * 1000 / OOM_ADJUST_MAX), but that will eventually become > annoying and much more difficult to document. Probably quite true, but maybe unavoidable if one accepts the above. But I'll readily admit I'm not the final authority on this. Cheers, FJP ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 20:12 ` Frans Pop @ 2010-02-03 20:26 ` David Rientjes 0 siblings, 0 replies; 53+ messages in thread From: David Rientjes @ 2010-02-03 20:26 UTC (permalink / raw) To: Frans Pop Cc: Balbir Singh, Rik van Riel, l.lunak, Andrew Morton, KOSAKI Motohiro, Nick Piggin, jkosina, linux-kernel, linux-mm On Wed, 3 Feb 2010, Frans Pop wrote: > That doesn't take into account: > - applications where the oom_adj value is hardcoded to a specific value > (for whatever reason) > - sysadmin scripts that set oom_adj from the console > The fundamentals are the same: negative values mean the task is less likely to be preferred and positive values mean the task is more likely, only the scale is different. That scale is exported by the kernel via OOM_ADJUST_MIN and OOM_ADJUST_MAX and has been since 2006. I don't think we need to preserve legacy applications or scripts that use hardcoded values without importing linux/oom.h. > I would think that oom_adj is a documented part of the userspace ABI and > that the change you propose does not fit the normal backwards > compatibility requirements for exposed tunables. > The range is documented (but it should have been documented as being from OOM_ADJUST_MIN to OOM_ADJUST_MAX) but its implementation as a bitshift is not; it simply says that positive values mean the task is more preferred and negative values mean it is less preferred. Those semantics are preserved. > I think that at least any user who's currently setting oom_adj to -17 has a > right to expect that to continue to mean "oom killer disabled". And for > any other value they should get a similar impact to the current impact, > and not one that's reduced by a factor 66. > If the baseline changes as we all agree it needs to such that oom_adj no longer represents the same thing it did in the first place (it would become a linear bias), I think this breakage is actually beneficial. 
Users will now be able to tune their oom_adj values based on a fraction
of system memory to bias their applications either preferably or
otherwise.

I think we should look at Linux over the next couple of years and decide
whether we want to be married to the current semantics of oom_adj, which
are going to change anyway (as it would require a factor of 66, as you
mentioned) when the implementation it was designed for has vanished.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-03 18:58 ` David Rientjes 2010-02-03 19:29 ` Frans Pop @ 2010-02-03 22:55 ` Lubos Lunak 2010-02-04 0:05 ` David Rientjes 1 sibling, 1 reply; 53+ messages in thread From: Lubos Lunak @ 2010-02-03 22:55 UTC (permalink / raw) To: David Rientjes Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wednesday 03 of February 2010, David Rientjes wrote: > My rewrite for the badness() heuristic is centered on the idea that scores > should range from 0 to 1000, 0 meaning "never kill this task" and 1000 > meaning "kill this task first." The baseline for a thread, p, may be > something like this: > > unsigned int badness(struct task_struct *p, > unsigned long totalram) > { > struct task_struct *child; > struct mm_struct *mm; > int forkcount = 0; > long points; > > task_lock(p); > mm = p->mm; > if (!mm) { > task_unlock(p); > return 0; > } > points = (get_mm_rss(mm) + > get_mm_counter(mm, MM_SWAPENTS)) * 1000 / > totalram; > task_unlock(p); > > list_for_each_entry(child, &p->children, sibling) > /* No lock, child->mm won't be dereferenced */ > if (child->mm && child->mm != mm) > forkcount++; > > /* Forkbombs get penalized 10% of available RAM */ > if (forkcount > 500) > points += 100; As far as I'm concerned, this is a huge improvement over the current code (and, incidentally :), quite close to what I originally wanted). I'd be willing to test it in few real-world desktop cases if you provide a patch. > /* > * /proc/pid/oom_adj ranges from -1000 to +1000 to either > * completely disable oom killing or always prefer it. > */ > points += p->signal->oom_adj; This changes semantics of oom_adj, but given that I expect the above to make oom_adj unnecessary on the desktop for the normal cases, I don't really mind. -- Lubos Lunak openSUSE Boosters team, KDE developer l.lunak@suse.cz , l.lunak@kde.org ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-03 22:55 ` Lubos Lunak
@ 2010-02-04  0:05   ` David Rientjes
  2010-02-04  0:18     ` Rik van Riel
  ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-04  0:05 UTC (permalink / raw)
  To: Lubos Lunak
  Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 3 Feb 2010, Lubos Lunak wrote:

> > 	unsigned int badness(struct task_struct *p,
> > 				unsigned long totalram)
> > 	{
> > 		struct task_struct *child;
> > 		struct mm_struct *mm;
> > 		int forkcount = 0;
> > 		long points;
> >
> > 		task_lock(p);
> > 		mm = p->mm;
> > 		if (!mm) {
> > 			task_unlock(p);
> > 			return 0;
> > 		}
> > 		points = (get_mm_rss(mm) +
> > 			  get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> > 			  totalram;
> > 		task_unlock(p);
> >
> > 		list_for_each_entry(child, &p->children, sibling)
> > 			/* No lock, child->mm won't be dereferenced */
> > 			if (child->mm && child->mm != mm)
> > 				forkcount++;
> >
> > 		/* Forkbombs get penalized 10% of available RAM */
> > 		if (forkcount > 500)
> > 			points += 100;
>
> As far as I'm concerned, this is a huge improvement over the current
> code (and, incidentally :), quite close to what I originally wanted).
> I'd be willing to test it in a few real-world desktop cases if you
> provide a patch.

There are some things that still need to be worked out, like discounting
hugetlb pages on each allowed node, respecting current's cpuset mems,
etc., but I think it gives us a good rough draft of where we might end
up.

I did use the get_mm_rss() that you suggested, but I think it's more
helpful in the context of a fraction of total memory allowed, so the
other heuristics (forkbomb, root tasks, nice'd tasks, etc.) are
penalizing the points in a known quantity rather than manipulating that
baseline.

Do you have any comments about the forkbomb detector or its threshold
that I've put in my heuristic?
I think detecting these scenarios is still an important issue that we need to address instead of simply removing it from consideration entirely. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-04 0:05 ` David Rientjes @ 2010-02-04 0:18 ` Rik van Riel 2010-02-04 21:48 ` David Rientjes 2010-02-04 7:58 ` Lubos Lunak 2010-02-04 9:50 ` Jiri Kosina 2 siblings, 1 reply; 53+ messages in thread From: Rik van Riel @ 2010-02-04 0:18 UTC (permalink / raw) To: David Rientjes Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On 02/03/2010 07:05 PM, David Rientjes wrote: > On Wed, 3 Feb 2010, Lubos Lunak wrote: >>> /* Forkbombs get penalized 10% of available RAM */ >>> if (forkcount> 500) >>> points += 100; > Do you have any comments about the forkbomb detector or its threshold that > I've put in my heuristic? I think detecting these scenarios is still an > important issue that we need to address instead of simply removing it from > consideration entirely. I believe that malicious users are best addressed in person, or preemptively through cgroups and rlimits. Having a process with over 500 children is quite possible with things like apache, Oracle, postgres and other forking daemons. Killing the parent process can result in the service becoming unavailable, and in some cases even data corruption. -- All rights reversed. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-04 0:18 ` Rik van Riel @ 2010-02-04 21:48 ` David Rientjes 2010-02-04 22:06 ` Rik van Riel 2010-02-04 22:31 ` Frans Pop 0 siblings, 2 replies; 53+ messages in thread From: David Rientjes @ 2010-02-04 21:48 UTC (permalink / raw) To: Rik van Riel Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wed, 3 Feb 2010, Rik van Riel wrote: > > Do you have any comments about the forkbomb detector or its threshold that > > I've put in my heuristic? I think detecting these scenarios is still an > > important issue that we need to address instead of simply removing it from > > consideration entirely. > > I believe that malicious users are best addressed in person, > or preemptively through cgroups and rlimits. > Forkbombs need not be the result of malicious users. > Having a process with over 500 children is quite possible > with things like apache, Oracle, postgres and other forking > daemons. > It's clear that the forkbomb threshold would need to be definable from userspace and probably default to something high such as 1000. Keep in mind that we're in the oom killer here, though. So we're out of memory and we need to kill something; should Apache, Oracle, and postgres not be penalized for their cost of running by factoring in something like this? (lowest rss size of children) * (# of first-generation children) / (forkbomb threshold) > Killing the parent process can result in the service > becoming unavailable, and in some cases even data > corruption. > There's only one possible rememdy for that, which is OOM_DISABLE; the oom killer cannot possibly predict data corruption as the result of killing a process and this is no different. Everything besides init, kthreads, OOM_DISABLE threads, and threads that do not share the same cpuset, memcg, or set of allowed mempolicy nodes are candidates for oom kill. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-04 21:48 ` David Rientjes @ 2010-02-04 22:06 ` Rik van Riel 2010-02-04 22:14 ` David Rientjes 2010-02-04 22:31 ` Frans Pop 1 sibling, 1 reply; 53+ messages in thread From: Rik van Riel @ 2010-02-04 22:06 UTC (permalink / raw) To: David Rientjes Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On 02/04/2010 04:48 PM, David Rientjes wrote: > Keep in mind that we're in the oom killer here, though. So we're out of > memory and we need to kill something; should Apache, Oracle, and postgres > not be penalized for their cost of running by factoring in something like > this? No, they should not. The goal of the OOM killer is to kill some process, so the system can continue running and automatically become available again for whatever workload the system was running. Killing the parent process of one of the system daemons does not achieve that goal, because you now caused a service to no longer be available. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-04 22:06 ` Rik van Riel
@ 2010-02-04 22:14   ` David Rientjes
  2010-02-10 20:54     ` Lubos Lunak
  0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-04 22:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Lubos Lunak, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 4 Feb 2010, Rik van Riel wrote:

> > Keep in mind that we're in the oom killer here, though. So we're out of
> > memory and we need to kill something; should Apache, Oracle, and postgres
> > not be penalized for their cost of running by factoring in something like
> > this?
>
> No, they should not.
>
> The goal of the OOM killer is to kill some process, so the
> system can continue running and automatically become available
> again for whatever workload the system was running.
>
> Killing the parent process of one of the system daemons does
> not achieve that goal, because you now caused a service to no
> longer be available.

The system daemon wouldn't be killed, though. You're right that this
heuristic would prefer the system daemon slightly more as a result of
the forkbomb penalty, but the oom killer always attempts to sacrifice a
child with a separate mm before killing the selected task. Since the
forkbomb heuristic only adds up those children with separate mms, we're
guaranteed not to kill the daemon itself.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-04 22:14 ` David Rientjes
@ 2010-02-10 20:54   ` Lubos Lunak
  2010-02-10 21:10     ` Rik van Riel
  0 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 20:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Rik van Riel, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
	KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thursday 04 of February 2010, David Rientjes wrote:
> On Thu, 4 Feb 2010, Rik van Riel wrote:
> > The goal of the OOM killer is to kill some process, so the
> > system can continue running and automatically become available
> > again for whatever workload the system was running.
> >
> > Killing the parent process of one of the system daemons does
> > not achieve that goal, because you now caused a service to no
> > longer be available.
>
> The system daemon wouldn't be killed, though. You're right that this
> heuristic would prefer the system daemon slightly more as a result of the
> forkbomb penalty, but the oom killer always attempts to sacrifice a child
> with a seperate mm before killing the selected task. Since the forkbomb
> heuristic only adds up those children with seperate mms, we're guaranteed
> to not kill the daemon itself.

This, however, can mean that sparing the system daemon is traded for
DoS-ing the whole system, if the daemon keeps spawning new children as
soon as the OOM killer frees up resources for them. This looks like the
wrong solution to me - it's like trying to save a target by shooting
down all the incoming bombs instead of shooting the bomber.

If the OOM situation is caused by one or a limited number of its
children, or if the system daemon is not responsible for the forkbomb
(e.g. it's only a subtree of its children), then it won't be selected
for killing anyway. If it is responsible for the forkbomb, the OOM
killer can keep killing the bombs forever to no avail.
-- Lubos Lunak openSUSE Boosters team, KDE developer l.lunak@suse.cz , l.lunak@kde.org ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-10 20:54 ` Lubos Lunak @ 2010-02-10 21:10 ` Rik van Riel 2010-02-10 21:29 ` Lubos Lunak 2010-02-10 22:18 ` Alan Cox 0 siblings, 2 replies; 53+ messages in thread From: Rik van Riel @ 2010-02-10 21:10 UTC (permalink / raw) To: Lubos Lunak Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On 02/10/2010 03:54 PM, Lubos Lunak wrote: > Which however can mean that not killing this system daemon will be traded for > DoS-ing the whole system, if the daemon keeps spawning new children as soon > as the OOM killer frees up resources for them. Killing the system daemon *is* a DoS. It would stop eg. the database or the web server, which is generally the main task of systems that run a database or a web server. -- All rights reversed. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-10 21:10 ` Rik van Riel @ 2010-02-10 21:29 ` Lubos Lunak 2010-02-10 22:18 ` Alan Cox 1 sibling, 0 replies; 53+ messages in thread From: Lubos Lunak @ 2010-02-10 21:29 UTC (permalink / raw) To: Rik van Riel Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wednesday 10 of February 2010, Rik van Riel wrote: > On 02/10/2010 03:54 PM, Lubos Lunak wrote: > > Which however can mean that not killing this system daemon will be > > traded for DoS-ing the whole system, if the daemon keeps spawning new > > children as soon as the OOM killer frees up resources for them. > > Killing the system daemon *is* a DoS. Maybe, but if there are two such system daemons on the machine, it's only half of the other DoS. And since that system daemon has already been identified as a forkbomb, it's probably already useless anyway and killing the children won't save anything. In which realistic case a system daemon has children that together cause OOM, yet can still be considered working after you kill one or a limited number of those children? > It would stop eg. the database or the web server, which is > generally the main task of systems that run a database or > a web server. -- Lubos Lunak openSUSE Boosters team, KDE developer l.lunak@suse.cz , l.lunak@kde.org ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-10 21:10 ` Rik van Riel 2010-02-10 21:29 ` Lubos Lunak @ 2010-02-10 22:18 ` Alan Cox 2010-02-10 22:31 ` David Rientjes 2010-02-11 9:50 ` Lubos Lunak 1 sibling, 2 replies; 53+ messages in thread From: Alan Cox @ 2010-02-10 22:18 UTC (permalink / raw) To: Rik van Riel Cc: Lubos Lunak, David Rientjes, Balbir Singh, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina > Killing the system daemon *is* a DoS. > > It would stop eg. the database or the web server, which is > generally the main task of systems that run a database or > a web server. One of the problems with picking on tasks that fork a lot is that describes apache perfectly. So a high loaded apache will get shot over a rapid memory eating cgi script. Any heuristic is going to be iffy - but that isn't IMHO a good one to work from. If anything "who allocated lots of RAM recently" may be a better guide but we don't keep stats for that. Alan ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-10 22:18 ` Alan Cox
@ 2010-02-10 22:31   ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-10 22:31 UTC (permalink / raw)
  To: Alan Cox
  Cc: Rik van Riel, Lubos Lunak, Balbir Singh, linux-mm, linux-kernel,
	Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 10 Feb 2010, Alan Cox wrote:

> One of the problems with picking on tasks that fork a lot is that
> describes apache perfectly. So a high loaded apache will get shot over a
> rapid memory eating cgi script.

With my rewrite, the oom killer would not select apache but rather the
child with a separate address space that is consuming the largest amount
of allowed memory, and only when a configurable number of such children
(1000 by default) have not had any runtime.

My heuristic is only meant to slightly penalize such tasks so that they
can be distinguished, when picking an oom kill victim, from other
parents with comparable memory usage. Enforcing a strict forkbomb policy
is out of the scope of the oom killer, though, so no attempt was made.

> Any heuristic is going to be iffy - but that isn't IMHO a good one to
> work from. If anything "who allocated lots of RAM recently" may be a
> better guide but we don't keep stats for that.

That's basically what my heuristic does: if a parent is identified as a
forkbomb, then it is only penalized by averaging the memory consumption
of those children and then multiplying it by the number of times the
configurable forkbomb threshold was reached.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-10 22:18 ` Alan Cox
2010-02-10 22:31 ` David Rientjes
@ 2010-02-11 9:50 ` Lubos Lunak
1 sibling, 0 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-11 9:50 UTC (permalink / raw)
To: Alan Cox
Cc: Rik van Riel, David Rientjes, Balbir Singh, linux-mm, linux-kernel,
Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 10 of February 2010, Alan Cox wrote:
> > Killing the system daemon *is* a DoS.
> >
> > It would stop eg. the database or the web server, which is
> > generally the main task of systems that run a database or
> > a web server.
>
> One of the problems with picking on tasks that fork a lot is that this
> describes apache perfectly. So a highly loaded apache will get shot over a
> rapid memory-eating cgi script.

It will not. If it's only a single cgi script, then that child should be
selected by badness(), not the parent.

I personally consider the logic of trying to find the offender using
badness() and then killing its child instead to be flawed. badness()
itself should already select what to kill, and that should be killed. If a
single process is the offender, it should be killed. If badness() decides
that a whole subtree is responsible for the situation, then the top of it
needs to be killed, otherwise the cause of the problem will remain. I
expect the current logic of trying to kill children first is based on the
system daemon logic, but if e.g. the Apache master process itself causes
OOM, then the kernel has no way to find out whether it's an important
process that should be protected or some random process causing a
forkbomb.

From the kernel's point of view, if the Apache master process caused the
problem, then the problem should be solved there. If the reason for the
problem was actually e.g. a temporary high load on the server, then Apache
is probably misconfigured, and if it really should stay running no matter
what, then I guess that's the case for using oom_adj. But otherwise, from
the OOM killer's point of view, that is where the problem was.

Of course, the algorithm used in badness() should be careful not to
propagate the excessive memory usage in that case to an innocent parent.
This problem existed in the current code until it was fixed by the "/2"
recently, and at least my current proposal actually suffers from it too.
But I envision something like this could handle it nicely (pseudocode):

int oom_children_memory_usage(task)
{
    // Memory shared with the parent should not be counted again.
    // Since it's expensive to find that out exactly, just assume
    // that the amount of shared memory that is not shared with the parent
    // is insignificant.
    total = unshared_rss(task) + unshared_swap(task);
    foreach_child(child, task)
        total += oom_children_memory_usage(child);
    return total;
}

int badness(task)
{
    int total_memory = 0;
    ...
    int max_child_memory = 0;   // most memory used by a child
    int max_child_memory_2 = 0; // the 2nd most memory used by a child
    foreach_child(child, task) {
        if (sharing_the_same_memory(child, task))
            continue;
        if (real_time(child) > 1minute)
            continue; // running long, not a forkbomb
        int memory = oom_children_memory_usage(child);
        total_memory += memory;
        if (memory > max_child_memory) {
            max_child_memory_2 = max_child_memory;
            max_child_memory = memory;
        } else if (memory > max_child_memory_2)
            max_child_memory_2 = memory;
    }
    if (max_child_memory_2 != 0) { // there were at least two children
        if (max_child_memory > max_child_memory_2 * 2) {
            // There is a single child that contributes the majority of
            // the memory used by all children. Do not add it to the
            // total, so that if that process is the biggest offender,
            // the killer picks it instead of this parent.
            total_memory -= max_child_memory;
        }
    }
    ...
}

The logic is simply that a process is responsible for its children only if
their cost is similar. If one of them stands out, it is responsible for
itself and the parent is not.

This exclusion is intentionally not done recursively in
oom_children_memory_usage(), to also cover the case when e.g. parallel
make runs too many processes wrapped by shell; in that case making any of
those shell instances responsible for its child doesn't help anything, but
making make responsible for all of them does.

Alternatively, if somebody has a good use case where first going after a
child may make sense, then it would perhaps help to add
'oom_recently_killed_children' to each task, increasing it whenever a
child is killed instead of the responsible parent. As soon as the value
within a reasonably short time is higher than, let's say, 5, then
apparently killing children does not help and the mastermind has to go.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-04 21:48 ` David Rientjes
2010-02-04 22:06 ` Rik van Riel
@ 2010-02-04 22:31 ` Frans Pop
2010-02-04 22:53 ` David Rientjes
1 sibling, 1 reply; 53+ messages in thread
From: Frans Pop @ 2010-02-04 22:31 UTC (permalink / raw)
To: David Rientjes
Cc: riel, l.lunak, balbir, linux-mm, linux-kernel, akpm,
kosaki.motohiro, npiggin, jkosina

David Rientjes wrote:
> It's clear that the forkbomb threshold would need to be definable from
> userspace and probably default to something high such as 1000.
>
> Keep in mind that we're in the oom killer here, though. So we're out of
> memory and we need to kill something; should Apache, Oracle, and postgres
> not be penalized for their cost of running by factoring in something like
> this?
>
> (lowest rss size of children) * (# of first-generation children) /
> (forkbomb threshold)

Shouldn't fork bomb detection take into account the age of children?
After all, long-running processes with a lot of long-running children are
rather unlikely to be runaway fork _bombs_.

Children of desktop environments are more likely to be long-running than
e.g. those of a server process that's being DoSed. The goal of the OOM
killer is, IIUC, to identify the process that's causing the immediate
problem, so in this example it should prefer the latter.

Cheers,
FJP

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-04 22:31 ` Frans Pop
@ 2010-02-04 22:53 ` David Rientjes
0 siblings, 0 replies; 53+ messages in thread
From: David Rientjes @ 2010-02-04 22:53 UTC (permalink / raw)
To: Frans Pop
Cc: Rik van Riel, l.lunak, Balbir Singh, linux-mm, linux-kernel,
Andrew Morton, KOSAKI Motohiro, Nick Piggin, jkosina

On Thu, 4 Feb 2010, Frans Pop wrote:

> Shouldn't fork bomb detection take into account the age of children?
> After all, long-running processes with a lot of long-running children are
> rather unlikely to be runaway fork _bombs_.

Yeah, Lubos mentioned using cpu time as a requirement, in addition to the
already existing child->mm != parent->mm, as a prerequisite to be added
into the tally that is checked against the forkbomb threshold. I think
something like this would be appropriate:

	struct task_cputime task_time;
	int forkcount = 0;
	int child_rss = 0;
	...
	list_for_each_entry(child, &p->children, sibling) {
		unsigned long runtime;

		task_lock(child);
		if (!child->mm || child->mm == p->mm) {
			task_unlock(child);
			continue;
		}
		thread_group_cputime(child, &task_time);
		runtime = cputime_to_jiffies(task_time.utime) +
			  cputime_to_jiffies(task_time.stime);
		/*
		 * Only threads that have run for less than a second are
		 * considered toward the forkbomb; these threads rarely
		 * get to execute at all in such cases anyway.
		 */
		if (runtime >= HZ) {
			task_unlock(child);
			continue;
		}
		child_rss += get_mm_rss(child->mm);
		task_unlock(child);
		forkcount++;
	}

	if (forkcount > sysctl_oom_forkbomb_thres) {
		/*
		 * Penalize forkbombs by considering the average rss and
		 * how many factors we are over the threshold.
		 */
		points += child_rss / sysctl_oom_forkbomb_thres;
	}

I changed the calculation from lowest child rss to average child rss, so
this is functionally equivalent to

	(average rss size of children) * (# of first-generation execve
	children) / sysctl_oom_forkbomb_thres

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-04 0:05 ` David Rientjes
2010-02-04 0:18 ` Rik van Riel
@ 2010-02-04 7:58 ` Lubos Lunak
2010-02-04 21:34 ` David Rientjes
2010-02-04 9:50 ` Jiri Kosina
2 siblings, 1 reply; 53+ messages in thread
From: Lubos Lunak @ 2010-02-04 7:58 UTC (permalink / raw)
To: David Rientjes
Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thursday 04 of February 2010, David Rientjes wrote:
> On Wed, 3 Feb 2010, Lubos Lunak wrote:
> > As far as I'm concerned, this is a huge improvement over the current
> > code (and, incidentally :), quite close to what I originally wanted). I'd
> > be willing to test it in a few real-world desktop cases if you provide a
> > patch.
>
> There're some things that still need to be worked out,

Ok. Just please do not let the perfect stand in the way of the good for
way too long.

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic? I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

I think before finding out the answer it should first be figured out what
the question is :). Besides the vague "forkbomb" description I still don't
know what realistic scenarios this is supposed to handle. IMO trying to
cover intentional abuse is a lost fight, so I think the purpose of this
code should just be to handle cases where a mistake leads to relatively
fast spawning of children of a specific parent, which in turn leads to
OOM. The shape of the children subtree doesn't matter; it can be either a
parent with many direct children, or children being created recursively -
I think any case is possible here. A realistic example would be e.g.
typing 'make -j22' by mistake instead of 'make -j2' and overloading the
machine with too many g++ instances. That would actually be a non-trivial
tree of children, with recursive make and sh processes in it.

A good way to detect this would be checking in badness() whether the
process has any children with relatively low CPU and real time values
(let's say less than a minute). If yes, the badness score should also
account for all these children, recursively. I'm not sure about the exact
formula; just summing up the memory usage like it is done now does not fit
your 0-1000 score idea, and it's also wrong because it doesn't take
sharing of memory into consideration (e.g. a KDE app with several
kdelibs-based children could achieve a massive score here because of
extensive sharing, even though the actual memory usage increase caused by
them could be insignificant). I don't know kernel internals, so I don't
know how feasible it would be, but one naive idea would be to simply count
how big a portion of the total memory all these considered processes
occupy.

This indeed would not handle the case when a tree of processes slowly
leaks, for example a bug in Apache causing all the forked children of the
master process to leak memory equally, with none of the single children
leaking enough to score more than a single unrelated innocent process.
Here I question how realistic such a scenario actually is, and mainly the
actual possibility of detecting such a case. I do not see how code could
distinguish this from the case of using Konsole or XTerm to launch a
number of unrelated KDE/X applications, each of which occupies a
considerable amount of memory. Here clearly killing the Konsole/XTerm and
all the spawned applications with it is incorrect, so with no obvious
offender the OOM killer would simply have to pick something.

And since you now probably feel the urge to point out oom_adj again, I
want to point out again that it's not a very good solution for the
desktop, and that Konsole/XTerm should not have such protection unless the
user explicitly sets it up themselves - e.g. Konsole can be set to
infinite scrollback, so when accidentally running something that produces
a huge amount of output, Konsole actually could be the only right process
to kill. So I think the case of a slowly leaking group of children cannot
be reasonably solved in code.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-04 7:58 ` Lubos Lunak
@ 2010-02-04 21:34 ` David Rientjes
2010-02-10 20:54 ` Lubos Lunak
0 siblings, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-04 21:34 UTC (permalink / raw)
To: Lubos Lunak
Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 4 Feb 2010, Lubos Lunak wrote:

> > There're some things that still need to be worked out,
>
> Ok. Just please do not let the perfect stand in the way of the good for
> way too long.

The changes to the heuristic will be part of a larger patchset that
basically rewrites the oom killer, since many other suggestions and asides
have been mentioned during the course of the discussion: mempolicy-targeted
oom killing, killing only the child with the highest badness score first,
always ensuring the killed task shares a set of allowed nodes with
current, preempting the oom killer for GFP_DMA allocations, etc. I'll
write that patchset within the next couple of days, but we're still
talking about 2.6.35 material at the earliest. If you could test it with
your particular usecase, we can make sure to work any heuristic problems
out in the early stages.

> > Do you have any comments about the forkbomb detector or its threshold
> > that I've put in my heuristic? I think detecting these scenarios is
> > still an important issue that we need to address instead of simply
> > removing it from consideration entirely.
>
> I think before finding out the answer it should first be figured out what
> the question is :). Besides the vague "forkbomb" description I still
> don't know what realistic scenarios this is supposed to handle. IMO
> trying to cover intentional abuse is a lost fight, so I think the purpose
> of this code should just be to handle cases where a mistake leads to
> relatively fast spawning of children of a specific parent, which in turn
> leads to OOM.

Yes, forkbombs are not always malicious; they can be the result of buggy
code, and there's no other kernel mechanism that will hold them off so
that the machine is still usable. If a task forks and execve's thousands
of threads on your 2GB desktop machine - whether it's malicious, it's a
bug, or the user made a mistake - that's going to be detrimental depending
on the nature of what was executed, especially to your interactivity :)
Keep in mind that the forking parent, such as a job scheduler or terminal,
and all of its individual children may have very small rss and swap
statistics, even though cumulatively they're a problem. In that scenario,
KDE is once again going to become the target on your machine when in
reality we should target the forkbomb.

How we target the forkbomb could be another topic for discussion. Recall
that the oom killer does not necessarily always kill the task with the
highest badness() score; it will always attempt to kill one of its
children with a separate mm first. That doesn't help us as much as it
could in the forkbomb scenario, since the parent could likely continue to
fork; the end result you'll see is difficulty in getting to a command line
where you can find and kill the parent yourself. Thus, we need to preempt
the preference to kill the child first when we've detected a forkbomb task
and it is selected for oom kill (detecting a forkbomb is only a
penalization in the heuristic, it doesn't mean automatic killing, so
another task consuming much more memory could be chosen instead).

> The shape of the children subtree doesn't matter; it can be either a
> parent with many direct children, or children being created recursively -
> I think any case is possible here. A realistic example would be e.g.
> typing 'make -j22' by mistake instead of 'make -j2' and overloading the
> machine with too many g++ instances. That would actually be a non-trivial
> tree of children, with recursive make and sh processes in it.

We can't address recursive forkbombing in the oom killer with any
efficiency, but luckily those cases aren't very common.

> A good way to detect this would be checking in badness() whether the
> process has any children with relatively low CPU and real time values
> (let's say less than a minute). If yes, the badness score should also
> account for all these children, recursively. I'm not sure about the exact
> formula; just summing up the memory usage like it is done now does not
> fit your 0-1000 score idea, and it's also wrong because it doesn't take
> sharing of memory into consideration (e.g. a KDE app with several
> kdelibs-based children could achieve a massive score here because of
> extensive sharing, even though the actual memory usage increase caused by
> them could be insignificant). I don't know kernel internals, so I don't
> know how feasible it would be, but one naive idea would be to simply
> count how big a portion of the total memory all these considered
> processes occupy.

badness() can't be called recursively on children, so we need to look for
simple metrics like the ones you mentioned: cpu time, for example. I think
we should also look at uid.

> This indeed would not handle the case when a tree of processes slowly
> leaks, for example a bug in Apache causing all the forked children of the
> master process to leak memory equally, with none of the single children
> leaking enough to score more than a single unrelated innocent process.
> Here I question how realistic such a scenario actually is, and mainly the
> actual possibility of detecting such a case. I do not see how code could
> distinguish this from the case of using Konsole or XTerm to launch a
> number of unrelated KDE/X applications, each of which occupies a
> considerable amount of memory. Here clearly killing the Konsole/XTerm and
> all the spawned applications with it is incorrect, so with no obvious
> offender the OOM killer would simply have to pick something.

The memory consumption of these children was not considered in my rough
draft; it was simply a counter of how many first-generation children each
task has. When Konsole forks a very large number of tasks, is it
unreasonable to bias it with a penalty of perhaps 10% of RAM? Or should we
take the lowest rss size of those children, multiply it by the number of
children, and penalize the parent after it reaches a certain threshold
(500? 1000?) as the "cost of running the parent"? This heavily biases
parents that have forked a very large number of small threads that would
be directly killed, whereas the large, memory-hog children are killed
themselves based on their own badness() heuristic.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-04 21:34 ` David Rientjes
@ 2010-02-10 20:54 ` Lubos Lunak
2010-02-10 21:09 ` Rik van Riel
2010-02-10 22:25 ` David Rientjes
0 siblings, 2 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 20:54 UTC (permalink / raw)
To: David Rientjes
Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thursday 04 of February 2010, David Rientjes wrote:
> On Thu, 4 Feb 2010, Lubos Lunak wrote:
> > I think before finding out the answer it should first be figured out
> > what the question is :). Besides the vague "forkbomb" description I
> > still don't know what realistic scenarios this is supposed to handle.
> > IMO trying to cover intentional abuse is a lost fight, so I think the
> > purpose of this code should just be to handle cases where a mistake
> > leads to relatively fast spawning of children of a specific parent,
> > which in turn leads to OOM.
>
> Yes, forkbombs are not always malicious; they can be the result of buggy
> code, and there's no other kernel mechanism that will hold them off so
> that the machine is still usable. If a task forks and execve's thousands
> of threads on your 2GB desktop machine - whether it's malicious, it's a
> bug, or the user made a mistake - that's going to be detrimental
> depending on the nature of what was executed, especially to your
> interactivity :) Keep in mind that the forking parent, such as a job
> scheduler or terminal, and all of its individual children may have very
> small rss and swap statistics, even though cumulatively they're a
> problem.

Which is why I suggested summing up the memory of the parent and its
children.

> > The shape of the children subtree doesn't matter; it can be either a
> > parent with many direct children, or children being created
> > recursively - I think any case is possible here. A realistic example
> > would be e.g. typing 'make -j22' by mistake instead of 'make -j2' and
> > overloading the machine with too many g++ instances. That would
> > actually be a non-trivial tree of children, with recursive make and sh
> > processes in it.
>
> We can't address recursive forkbombing in the oom killer with any
> efficiency, but luckily those cases aren't very common.

Right, I've never run a recursive make that brought my machine to its
knees. Oh, wait.

And why exactly is iterating over 1st-level children efficient enough
while doing it recursively is not? I don't find it significantly more
expensive, and badness() is hardly a bottleneck anyway.

> > A good way to detect this would be checking in badness() whether the
> > process has any children with relatively low CPU and real time values
> > (let's say less than a minute). If yes, the badness score should also
> > account for all these children, recursively. I'm not sure about the
> > exact formula; just summing up the memory usage like it is done now
> > does not fit your 0-1000 score idea, and it's also wrong because it
> > doesn't take sharing of memory into consideration (e.g. a KDE app with
> > several kdelibs-based children could achieve a massive score here
> > because of extensive sharing, even though the actual memory usage
> > increase caused by them could be insignificant). I don't know kernel
> > internals, so I don't know how feasible it would be, but one naive idea
> > would be to simply count how big a portion of the total memory all
> > these considered processes occupy.
>
> badness() can't be called recursively on children, so we need to look for
> simple metrics like the ones you mentioned: cpu time, for example.

I didn't mean calling badness() recursively, just computing the total
memory usage of the parent and all the children together (and trying not
to count shared parts more than once). If doing that accurately is too
expensive, summing rss+swap for the parent and using only unshared memory
for the children seems reasonably close to it and pretty cheap (surely if
the code can find out rss+swap for each child, it can equally easily find
out how much of it is unshared?). With the assumption that the shared
memory will hopefully be reasonably accounted for in the parent's memory
usage, and thus only the unshared portions of the children are needed,
this appears to be much more precise than trying to sum up rss for all,
like your proposal does.

> > This indeed would not handle the case when a tree of processes slowly
> > leaks, for example a bug in Apache causing all the forked children of
> > the master process to leak memory equally, with none of the single
> > children leaking enough to score more than a single unrelated innocent
> > process. Here I question how realistic such a scenario actually is, and
> > mainly the actual possibility of detecting such a case. I do not see
> > how code could distinguish this from the case of using Konsole or XTerm
> > to launch a number of unrelated KDE/X applications, each of which
> > occupies a considerable amount of memory. Here clearly killing the
> > Konsole/XTerm and all the spawned applications with it is incorrect, so
> > with no obvious offender the OOM killer would simply have to pick
> > something.
>
> The memory consumption of these children was not considered in my rough
> draft; it was simply a counter of how many first-generation children each
> task has.

Why exactly do you think only 1st-generation children matter? Look again
at the process tree I posted and you'll see it solves nothing there. I
still fail to see why counting all other generations as well should be
considered anything more than a negligible penalty for something that's
not a bottleneck at all.

> When Konsole forks a very large number of tasks, is it
> unreasonable to bias it with a penalty of perhaps 10% of RAM?

It appears unreasonable to penalize it with a random magic number.

> Or should we take the lowest rss size of those children, multiply it by
> the number of children, and penalize the parent after it reaches a
> certain threshold (500? 1000?) as the "cost of running the parent"?

And another magic number. And again it doesn't solve my initial problem,
where the number of processes taking part in the OOM situation is counted
only in tens. Simply computing the cost of the whole children subtree (or
a reasonable approximation) avoids the need for any magic numbers and
gives a much better representation of how costly the subtree is, since,
well, it is the cost itself.

> This heavily biases parents that have forked a very large number of small
> threads that would be directly killed, whereas the large, memory-hog
> children are killed themselves based on their own badness() heuristic.

--
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-10 20:54 ` Lubos Lunak
@ 2010-02-10 21:09 ` Rik van Riel
2010-02-10 21:34 ` Lubos Lunak
2010-02-10 22:25 ` David Rientjes
1 sibling, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2010-02-10 21:09 UTC (permalink / raw)
To: Lubos Lunak
Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On 02/10/2010 03:54 PM, Lubos Lunak wrote:
> Simply computing the cost of the whole children subtree (or a reasonable
> approximation) avoids the need for any magic numbers and gives a much
> better representation of how costly the subtree is, since, well, it is
> the cost itself.

That assumes you want to kill off that entire tree.

You will not want to do that when a web server or database server runs
out of memory, because the goal of the OOM killer is to allow the system
to continue to run and be useful. This means keeping the services
available...

--
All rights reversed.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-10 21:09 ` Rik van Riel
@ 2010-02-10 21:34 ` Lubos Lunak
0 siblings, 0 replies; 53+ messages in thread
From: Lubos Lunak @ 2010-02-10 21:34 UTC (permalink / raw)
To: Rik van Riel
Cc: David Rientjes, Balbir Singh, linux-mm, linux-kernel, Andrew Morton,
KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wednesday 10 of February 2010, Rik van Riel wrote:
> On 02/10/2010 03:54 PM, Lubos Lunak wrote:
> > Simply computing the cost of the whole children subtree (or a
> > reasonable approximation) avoids the need for any magic numbers and
> > gives a much better representation of how costly the subtree is, since,
> > well, it is the cost itself.
>
> That assumes you want to kill off that entire tree.

As said in another mail, I think I actually do, since the entire tree is
identified as the problem. But regardless of that, surely computing the
cost of a forkbomb as something close to its actual cost is better than
trying magic numbers?

--
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer
2010-02-10 20:54 ` Lubos Lunak
2010-02-10 21:09 ` Rik van Riel
@ 2010-02-10 22:25 ` David Rientjes
2010-02-11 10:16 ` Lubos Lunak
1 sibling, 1 reply; 53+ messages in thread
From: David Rientjes @ 2010-02-10 22:25 UTC (permalink / raw)
To: Lubos Lunak
Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Wed, 10 Feb 2010, Lubos Lunak wrote:

> > Yes, forkbombs are not always malicious; they can be the result of
> > buggy code, and there's no other kernel mechanism that will hold them
> > off so that the machine is still usable. If a task forks and execve's
> > thousands of threads on your 2GB desktop machine - whether it's
> > malicious, it's a bug, or the user made a mistake - that's going to be
> > detrimental depending on the nature of what was executed, especially to
> > your interactivity :) Keep in mind that the forking parent, such as a
> > job scheduler or terminal, and all of its individual children may have
> > very small rss and swap statistics, even though cumulatively they're a
> > problem.
>
> Which is why I suggested summing up the memory of the parent and its
> children.

That's almost identical to the current heuristic, where we sum half the
size of the children's VM size; unfortunately it's not a good indicator of
forkbombs, since in your particular example it would be detrimental to
kdeinit. My heuristic considers the runtime of the children as an
indicator of a forkbombing parent, since such tasks don't typically get to
run anyway. The rss or swap usage of a child with a separate address
space simply isn't relevant to the badness score of the parent; it
unfairly penalizes medium/large server jobs.

> > We can't address recursive forkbombing in the oom killer with any
> > efficiency, but luckily those cases aren't very common.
>
> Right, I've never run a recursive make that brought my machine to its
> knees. Oh, wait.

That's completely outside the scope of the oom killer, though: it is
_not_ the oom killer's responsibility to enforce a kernel-wide forkbomb
policy, which would be much better handled at execve() time. It's a very
small part of my badness heuristic, depending on the average size of the
children's rss and swap usage, because we want to slightly penalize tasks
that fork an extremely large number of tasks that have no substantial
runtime; memory is being consumed but very little work is getting done by
those thousand children. This would more often than not be used only to
break ties when two parents have similar memory consumption themselves
but one is obviously oversubscribing the system.

> And why exactly is iterating over 1st-level children efficient enough
> while doing it recursively is not? I don't find it significantly more
> expensive, and badness() is hardly a bottleneck anyway.

If we look at children's memory usage recursively, then we'll always end
up selecting init_task.

> > The memory consumption of these children was not considered in my rough
> > draft; it was simply a counter of how many first-generation children
> > each task has.
>
> Why exactly do you think only 1st-generation children matter? Look again
> at the process tree I posted and you'll see it solves nothing there. I
> still fail to see why counting all other generations as well should be
> considered anything more than a negligible penalty for something that's
> not a bottleneck at all.

You're specifying a problem that is outside the scope of the oom killer,
sorry.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: Improving OOM killer 2010-02-10 22:25 ` David Rientjes @ 2010-02-11 10:16 ` Lubos Lunak 2010-02-11 21:17 ` David Rientjes 0 siblings, 1 reply; 53+ messages in thread From: Lubos Lunak @ 2010-02-11 10:16 UTC (permalink / raw) To: David Rientjes Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin, Jiri Kosina On Wednesday 10 of February 2010, David Rientjes wrote: > On Wed, 10 Feb 2010, Lubos Lunak wrote: > > Which is why I suggested summing up the memory of the parent and its > > children. > > That's almost identical to the current heuristic where we sum half the > size of the children's VM size, unfortunately it's not a good indicator of > forkbombs since in your particular example it would be detrimental to > kdeinit. I believe that with the algorithm no longer using VmSize and being careful not to count shared memory more than once this would not be an issue and kdeinit would be reasonably safe. KDE does not use _that_ much memory to score higher than something that caused OOM :). > My heursitic considers runtime of the children as an indicator > of a forkbombing parent since such tasks don't typically get to run > anyway. The rss or swap usage of a child with a seperate address space > simply isn't relevant to the badness score of the parent, it unfairly > penalizes medium/large server jobs. Our definitions of 'forkbomb' then perhaps differ a bit. I consider 'make -j100' a kind of a forkbomb too, it will very likely overload the machine too as soon as the gcc instances use up all the memory. For that reason also using CPU time <1second will not work here, while using real time <1minute would. That long timeout would have the weakness that when running at the same time reasonable 'make -j4' and Firefox that'd immediatelly go crazy, then maybe the make job could be targeted instead if its total cost would go higher. 
However, here I again believe that fixed metrics for computing memory
usage would work well enough that this would happen only when the total
cost of the make job was actually higher than that of the offender, and in
that case it is kind of an offender too.

 Your protection seems to cover only "for(;;) if(fork() == 0) break;",
while I believe mine could handle also "make -j100" or the bash forkbomb
":() { :|:& };:" (i.e. "for(;;) fork();").

> > > We can't address recursive forkbombing in the oom killer with any
> > > efficiency, but luckily those cases aren't very common.
> >
> > Right, I've never run a recursive make that brought my machine to its
> > knees. Oh, wait.
>
> That's completely outside the scope of the oom killer, though: it is _not_
> the oom killer's responsibility for enforcing a kernel-wide forkbomb
> policy

 Why? It repeatedly causes OOM here (and in fact it is the only common OOM
or forkbomb I ever encounter). If the OOM killer is the right place to
protect against a forkbomb that spawns a large number of 1st level
children, then I don't see how this is different.

> > And why exactly is iterating over 1st level children efficient enough
> > and doing that recursively is not? I don't find it significantly more
> > expensive and badness() is hardly a bottleneck anyway.
>
> If we look at children's memory usage recursively, then we'll always end
> up selecting init_task.

 Not if the algorithm does not propagate the top of the problematic
subtree higher; see my reply to Alan Cox.

> > Why exactly do you think only 1st generation children matter? Look again
> > at the process tree posted by me and you'll see it solves nothing there.
> > I still fail to see why counting also all other generations should be
> > considered anything more than a negligible penalty for something that's
> > not a bottleneck at all.
>
> You're specifying a problem that is outside the scope of the oom killer,
> sorry.
But it could be inside the scope, since it causes OOM, and I don't think
it's an unrealistic or rare use case. I don't consider it less likely than
spawning a large number of direct children. If you want to cover only
certified reasons for causing OOM, it can just as well be said that
userspace is not allowed to cause OOM at all.

-- 
Lubos Lunak
openSUSE Boosters team, KDE developer
l.lunak@suse.cz , l.lunak@kde.org
* Re: Improving OOM killer
  2010-02-11 10:16 ` Lubos Lunak
@ 2010-02-11 21:17   ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread

From: David Rientjes @ 2010-02-11 21:17 UTC (permalink / raw)
To: Lubos Lunak
Cc: Balbir Singh, Rik van Riel, linux-mm, linux-kernel, Andrew Morton,
    KOSAKI Motohiro, Nick Piggin, Jiri Kosina

On Thu, 11 Feb 2010, Lubos Lunak wrote:

> I believe that with the algorithm no longer using VmSize and being careful
> not to count shared memory more than once this would not be an issue and
> kdeinit would be reasonably safe. KDE does not use _that_ much memory to
> score higher than something that caused OOM :).
>

Your suggestion of summing up the memory of the parent and its children
would clearly bias kdeinit if it forks most of kde's threads as you
mentioned earlier in the thread. Imagine it, or another server application
that Rik mentioned, with all children being first generation: it would
always be selected even if it is the only task operating on the system.
For a web server, for instance, where each query is handled by a separate
thread, we'd obviously prefer to kill a child thread instead of making the
entire server unresponsive. That type of algorithm in the oom killer,
killing the parent instead, is just a non-starter.

> Our definitions of 'forkbomb' then perhaps differ a bit. I consider
> 'make -j100' a kind of a forkbomb too, it will very likely overload the
> machine too as soon as the gcc instances use up all the memory. For that
> reason also using CPU time <1second will not work here, while using real
> time <1minute would.
>

1 minute? Unless you've got one of SGI's 4K-cpu machines, these 1000
threads would never actually get any runtime _at_all_ in such
circumstances, so that threshold is unreasonable.
A valid point that wasn't raised is that although we can't always detect
out-of-control forking applications, we certainly should do some due
diligence in making sure other applications aren't unfairly penalized when
you do make -j100, for example. That's not the job of the forkbomb
detector in my heuristic, however; it's the job of the baseline itself. In
such scenarios (and when we can't allocate or free any memory), the
baseline is responsible for identifying these tasks and killing them
itself because they are using an excessive amount of memory.

> Your protection seems to cover only "for(;;) if(fork() == 0) break;",
> while I believe mine could handle also "make -j100" or the bash forkbomb
> ":() { :|:& };:" (i.e. "for(;;) fork();").
>

Again, it's not protection against forkbombs: the oom killer is not the
place where you want to enforce any policy that prohibits that.

> Why? It repeatedly causes OOM here (and in fact it is the only common OOM
> or forkbomb I ever encounter). If OOM killer is the right place to protect
> against a forkbomb that spawns a large number of 1st level children, then
> I don't see how this is different.
>

We're not protecting against a large number of first-generation children,
we're simply penalizing them because the oom killer chooses to kill a
large memory-hogging task instead of the parent first. This shouldn't be
described as "forkbomb detection" because that's outside the scope of the
oom killer, or the VM for that matter.
* Re: Improving OOM killer
  2010-02-04  0:05 ` David Rientjes
  2010-02-04  0:18 ` Rik van Riel
  2010-02-04  7:58 ` Lubos Lunak
@ 2010-02-04  9:50 ` Jiri Kosina
  2010-02-04 21:39   ` David Rientjes
  2 siblings, 1 reply; 53+ messages in thread

From: Jiri Kosina @ 2010-02-04 9:50 UTC (permalink / raw)
To: David Rientjes
Cc: Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm, linux-kernel,
    Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Wed, 3 Feb 2010, David Rientjes wrote:

> > > 	/* Forkbombs get penalized 10% of available RAM */
> > > 	if (forkcount > 500)
> > > 		points += 100;
> >
> > As far as I'm concerned, this is a huge improvement over the current code
> > (and, incidentally :), quite close to what I originally wanted). I'd be
> > willing to test it in few real-world desktop cases if you provide a patch.

[ ... ]

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic? I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

Why does the OOM killer care about forkbombs *at all*?

If we really want the kernel to detect forkbombs (*), we'd have to
establish completely separate infrastructure for that (with its own knobs
for tuning and the possibility of disabling it completely). The task of
the OOM killer is to find the process that caused the system to run out of
memory and wipe it; it's as simple as that. The connection to forkbombs
seems so loose that I don't see it.

(*) How is a forkbomb even defined? Where does the magic constant in
'forkcount > 500' come from? If your aim is to penalize server processes
on very loaded web/database servers, then this is probably the correct
approach. Otherwise, I don't see the point.

Thanks,

-- 
Jiri Kosina
SUSE Labs, Novell Inc.
* Re: Improving OOM killer
  2010-02-04  9:50 ` Jiri Kosina
@ 2010-02-04 21:39   ` David Rientjes
  2010-02-05  7:35     ` Oliver Neukum
  0 siblings, 1 reply; 53+ messages in thread

From: David Rientjes @ 2010-02-04 21:39 UTC (permalink / raw)
To: Jiri Kosina
Cc: Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm, linux-kernel,
    Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Thu, 4 Feb 2010, Jiri Kosina wrote:

> Why does the OOM killer care about forkbombs *at all*?
>

Because the cumulative effects of a forkbomb are detrimental to the system
and the badness() heuristic favors large memory consumers very heavily.
Thus, the forkbomb is never really a strong candidate for oom kill since
the parent may consume very little memory itself, and meanwhile KDE or
another large memory consumer will get innocently killed instead as a
result.

> If we really want the kernel to detect forkbombs (*), we'd have to
> establish completely separate infrastructure for that (with its own knobs
> for tuning and the possibility of disabling it completely).
>

That's what we're trying to do: we can look at the sheer number of
children that the parent has forked and check whether it is over a certain
"forkbombing threshold" (which, yes, can be tuned from userspace), the
uptime of those children, their resident set size, etc., to attempt to
find a sane heuristic that penalizes them.

^ permalink raw reply	[flat|nested] 53+ messages in thread
* Re: Improving OOM killer
  2010-02-04 21:39 ` David Rientjes
@ 2010-02-05  7:35   ` Oliver Neukum
  2010-02-10  3:10     ` David Rientjes
  0 siblings, 1 reply; 53+ messages in thread

From: Oliver Neukum @ 2010-02-05 7:35 UTC (permalink / raw)
To: David Rientjes
Cc: Jiri Kosina, Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm,
    linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Thursday, 4 February 2010 22:39:08, David Rientjes wrote:
> > If we really want the kernel to detect forkbombs (*), we'd have to
> > establish completely separate infrastructure for that (with its own
> > knobs for tuning and the possibility of disabling it completely).
> >
>
> That's what we're trying to do: we can look at the sheer number of
> children that the parent has forked and check whether it is over a certain
> "forkbombing threshold" (which, yes, can be tuned from userspace), the
> uptime of those children, their resident set size, etc., to attempt to
> find a sane heuristic that penalizes them.

Wouldn't it be saner to have a selection by user, so that users that are
over the overcommit limit are targeted?

	Regards
		Oliver
* Re: Improving OOM killer
  2010-02-05  7:35 ` Oliver Neukum
@ 2010-02-10  3:10   ` David Rientjes
  0 siblings, 0 replies; 53+ messages in thread

From: David Rientjes @ 2010-02-10 3:10 UTC (permalink / raw)
To: Oliver Neukum
Cc: Jiri Kosina, Lubos Lunak, Balbir Singh, Rik van Riel, linux-mm,
    linux-kernel, Andrew Morton, KOSAKI Motohiro, Nick Piggin

On Fri, 5 Feb 2010, Oliver Neukum wrote:

> > That's what we're trying to do: we can look at the sheer number of
> > children that the parent has forked and check whether it is over a certain
> > "forkbombing threshold" (which, yes, can be tuned from userspace), the
> > uptime of those children, their resident set size, etc., to attempt to
> > find a sane heuristic that penalizes them.
>
> Wouldn't it be saner to have a selection by user, so that users that
> are over the overcommit limit are targeted?
>

It's rather unnecessary for the forkbomb case because it would then
unfairly penalize any user that runs lots of tasks. The forkbomb handling
code is really there to prevent user error, bugs in the application, or
maliciousness. The goal isn't necessarily to kill anything that forks an
egregious number of tasks (which is why we always prefer a child with a
separate address space over the parent, anyway) but rather to try to make
sure the system is still usable and recoverable.
end of thread, other threads:[~2010-02-11 21:17 UTC | newest]

Thread overview: 53+ messages
2010-02-01 22:02 Improving OOM killer Lubos Lunak
2010-02-01 23:53 ` David Rientjes
2010-02-02 21:10 ` Lubos Lunak
2010-02-03  1:41 ` David Rientjes
2010-02-03  1:52 ` KAMEZAWA Hiroyuki
2010-02-03  2:12 ` David Rientjes
2010-02-03  2:12 ` KAMEZAWA Hiroyuki
2010-02-03  2:36 ` [patch] sysctl: clean up vm related variable declarations David Rientjes
2010-02-03  8:07 ` KOSAKI Motohiro
2010-02-03  8:17 ` Balbir Singh
2010-02-03 22:54 ` Improving OOM killer Lubos Lunak
2010-02-04  0:00 ` David Rientjes
2010-02-03  7:50 ` KOSAKI Motohiro
2010-02-03  9:40 ` David Rientjes
2010-02-03  8:57 ` Balbir Singh
2010-02-03 12:10 ` Lubos Lunak
2010-02-03 12:25 ` Balbir Singh
2010-02-03 15:00 ` Minchan Kim
2010-02-03 16:06 ` Minchan Kim
2010-02-03 21:22 ` Lubos Lunak
2010-02-03 14:49 ` Rik van Riel
2010-02-03 17:01 ` Balbir Singh
2010-02-03 18:58 ` David Rientjes
2010-02-03 19:29 ` Frans Pop
2010-02-03 19:52 ` David Rientjes
2010-02-03 20:12 ` Frans Pop
2010-02-03 20:26 ` David Rientjes
2010-02-03 22:55 ` Lubos Lunak
2010-02-04  0:05 ` David Rientjes
2010-02-04  0:18 ` Rik van Riel
2010-02-04 21:48 ` David Rientjes
2010-02-04 22:06 ` Rik van Riel
2010-02-04 22:14 ` David Rientjes
2010-02-10 20:54 ` Lubos Lunak
2010-02-10 21:10 ` Rik van Riel
2010-02-10 21:29 ` Lubos Lunak
2010-02-10 22:18 ` Alan Cox
2010-02-10 22:31 ` David Rientjes
2010-02-11  9:50 ` Lubos Lunak
2010-02-04 22:31 ` Frans Pop
2010-02-04 22:53 ` David Rientjes
2010-02-04  7:58 ` Lubos Lunak
2010-02-04 21:34 ` David Rientjes
2010-02-10 20:54 ` Lubos Lunak
2010-02-10 21:09 ` Rik van Riel
2010-02-10 21:34 ` Lubos Lunak
2010-02-10 22:25 ` David Rientjes
2010-02-11 10:16 ` Lubos Lunak
2010-02-11 21:17 ` David Rientjes
2010-02-04  9:50 ` Jiri Kosina
2010-02-04 21:39 ` David Rientjes
2010-02-05  7:35 ` Oliver Neukum
2010-02-10  3:10 ` David Rientjes