* Terminate process that fails on a constrained allocation
@ 2006-02-08 18:05 Christoph Lameter
2006-02-08 18:13 ` Andi Kleen
2006-02-08 18:33 ` Paul Jackson
0 siblings, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 18:05 UTC (permalink / raw)
To: akpm; +Cc: pj, ak, linux-kernel
Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.
In particular, going postal and more or less randomly shooting at processes
is not likely to help the situation; it may just lead to suicide (the
whole system coming down).
It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is similar
to what we do when we run out of hugepages.
This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not call
the OOM killer.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 09:55:21.000000000 -0800
@@ -1011,6 +1011,28 @@ rebalance:
 		if (page)
 			goto got_pg;
 
+#ifdef CONFIG_NUMA
+	/*
+	 * In the NUMA case we may have gotten here because the
+	 * memory policies or cpusets have restricted the allocation.
+	 */
+	{
+		nodemask_t nodes;
+
+		nodes_clear(nodes);
+		for (z = zonelist->zones; *z; z++)
+			if (cpuset_zone_allowed(*z, gfp_mask))
+				node_set((*z)->zone_pgdat->node_id,
+					nodes);
+		/*
+		 * If we were only allowed to allocate from
+		 * a subset of nodes then terminate the process.
+		 */
+		if (!nodes_subset(node_online_map, nodes))
+			return NULL;
+	}
+#endif
+
 	out_of_memory(gfp_mask, order);
 	goto restart;
 }
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:05 Terminate process that fails on a constrained allocation Christoph Lameter
@ 2006-02-08 18:13 ` Andi Kleen
  2006-02-08 18:34   ` Paul Jackson
  2006-02-08 18:33 ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 18:13 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, pj, linux-kernel

On Wednesday 08 February 2006 19:05, Christoph Lameter wrote:

> This patch adds a check before the out of memory killer is invoked. At that
> point performance considerations do not matter much so we just scan the zonelist
> and reconstruct a list of nodes. If the list of nodes does not contain all
> online nodes then this is a constrained allocation and we should not call
> the OOM killer.

Looks good. I would have used a noinline function instead of putting the code
inline to prevent register pressure etc. and make it more readable.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:13 ` Andi Kleen
@ 2006-02-08 18:34   ` Paul Jackson
  2006-02-08 18:54     ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-02-08 18:34 UTC (permalink / raw)
To: Andi Kleen; +Cc: clameter, akpm, linux-kernel

Andi wrote:
> I would have used a noinline function instead of putting the code inline
> to prevent register pressure etc. and make it more readable.

Good idea.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:34   ` Paul Jackson
@ 2006-02-08 18:54     ` Christoph Lameter
  2006-02-08 19:01       ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 18:54 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andi Kleen, akpm, linux-kernel

On Wed, 8 Feb 2006, Paul Jackson wrote:

> Good idea.

Ok. Here is V2:

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular, going postal and more or less randomly shooting at processes
is not likely to help the situation; it may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is
similar to what we do when running out of hugepages.

This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much, so we just scan the
zonelist and reconstruct a list of nodes. If the list of nodes does not
contain all online nodes then this is a constrained allocation and we
should not call the OOM killer.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 10:53:08.000000000 -0800
@@ -817,6 +818,27 @@ failed:
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 
 /*
+ * Check if a given zonelist allows allocation over all the nodes
+ * in the system.
+ */
+int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+	struct zone **z;
+	nodemask_t nodes;
+
+	nodes = node_online_map;
+	for (z = zonelist->zones; *z; z++)
+		if (cpuset_zone_allowed(*z, gfp_mask))
+			node_clear((*z)->zone_pgdat->node_id,
+					nodes);
+	return !nodes_empty(nodes);
+#else
+	return 0;
+#endif
+}
+
+/*
  * Return 1 if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
@@ -1011,6 +1033,9 @@ rebalance:
 		if (page)
 			goto got_pg;
 
+	if (zonelist_incomplete(zonelist, gfp_mask))
+		return NULL;
+
 	out_of_memory(gfp_mask, order);
 	goto restart;
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:54     ` Christoph Lameter
@ 2006-02-08 19:01       ` Andi Kleen
  2006-02-08 19:15         ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 19:01 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul Jackson, akpm, linux-kernel

On Wednesday 08 February 2006 19:54, Christoph Lameter wrote:

> Index: linux-2.6.16-rc2/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-02 22:03:08.000000000 -0800
> +++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 10:53:08.000000000 -0800
> @@ -817,6 +818,27 @@ failed:
>  #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
> 
>  /*
> + * Check if a given zonelist allows allocation over all the nodes
> + * in the system.
> + */
> +int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)

static noinline

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 19:01       ` Andi Kleen
@ 2006-02-08 19:15         ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 19:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: Paul Jackson, akpm, linux-kernel

On Wed, 8 Feb 2006, Andi Kleen wrote:

> static noinline

Next rev:

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular, going postal and more or less randomly shooting at processes
is not likely to help the situation; it may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is
similar to what we do when running out of hugepages.

This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much, so we just scan the
zonelist and reconstruct a list of nodes. If the list of nodes does not
contain all online nodes then this is a constrained allocation and we
should not call the OOM killer.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 11:14:32.000000000 -0800
@@ -612,6 +612,7 @@ void drain_remote_pages(void)
 	}
 	local_irq_restore(flags);
 }
+
 #endif
 
 #if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
@@ -817,6 +818,27 @@ failed:
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
 
 /*
+ * Check if a given zonelist is complete and allows allocation over all
+ * the nodes in the system.
+ */
+static noinline int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+	struct zone **z;
+	nodemask_t nodes;
+
+	nodes = node_online_map;
+	for (z = zonelist->zones; *z; z++)
+		if (cpuset_zone_allowed(*z, gfp_mask))
+			node_clear((*z)->zone_pgdat->node_id,
+					nodes);
+	return !nodes_empty(nodes);
+#else
+	return 0;
+#endif
+}
+
+/*
  * Return 1 if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
@@ -1011,6 +1033,15 @@ rebalance:
 		if (page)
 			goto got_pg;
 
+	/*
+	 * If we allocated using a zonelist that does not include
+	 * all nodes in the system then the OOM situation may be
+	 * due to configured limitations. Just abort the application
+	 * instead of calling the OOM killer.
+	 */
+	if (zonelist_incomplete(zonelist, gfp_mask))
+		return NULL;
+
 	out_of_memory(gfp_mask, order);
 	goto restart;
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:05 Terminate process that fails on a constrained allocation Christoph Lameter
  2006-02-08 18:13 ` Andi Kleen
@ 2006-02-08 18:33 ` Paul Jackson
  2006-02-08 18:42   ` Christoph Lameter
  1 sibling, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-02-08 18:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, ak, linux-kernel

A couple of comments ...

It took me an extra couple of passes to understand this code. I wonder
if the following, essentially equivalent code (if I didn't break
something - never tested this) is easier to understand:

#ifdef CONFIG_NUMA
	/*
	 * In the NUMA case we may have gotten here because the
	 * memory policies or cpusets have restricted the allocation.
	 */
	{
		nodemask_t nodes;

		/* compute nodes not allowed */
		nodes = node_online_map;
		for (z = zonelist->zones; *z; z++)
			if (cpuset_zone_allowed(*z, gfp_mask))
				node_clear((*z)->zone_pgdat->node_id, nodes);
		/*
		 * If there are any nodes left set in 'nodes', these
		 * are nodes the cpuset or mempolicy settings aren't
		 * letting us use. In that case, return NULL to the
		 * current task, rather than invoking out_of_memory()
		 * on the system.
		 */
		if (!nodes_empty(nodes))
			return NULL;
	}
#endif

Second point - I thought I had already throttled the oom_killer to some
degree, with the lines, in mm/oom_kill.c select_bad_process():

	/* If p's nodes don't overlap ours, it won't help to kill p. */
	if (!cpuset_excl_nodes_overlap(p))
		continue;

What your patch is doing effectively disables the oom_killer for
big numa systems, rather than having it operate within the set
of tasks using overlapping resources.

Do we need this more radical constraint on the oom_killer?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:33 ` Paul Jackson
@ 2006-02-08 18:42   ` Christoph Lameter
  2006-02-08 18:57     ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 18:42 UTC (permalink / raw)
To: Paul Jackson; +Cc: akpm, ak, linux-kernel

On Wed, 8 Feb 2006, Paul Jackson wrote:

> What your patch is doing effectively disables the oom_killer for
> big numa systems, rather than having it operate within the set
> of tasks using overlapping resources.

No, it only disables the oom killer for constrained allocations. If the
big numa system uses a cpuset that limits the number of nodes then those
allocations occurring within these cpusets are constrained allocations,
which will then lead to the killing of the application that allocated too
much memory. I think this is much more consistent than trying to tame the
OOM killer enough to stay in a certain cpuset.

F.e. a sysadmin may mistakenly start a process allocating too much memory
in a cpuset. The OOM killer will then start randomly shooting other
processes, one of which may be a critical process. Ouch.

> Do we need this more radical constraint on the oom_killer?

Yes. One can make the OOM killer go postal if one specifies a memory
policy that contains only one node and then allocates enough memory.
Actually a good denial of service attack that can be used with cpusets
and memory policies.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:42   ` Christoph Lameter
@ 2006-02-08 18:57     ` Paul Jackson
  2006-02-08 19:02       ` Christoph Lameter
  2006-02-08 19:05       ` Andi Kleen
  0 siblings, 2 replies; 28+ messages in thread
From: Paul Jackson @ 2006-02-08 18:57 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, ak, linux-kernel

> No, it only disables the oom killer for constrained allocations.

But on many big numa systems, the way they are administered, that
effectively disables the oom killer.

> F.e. a sysadmin may mistakenly start a process allocating too much memory
> in a cpuset. The OOM killer will then start randomly shooting other
> processes, one of which may be a critical process. Ouch.

That same exact claim could be made to justify a patch that removed
mm/oom_kill.c entirely. And the same exact claim could be made, dropping
the "in a cpuset" phrase, on an UMA desktop PC.

The basic premise of the oom killer is that, instead of blowing off the
caller, who might innocently have asked for one page too many after some
other hog used up all the available memory, we try to pick the worst
offender.

Get the worst offender, not just whoever finally hit the limit.

> Actually a good denial of service attack that can be used with cpusets
> and memory policies.

Nothing special about cpusets there. The same DOS opportunity exists on
simple one cpu, one node systems.

...

Granted, I am objecting to this patch with mixed feelings. I've yet to be
convinced that the oom killer is our friend, and half of me (not
seriously) is almost wishing it were gone.

Would another option be to continue to fine tune the heuristics that the
oom killer uses to pick its next victim?

What situation did you hit that motivated this change?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:57     ` Paul Jackson
@ 2006-02-08 19:02       ` Christoph Lameter
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 19:02 UTC (permalink / raw)
To: Paul Jackson; +Cc: akpm, ak, linux-kernel

On Wed, 8 Feb 2006, Paul Jackson wrote:

> > F.e. a sysadmin may mistakenly start a process allocating too much memory
> > in a cpuset. The OOM killer will then start randomly shooting other
> > processes, one of which may be a critical process. Ouch.
>
> That same exact claim could be made to justify a patch that
> removed mm/oom_kill.c entirely. And the same exact claim
> could be made, dropping the "in a cpuset" phrase, on an UMA
> desktop PC.

Well, the justification for the OOM killer was that it keeps the system
as a whole alive ..... cpusets are not vital to the system.

Usually the folks using cpusets are very aware of the memory usage of
their processes and I think they would hate to have a random process
killed. The obvious case is that someone starts a process that allocates
a large amount of memory while other processes are already running.

> The basic premise of the oom killer is that, instead of blowing
> off the caller, who might innocently have asked for one page
> too many after some other hog used up all the available memory,
> we try to pick the worst offender.
>
> Get the worst offender, not just whoever finally hit the limit.

Typically the worst offender is whoever just asked for the page.

> I've yet to be convinced that the oom killer is our friend,
> and half of me (not seriously) is almost wishing it were
> gone.

I definitely agree with you. Lets at least keep it from doing harm as
much as possible.

> Would another option be to continue to fine tune the heuristics
> that the oom killer uses to pick its next victim?
>
> What situation did you hit that motivated this change?

mbind to one node. Allocate on that node until you trigger the OOM
killer.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 18:57     ` Paul Jackson
  2006-02-08 19:02       ` Christoph Lameter
@ 2006-02-08 19:05       ` Andi Kleen
  2006-02-08 20:22         ` Paul Jackson
  1 sibling, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 19:05 UTC (permalink / raw)
To: Paul Jackson; +Cc: Christoph Lameter, akpm, linux-kernel

On Wednesday 08 February 2006 19:57, Paul Jackson wrote:

> > No, it only disables the oom killer for constrained allocations.
>
> But on many big numa systems, the way they are administered,
> that effectively disables the oom killer.

I guess it won't matter because they are administered with appropriate
ulimits. And to be honest the OOM killer never really worked all that
well, so it's not a big loss.

[still often wish we had that "global virtual memory ulimit for uid"]

> I've yet to be convinced that the oom killer is our friend,
> and half of me (not seriously) is almost wishing it were
> gone.

It's more than half of me near seriously agreeing with you.

> Would another option be to continue to fine tune the heuristics
> that the oom killer uses to pick its next victim?
>
> What situation did you hit that motivated this change?

It's a long known design bug of the NUMA policy, but it recently hit
with some test program again.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 19:05       ` Andi Kleen
@ 2006-02-08 20:22         ` Paul Jackson
  2006-02-08 20:36           ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-02-08 20:22 UTC (permalink / raw)
To: Andi Kleen; +Cc: clameter, akpm, linux-kernel

If the oom killer is of any use on a numa system using set_mempolicy,
mbind and/or cpusets, then it would be of use in the bootcpuset -- that
cpuset which is often setup to hold the classic Unix daemons. I'm not
sure we want to do that.

If we changed the mm/oom_kill.c select_bad_process() constraint from

	/* If p's nodes don't overlap ours, it won't help to kill p. */
	if (!cpuset_excl_nodes_overlap(p))
		continue;

which kills any task that has any overlap of nodes with the current
task:

 1) to the much tighter limit of only killing tasks that are on the
    same or a subset of the same nodes, and
 2) changed it to look at -both- the cpuset and MPOL_BIND constraints,

then the oom killer would work in a bootcpuset rather as it does now on
simple UMA systems.

The new logic would be -- only kill tasks that are constrained to the
same or a subset of the same nodes as the current task.

With this, the problem you describe with the oom killer and a task that
has used mbind or cpuset to allow just a single node would be fixed --
only tasks constrained to just that node, no more, would be considered
for killing.

At the same time, a typical bootcpuset would have oom killer behaviour
resembling what people have become accustomed to.

If we are going to neuter or eliminate the oom killer, it seems like we
should do it across the board, not just on numa boxes using some form of
memory constraint (mempolicy or cpuset).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 20:22         ` Paul Jackson
@ 2006-02-08 20:36           ` Christoph Lameter
  2006-02-08 20:55             ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 20:36 UTC (permalink / raw)
To: Paul Jackson; +Cc: Andi Kleen, akpm, linux-kernel

On Wed, 8 Feb 2006, Paul Jackson wrote:

> The new logic would be -- only kill tasks that are constrained to
> the same or a subset of the same nodes as the current task.

If a task has restricted its memory allocation to one node and does
excessive allocations then that process needs to die, not other processes
that are harmlessly running on the node and that may not be allocating
memory at the time.

> At the same time, a typical bootcpuset would have oom killer behaviour
> resembling what people have become accustomed to.

People are accustomed to having random processes killed? <shudder>

> If we are going to neuter or eliminate the oom killer, it seems like
> we should do it across the board, not just on numa boxes using
> some form of memory constraint (mempolicy or cpuset).

We are terminating processes that perform restricted allocations.
Restricted allocations are only possible on NUMA boxes, so the phrase
"numa boxes using some form of memory constraint" is a tautology.

OOM killing makes sense for global allocations if the system is really
tight on memory and survival is the main goal, even if we have to
cannibalize processes. However, cannibalism is still a taboo.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 20:36           ` Christoph Lameter
@ 2006-02-08 20:55             ` Paul Jackson
  2006-02-08 21:01               ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-02-08 20:55 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, akpm, linux-kernel

Can anyone give us a good reason why we shouldn't just remove the oom
killer, entirely?

Christoph wrote:
> If a task has restricted its memory allocation to one node and does
> excessive allocations then that process needs to die, not other processes
> that are harmlessly running on the node and that may not be allocating
> memory at the time.

That _exact_ same argument applies to a system that only has one node.

If we want to remove the oom killer, lets just remove the oom killer.

> People are accustomed to having random processes killed? <shudder>

That's what the oom killer does ... well, it makes an honest effort not
to be random. So, yes, since it has been there a long time, people are
used to it. Maybe they don't like it, maybe with good reason. But it is
there.

> OOM killing makes
> sense for global allocations if the system is really tight on memory and
> survival is the main goal

If that argument justifies OOM killing on a simple UMA system, then
surely, for -some- critical tasks, it justifies it on a big NUMA system.

Either OOM is useful in some cases or it is not.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 20:55             ` Paul Jackson
@ 2006-02-08 21:01               ` Andi Kleen
  2006-02-08 21:03                 ` Paul Jackson
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 21:01 UTC (permalink / raw)
To: Paul Jackson; +Cc: Christoph Lameter, akpm, linux-kernel

On Wednesday 08 February 2006 21:55, Paul Jackson wrote:

> If that argument justifies OOM killing on a simple UMA system, then
> surely, for -some- critical tasks, it justifies it on a big NUMA system.
>
> Either OOM is useful in some cases or it is not.

I don't think you really want to open a full scale "is the oom killer
needed" thread. Check the archives - there have been some going on for
months.

But I think we can agree that together with mbind the oom killer is
pretty useless, can't we?

-Andi (who is definitely in favour of Christoph's latest patch)

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 21:01               ` Andi Kleen
@ 2006-02-08 21:03                 ` Paul Jackson
  2006-02-08 21:21                   ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Jackson @ 2006-02-08 21:03 UTC (permalink / raw)
To: Andi Kleen; +Cc: clameter, akpm, linux-kernel

> I don't think you really want to open a full scale "is the oom killer
> needed" thread. Check the archives - there have been some going on for
> months.
>
> But I think we can agree that together with mbind the oom killer is
> pretty useless, can't we?

Excellent points.

I approve this patch.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 21:03                 ` Paul Jackson
@ 2006-02-08 21:21                   ` Andi Kleen
  2006-02-08 21:39                     ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 21:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, akpm, linux-kernel

On Wednesday 08 February 2006 22:03, Paul Jackson wrote:

> > I don't think you really want to open a full scale "is the oom killer
> > needed" thread. Check the archives - there have been some going on for
> > months.
> >
> > But I think we can agree that together with mbind the oom killer is
> > pretty useless, can't we?
>
> Excellent points.
>
> I approve this patch.

I think it should be put into 2.6.16. Andrew?

I had the small objection about adding static noinline, but it's really
not important and the patch can be used as is.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 21:21                   ` Andi Kleen
@ 2006-02-08 21:39                     ` Andrew Morton
  2006-02-08 22:11                       ` Christoph Lameter
  2006-02-08 23:28                       ` Christoph Lameter
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Morton @ 2006-02-08 21:39 UTC (permalink / raw)
To: Andi Kleen; +Cc: pj, clameter, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> On Wednesday 08 February 2006 22:03, Paul Jackson wrote:
> > > I don't think you really want to open a full scale "is the oom killer
> > > needed" thread. Check the archives - there have been some going on for
> > > months.
> > >
> > > But I think we can agree that together with mbind the oom killer is
> > > pretty useless, can't we?
> >
> > Excellent points.
> >
> > I approve this patch.
>
> I think it should be put into 2.6.16. Andrew?

Does every single caller of __alloc_pages(__GFP_FS) correctly handle a
NULL return? I doubt it, in which case this patch will cause oopses and
hangs.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 21:39                     ` Andrew Morton
@ 2006-02-08 22:11                       ` Christoph Lameter
  2006-02-08 22:41                         ` Andi Kleen
  2006-02-08 22:48                         ` Christoph Lameter
  1 sibling, 2 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 22:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, pj, linux-kernel

On Wed, 8 Feb 2006, Andrew Morton wrote:

> > I think it should be put into 2.6.16. Andrew?
>
> Does every single caller of __alloc_pages(__GFP_FS) correctly handle a
> NULL return? I doubt it, in which case this patch will cause oopses and
> hangs.

I sent you a patch with static inline..... But I am having second
thoughts about this patch. Paul is partially right. Maybe we can move the
logic into the out_of_memory handler for now? That would allow us to
implement more sophisticated things later (for example page migration
would allow us to move memory of processes that can also allocate on
other nodes off the nodes where we lack memory) and Paul may put
something in there to address his concerns.

---

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular, going postal and more or less randomly shooting at processes
is not likely to help the situation; it may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is
similar to what we do when running out of hugepages.

This patch adds a check before the out of memory killer is invoked. At
that point performance considerations do not matter much, so we just scan
the zonelist and reconstruct a list of nodes. If the list of nodes does
not contain all online nodes then this is a constrained allocation and we
should not call the OOM killer.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 14:06:12.000000000 -0800
@@ -1011,8 +1012,10 @@ rebalance:
 		if (page)
 			goto got_pg;
 
-	out_of_memory(gfp_mask, order);
-	goto restart;
+	if (out_of_memory(zonelist, gfp_mask, order))
+		goto restart;
+
+	return NULL;
 }
 
 /*
Index: linux-2.6.16-rc2/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/swap.h	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/swap.h	2006-02-08 14:10:02.000000000 -0800
@@ -147,7 +147,7 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
 
 /* linux/mm/oom_kill.c */
-extern void out_of_memory(gfp_t gfp_mask, int order);
+extern int out_of_memory(struct zonelist *zl, gfp_t gfp_mask, int order);
 
 /* linux/mm/memory.c */
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
Index: linux-2.6.16-rc2/mm/oom_kill.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/oom_kill.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/oom_kill.c	2006-02-08 14:09:43.000000000 -0800
@@ -131,6 +131,27 @@ unsigned long badness(struct task_struct
 }
 
 /*
+ * Check if a given zonelist allows allocation over all the nodes
+ * in the system.
+ */
+static noinline int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+	struct zone **z;
+	nodemask_t nodes;
+
+	nodes = node_online_map;
+	for (z = zonelist->zones; *z; z++)
+		if (cpuset_zone_allowed(*z, gfp_mask))
+			node_clear((*z)->zone_pgdat->node_id,
+					nodes);
+	return !nodes_empty(nodes);
+#else
+	return 0;
+#endif
+}
+
+/*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
  *
@@ -262,12 +283,23 @@ static struct mm_struct *oom_kill_proces
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
+ *
+ * Function returns 1 if the allocation should be retried,
+ * zero if the allocation should fail.
  */
-void out_of_memory(gfp_t gfp_mask, int order)
+int out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 {
 	struct mm_struct *mm = NULL;
 	task_t * p;
 
+	/*
+	 * Simply fail an allocation that does not allow
+	 * allocation on all nodes. We may want to have
+	 * more sophisticated logic here in the future.
+	 */
+	if (zonelist_incomplete(zonelist, gfp_mask))
+		return 0;
+
 	if (printk_ratelimit()) {
 		printk("oom-killer: gfp_mask=0x%x, order=%d\n",
 			gfp_mask, order);
@@ -306,4 +338,5 @@ retry:
 	 */
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_interruptible(1);
+	return 1;
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 22:11           ` Christoph Lameter
@ 2006-02-08 22:41             ` Andi Kleen
  2006-02-08 23:29               ` Christoph Lameter
  2006-02-08 23:35               ` Andrew Morton
  2006-02-08 22:48             ` Christoph Lameter
  1 sibling, 2 replies; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 22:41 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, pj, linux-kernel

On Wednesday 08 February 2006 23:11, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andrew Morton wrote:
>
> > > I think it should be put into 2.6.16. Andrew?
> >
> > Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> > return? I doubt it, in which case this patch will cause oopses and hangs.
>
> I sent you a patch with static inline..... noinline
>
> But I am having second thoughts
> about this patch. Paul is partially right. Maybe we can move the logic
> into the out_of_memory handler for now? That would allow us to implement
> more sophisticated things later

I have my doubts that's really worth it, but ok.

> (for example page migration would allow us
> to move memory of processes that can also allocate on other nodes from the
> nodes where we lack memory) and Paul may put something in there to
> address his concerns.
>
> ---
>
> Terminate process that fails on a constrained allocation

Patch looks good to me too. Thanks.

Unfortunately Andrew's point with the GFP_NOFS still applies :/
But I would consider any caller of this not handling NULL to be broken.
Andrew, do you have any stronger evidence it's a real problem?

Another way would be to force a default non-strict policy with GFP_NOFS,
but that would be somewhat ugly again and impact the fast paths.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 22:41             ` Andi Kleen
@ 2006-02-08 23:29               ` Christoph Lameter
  2006-02-08 23:35               ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 23:29 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, pj, linux-kernel

On Wed, 8 Feb 2006, Andi Kleen wrote:

> Unfortunately Andrew's point with the GFP_NOFS still applies :/
> But I would consider any caller of this not handling NULL be broken.
> Andrew do you have any stronger evidence it's a real problem?

I would think that a caller will have to set __GFP_NOFAIL if a NULL is
unacceptable.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 22:41             ` Andi Kleen
  2006-02-08 23:29               ` Christoph Lameter
@ 2006-02-08 23:35               ` Andrew Morton
  1 sibling, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2006-02-08 23:35 UTC (permalink / raw)
To: Andi Kleen; +Cc: clameter, pj, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> Unfortunately Andrew's point with the GFP_NOFS still applies :/
> But I would consider any caller of this not handling NULL be broken.

Sure.

> Andrew do you have any stronger evidence it's a real problem?

No.  About a million years ago I had a make-alloc_pages-fail-at-1%-rate
debug patch.  Doing that again would be an interesting exercise.

Another option is to only fail userspace allocations (__GFP_HIGHMEM is set,
or a new flag in GFP_HIGHUSER).  For the workloads which the NUMA guys care
about I expect this is a 99% solution and the amount of code which it puts
at risk is vastly less.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 22:11           ` Christoph Lameter
  2006-02-08 22:41             ` Andi Kleen
@ 2006-02-08 22:48             ` Christoph Lameter
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 22:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, pj, linux-kernel

Cleaned up and tested version. Better comments, and I had forgotten to
modify drivers/char/sysrq.c:

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely going to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is
similar to what we do when running out of hugepages.

This patch adds a check before we kill processes. At that point performance
considerations do not matter much so we just scan the zonelist and
reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not kill
processes but fail the allocation.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c	2006-02-08 14:28:50.000000000 -0800
@@ -1011,8 +1011,10 @@ rebalance:
 	if (page)
 		goto got_pg;

-	out_of_memory(gfp_mask, order);
-	goto restart;
+	if (out_of_memory(zonelist, gfp_mask, order))
+		goto restart;
+
+	return NULL;
 }

 /*
Index: linux-2.6.16-rc2/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/swap.h	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/swap.h	2006-02-08 14:28:50.000000000 -0800
@@ -147,7 +147,7 @@ struct swap_list_t {
 #define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

 /* linux/mm/oom_kill.c */
-extern void out_of_memory(gfp_t gfp_mask, int order);
+extern int out_of_memory(struct zonelist *zl, gfp_t gfp_mask, int order);

 /* linux/mm/memory.c */
 extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
Index: linux-2.6.16-rc2/mm/oom_kill.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/oom_kill.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/oom_kill.c	2006-02-08 14:36:27.000000000 -0800
@@ -131,6 +131,27 @@ unsigned long badness(struct task_struct
 }

 /*
+ * check if a given zonelist allows allocation over all the nodes
+ * in the system.
+ */
+static noinline int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+	struct zone **z;
+	nodemask_t nodes;
+
+	nodes = node_online_map;
+	for (z = zonelist->zones; *z; z++)
+		if (cpuset_zone_allowed(*z, gfp_mask))
+			node_clear((*z)->zone_pgdat->node_id,
+					nodes);
+	return !nodes_empty(nodes);
+#else
+	return 0;
+#endif
+}
+
+/*
  * Simple selection loop. We chose the process with the highest
  * number of 'points'. We expect the caller will lock the tasklist.
  *
@@ -256,18 +277,33 @@ static struct mm_struct *oom_kill_proces
 }

 /**
- * oom_kill - kill the "best" process when we run out of memory
+ * out_of_memory - deal with out of memory situations
+ *
+ * If we are out of memory then check if this was due to the allocation
+ * being restricted to only a part of system memory. If so then
+ * fail the allocation.
  *
  * If we run out of memory, we have the choice between either
  * killing a random task (bad), letting the system crash (worse)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
+ *
+ * Returns 1 if the allocation should be retried.
+ * 0 if the allocation should fail.
  */
-void out_of_memory(gfp_t gfp_mask, int order)
+int out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
 {
 	struct mm_struct *mm = NULL;
 	task_t * p;

+	/*
+	 * Simply fail an allocation that does not allow
+	 * allocation on all nodes. We may want to have
+	 * more sophisticated logic here in the future.
+	 */
+	if (zonelist_incomplete(zonelist, gfp_mask))
+		return 0;
+
 	if (printk_ratelimit()) {
 		printk("oom-killer: gfp_mask=0x%x, order=%d\n",
 			gfp_mask, order);
@@ -306,4 +342,5 @@ retry:
 	 */
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_interruptible(1);
+	return 1;
 }
Index: linux-2.6.16-rc2/drivers/char/sysrq.c
===================================================================
--- linux-2.6.16-rc2.orig/drivers/char/sysrq.c	2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/drivers/char/sysrq.c	2006-02-08 14:29:17.000000000 -0800
@@ -243,7 +243,7 @@ static struct sysrq_key_op sysrq_term_op

 static void moom_callback(void *ignored)
 {
-	out_of_memory(GFP_KERNEL, 0);
+	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL], GFP_KERNEL, 0);
 }

 static DECLARE_WORK(moom_work, moom_callback, NULL);

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 21:39         ` Andrew Morton
  2006-02-08 22:11           ` Christoph Lameter
@ 2006-02-08 23:28           ` Christoph Lameter
  2006-02-08 23:43             ` Andrew Morton
  1 sibling, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 23:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, pj, linux-kernel

On Wed, 8 Feb 2006, Andrew Morton wrote:

> > I think it should be put into 2.6.16. Andrew?
>
> Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> return? I doubt it, in which case this patch will cause oopses and hangs.

If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right? So
we will never get to the piece of code under discussion.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 23:28           ` Christoph Lameter
@ 2006-02-08 23:43             ` Andrew Morton
  2006-02-08 23:54               ` Christoph Lameter
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-02-08 23:43 UTC (permalink / raw)
To: Christoph Lameter; +Cc: ak, pj, linux-kernel

Christoph Lameter <clameter@engr.sgi.com> wrote:
>
> On Wed, 8 Feb 2006, Andrew Morton wrote:
> > > I think it should be put into 2.6.16. Andrew?
> >
> > Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> > return? I doubt it, in which case this patch will cause oopses and hangs.
>
> If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?

That would assume non-buggy code.  I'm talking about the exercising of
hitherto-unused codepaths.  We've fixed many, many pieces of code which
simply assumed that kmalloc(GFP_KERNEL) succeeds.  I doubt if many such
simple bugs still exist, but there will be more subtle ones in there.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 23:43             ` Andrew Morton
@ 2006-02-08 23:54               ` Christoph Lameter
  2006-02-08 23:57                 ` Andi Kleen
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Lameter @ 2006-02-08 23:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: ak, pj, linux-kernel

On Wed, 8 Feb 2006, Andrew Morton wrote:

> > If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?
>
> That would assume non-buggy code.  I'm talking about the exercising of
> hitherto-unused codepaths.  We've fixed many, many pieces of code which
> simply assumed that kmalloc(GFP_KERNEL) succeeds.  I doubt if many such
> simple bugs still exist, but there will be more subtle ones in there.

We could add __GFP_NOFAIL to kmem_getpages in slab.c to ensure that
kmalloc waits rather than returning NULL. Too drastic a measure, right?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
  2006-02-08 23:54               ` Christoph Lameter
@ 2006-02-08 23:57                 ` Andi Kleen
  0 siblings, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2006-02-08 23:57 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, pj, linux-kernel

On Thursday 09 February 2006 00:54, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andrew Morton wrote:
>
> > > If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?
> >
> > That would assume non-buggy code.  I'm talking about the exercising of
> > hitherto-unused codepaths.  We've fixed many, many pieces of code which
> > simply assumed that kmalloc(GFP_KERNEL) succeeds.  I doubt if many such
> > simple bugs still exist, but there will be more subtle ones in there.
>
> We could add __GFP_NOFAIL to kmem_getpages in slab.c to ensure that
> kmalloc waits rather than returning NULL. Too drastic a measure, right?

Definitely too drastic. I bet that would cause deadlocks under quite a few
workloads.

-Andi

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Terminate process that fails on a constrained allocation
@ 2006-02-08 20:14 linux
  0 siblings, 0 replies; 28+ messages in thread
From: linux @ 2006-02-08 20:14 UTC (permalink / raw)
To: clameter; +Cc: linux-kernel

Would perhaps a less drastic solution, that at least supports the common
partitioned-system configuration, be to limit the oom killer to processes
whose nodes are a *subset* of ours?

That way, a limited process won't kill any unlimited processes, but it can
fight with other processes with the same limits. However, an unlimited
process discovering oom can kill anything on the system if necessary.

(This requires a modified version of cpuset_excl_nodes_overlap that calls
nodes_subset() instead of nodes_intersects(), but it's pretty
straightforward.)

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2006-02-08 23:57 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-08 18:05 Terminate process that fails on a constrained allocation Christoph Lameter
2006-02-08 18:13 ` Andi Kleen
2006-02-08 18:34   ` Paul Jackson
2006-02-08 18:54     ` Christoph Lameter
2006-02-08 19:01       ` Andi Kleen
2006-02-08 19:15         ` Christoph Lameter
2006-02-08 18:33 ` Paul Jackson
2006-02-08 18:42   ` Christoph Lameter
2006-02-08 18:57     ` Paul Jackson
2006-02-08 19:02       ` Christoph Lameter
2006-02-08 19:05         ` Andi Kleen
2006-02-08 20:22           ` Paul Jackson
2006-02-08 20:36             ` Christoph Lameter
2006-02-08 20:55               ` Paul Jackson
2006-02-08 21:01                 ` Andi Kleen
2006-02-08 21:03                   ` Paul Jackson
2006-02-08 21:21                     ` Andi Kleen
2006-02-08 21:39                       ` Andrew Morton
2006-02-08 22:11                         ` Christoph Lameter
2006-02-08 22:41                           ` Andi Kleen
2006-02-08 23:29                             ` Christoph Lameter
2006-02-08 23:35                             ` Andrew Morton
2006-02-08 22:48                           ` Christoph Lameter
2006-02-08 23:28                         ` Christoph Lameter
2006-02-08 23:43                           ` Andrew Morton
2006-02-08 23:54                             ` Christoph Lameter
2006-02-08 23:57                               ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2006-02-08 20:14 linux