* [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
@ 2007-06-05 22:39 David Rientjes
2007-06-05 22:40 ` Christoph Lameter
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-05 22:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel
Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5.
OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access
memory outside its cpuset because it could potentially cause other
exclusive cpusets to OOM themselves.
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
kernel/cpuset.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
if (node_isset(node, current->mems_allowed))
return 1;
- /*
- * Allow tasks that have access to memory reserves because they have
- * been OOM killed to get memory anywhere.
- */
- if (unlikely(test_thread_flag(TIF_MEMDIE)))
- return 1;
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
return 0;
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes @ 2007-06-05 22:40 ` Christoph Lameter 2007-06-05 22:42 ` David Rientjes 2007-06-05 23:01 ` Paul Jackson 2007-06-06 6:20 ` Peter Zijlstra 2 siblings, 1 reply; 30+ messages in thread From: Christoph Lameter @ 2007-06-05 22:40 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel On Tue, 5 Jun 2007, David Rientjes wrote: > Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5. > > OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access > memory outside its cpuset because it could potentially cause other > exclusive cpusets to OOM themselves. Those terminating tasks need that memory to terminate properly right? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 22:40 ` Christoph Lameter @ 2007-06-05 22:42 ` David Rientjes 0 siblings, 0 replies; 30+ messages in thread From: David Rientjes @ 2007-06-05 22:42 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel On Tue, 5 Jun 2007, Christoph Lameter wrote: > On Tue, 5 Jun 2007, David Rientjes wrote: > > > Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5. > > > > OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access > > memory outside its cpuset because it could potentially cause other > > exclusive cpusets to OOM themselves. > > Those terminating tasks need that memory to terminate properly right? > No, they need access only to the memory reserves provided by the difference in their low and no watermarks on a per-cpuset basis. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes 2007-06-05 22:40 ` Christoph Lameter @ 2007-06-05 23:01 ` Paul Jackson 2007-06-05 23:16 ` David Rientjes 2007-06-06 6:20 ` Peter Zijlstra 2 siblings, 1 reply; 30+ messages in thread From: Paul Jackson @ 2007-06-05 23:01 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, ak, clameter, linux-kernel > OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access > memory outside its cpuset because it could potentially cause other > exclusive cpusets to OOM themselves. I'm a little surprised at this suggested change -- I'd have thought that it was a good idea to let tasks marked for extinction get memory anywhere, as they were going to use that memory to exit, and free up lots more memory. I'm pretty sure we have this same policy in other places in the kernel, besides cpusets. Did you intend to change them too? If a MEMDIE task is taking enough memory to OOM other tasks anywhere in the system, then doesn't that mean your entire system was in deep yogurt, and we're just haggling over who to blame for the upcoming crash? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:01 ` Paul Jackson @ 2007-06-05 23:16 ` David Rientjes 2007-06-05 23:19 ` Paul Jackson 2007-06-05 23:20 ` Christoph Lameter 0 siblings, 2 replies; 30+ messages in thread From: David Rientjes @ 2007-06-05 23:16 UTC (permalink / raw) To: Paul Jackson; +Cc: akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Paul Jackson wrote: > I'm a little surprised at this suggested change -- I'd have thought > that it was a good idea to let tasks marked for extinction get memory > anywhere, as they were going to use that memory to exit, and free up > lots more memory. > The intended purpose of TIF_MEMDIE was to allocate pages without being bound by the watermarks so that they have access to memory reserves on the per-zone level. If the cpuset doesn't have access to a zone, whether it's memory reserve or not, it shouldn't allocate there. > I'm pretty sure we have this same policy in other places in the > kernel, besides cpusets. Did you intend to change them too? > You'd have to cite them first. > If a MEMDIE task is taking enough memory to OOM other tasks anywhere > in the system, then doesn't that mean your entire system was in deep > yogurt, and we're just haggling over who to blame for the upcoming > crash? > No, it means that it can allocate anywhere based on the zonelist ordering and then can OOM a very small exclusive cpuset that would never have had any memory pressure if it wasn't violated. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:16 ` David Rientjes @ 2007-06-05 23:19 ` Paul Jackson 2007-06-05 23:20 ` Christoph Lameter 1 sibling, 0 replies; 30+ messages in thread From: Paul Jackson @ 2007-06-05 23:19 UTC (permalink / raw) To: David Rientjes; +Cc: akpm, ak, clameter, linux-kernel > The intended purpose of TIF_MEMDIE was to allocate pages without being Ok then ... you probably right. I'll stand down. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:16 ` David Rientjes 2007-06-05 23:19 ` Paul Jackson @ 2007-06-05 23:20 ` Christoph Lameter 2007-06-05 23:25 ` David Rientjes 1 sibling, 1 reply; 30+ messages in thread From: Christoph Lameter @ 2007-06-05 23:20 UTC (permalink / raw) To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, David Rientjes wrote: > No, it means that it can allocate anywhere based on the zonelist ordering > and then can OOM a very small exclusive cpuset that would never have had > any memory pressure if it wasn't violated. But the alternative is that the exiting process does not save its data. What is this very small exclusive cpuset? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:20 ` Christoph Lameter @ 2007-06-05 23:25 ` David Rientjes 2007-06-05 23:32 ` Christoph Lameter 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2007-06-05 23:25 UTC (permalink / raw) To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Christoph Lameter wrote: > But the alternative is that the exiting process does not save its > data. > The same condition that occurs when there is a system-wide OOM, yes. Exclusive cpusets cannot be violated for such allocations outside of the obvious GFP_ATOMIC exception. > What is this very small exclusive cpuset? > That's arbitrary. The idea is that an exclusive cpuset should not encounter memory pressure because another exclusive cpuset encountered an OOM condition because its zones happened to be higher on the zonelist. Notice how, without this change, it's possible to allocate on a node outside our mems_allowed before we use our own memory reserves. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:25 ` David Rientjes @ 2007-06-05 23:32 ` Christoph Lameter 2007-06-05 23:44 ` David Rientjes 0 siblings, 1 reply; 30+ messages in thread From: Christoph Lameter @ 2007-06-05 23:32 UTC (permalink / raw) To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, David Rientjes wrote: > > But the alternative is that the exiting process does not save its > > data. > The same condition that occurs when there is a system-wide OOM, yes. > Exclusive cpusets cannot be violated for such allocations outside of the > obvious GFP_ATOMIC exception. But with the patch the process would be able to terminate. There is no global OOM situation. If there would be a global OOM situation then TIF_MEMDIE would not help. > > What is this very small exclusive cpuset? > That's arbitrary. The idea is that an exclusive cpuset should not > encounter memory pressure because another exclusive cpuset encountered an > OOM condition because its zones happened to be higher on the zonelist. > Notice how, without this change, it's possible to allocate on a node > outside our mems_allowed before we use our own memory reserves. So its seems that the patch is addressing an imagined situation? I think the allocation outside of our mems_allowed is fine when it serves to terminate the process and thereby release resources. It is certainly better than having the process corrupt data by only partially writing back its data. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:32 ` Christoph Lameter @ 2007-06-05 23:44 ` David Rientjes 2007-06-05 23:55 ` Paul Jackson 2007-06-05 23:57 ` Christoph Lameter 0 siblings, 2 replies; 30+ messages in thread From: David Rientjes @ 2007-06-05 23:44 UTC (permalink / raw) To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Christoph Lameter wrote: > But with the patch the process would be able to terminate. There is no > global OOM situation. If there would be a global OOM situation then > TIF_MEMDIE would not help. > Sure it would, it would have access to memory reserves because of the change in watermarks through get_page_from_freelist(). If that fails, we can't allocate elsewhere because then we have taken exclusive memory from other applications and is contrary to the definition of mem_exclusive. You need to construct your cpuset hierarchy with these scenarios in mind; when you ask for an exclusive cpuset, it shouldn't come with a disclaimer that says "if another cpuset that is also exclusive happens to OOM, we'll steal your memory anyway and it's not our problem if the dying task gets stuck in D state and doesn't exit synchronously or reliably because all we did was send it a SIGKILL." > So its seems that the patch is addressing an imagined situation? > No, it's returning us to the previous logic where an exclusive cpuset was actually exclusive. And, again, without this change it is possible to allocate in other exclusive cpusets without first exhausting your own memory reserves. That's wrong. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:44 ` David Rientjes @ 2007-06-05 23:55 ` Paul Jackson 2007-06-06 1:17 ` David Rientjes 2007-06-05 23:57 ` Christoph Lameter 1 sibling, 1 reply; 30+ messages in thread From: Paul Jackson @ 2007-06-05 23:55 UTC (permalink / raw) To: David Rientjes; +Cc: clameter, akpm, ak, clameter, linux-kernel > If that fails, we can't allocate elsewhere because then we have taken > exclusive memory from other applications and is contrary to the definition > of mem_exclusive. Well, I can't speak to the 'real' meaning of TIF_MEMDIE with authority, but I can speak to the meaning of cpuset flags. The mem_exclusive flag doesn't mean this. It means that you cannot overlap the memory of a sibling cpuset. You will, necessarily, still overlap the memory of your ancestor cpusets. Whether or not you make any use of the mem_exclusive flag, you still get the same (limited) guarantees of memory usage -- namely that your memory won't be used by tasks in non-overlapping cpusets, with some exceptions, such as: 1) memory handed out to interrupt code, 2) memory handed out for GFP_ATOMIC requests, and 3) tasks marked PF_EXITING -- will soon free up memory Tasks in cpusets ancestor to your tasks cpuset can always, easily, use memory on the same nodes your task is on. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:55 ` Paul Jackson @ 2007-06-06 1:17 ` David Rientjes 2007-06-06 1:20 ` Paul Jackson 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2007-06-06 1:17 UTC (permalink / raw) To: Paul Jackson; +Cc: clameter, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Paul Jackson wrote: > Well, I can't speak to the 'real' meaning of TIF_MEMDIE with authority, > but I can speak to the meaning of cpuset flags. > > The mem_exclusive flag doesn't mean this. > > It means that you cannot overlap the memory of a sibling cpuset. > Which, with this patch, we will respect for tasks marked TIF_MEMDIE as well. > You will, necessarily, still overlap the memory of your ancestor cpusets. > And that's what nearest_exclusive_ancestor() determines later on if we're not requesting GFP_HARDWALL. > Whether or not you make any use of the mem_exclusive flag, you still > get the same (limited) guarantees of memory usage -- namely that your > memory won't be used by tasks in non-overlapping cpusets, with some > exceptions, such as: > 1) memory handed out to interrupt code, > 2) memory handed out for GFP_ATOMIC requests, and > 3) tasks marked PF_EXITING -- will soon free up memory > This is precisely the point: we already respect PF_EXITING tasks with their ability to allocate outside their own cpuset. That gets set in do_exit() when a task is in receipt of the SIGKILL from the OOM killer during the exit path. Between these time periods (the time when we issue the OOM SIGKILL and the time we're marked PF_EXITING in do_exit()), we should not allow allocations outside of our cpuset because we do not yet have the guarantee that they will exit synchronously or reliably. > Tasks in cpusets ancestor to your tasks cpuset can always, easily, > use memory on the same nodes your task is on. > Sure, that behavior is unchanged. We're relying on nearest_exclusive_ancestor() to determine if such nodes overlap. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 1:17 ` David Rientjes @ 2007-06-06 1:20 ` Paul Jackson 0 siblings, 0 replies; 30+ messages in thread From: Paul Jackson @ 2007-06-06 1:20 UTC (permalink / raw) To: David Rientjes; +Cc: clameter, akpm, ak, clameter, linux-kernel > Sure, that behavior is unchanged. We're relying on > nearest_exclusive_ancestor() to determine if such nodes overlap. Ok ... my points on cpusets semantics having been heard, I stand back down on the matter of memory semantics, where I am not the master. Thanks. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:44 ` David Rientjes 2007-06-05 23:55 ` Paul Jackson @ 2007-06-05 23:57 ` Christoph Lameter 2007-06-06 1:23 ` David Rientjes 1 sibling, 1 reply; 30+ messages in thread From: Christoph Lameter @ 2007-06-05 23:57 UTC (permalink / raw) To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, David Rientjes wrote: > If that fails, we can't allocate elsewhere because then we have taken > exclusive memory from other applications and is contrary to the definition > of mem_exclusive. You need to construct your cpuset hierarchy with these > scenarios in mind; when you ask for an exclusive cpuset, it shouldn't come > with a disclaimer that says "if another cpuset that is also exclusive > happens to OOM, we'll steal your memory anyway and it's not our problem if > the dying task gets stuck in D state and doesn't exit synchronously or > reliably because all we did was send it a SIGKILL." Exclusive is not as absolute as you may think. There is also the GFP_KERNEL exception. Processes stuck in D state is another issue with reliability. > > So its seems that the patch is addressing an imagined situation? > No, it's returning us to the previous logic where an exclusive cpuset was > actually exclusive. > > And, again, without this change it is possible to allocate in other > exclusive cpusets without first exhausting your own memory reserves. > That's wrong. That is already occurring with GFP_KERNEL. So your patch really does not have the purifying effect on exclusivity that you expect. This looks all more like hunting for elusive idealistic cpuset behavior to me. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 23:57 ` Christoph Lameter @ 2007-06-06 1:23 ` David Rientjes 2007-06-06 1:32 ` Christoph Lameter 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2007-06-06 1:23 UTC (permalink / raw) To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Christoph Lameter wrote: > Exclusive is not as absolute as you may think. There is also the > GFP_KERNEL exception. > Memory exclusivity with respect to cpusets should guarantee that memory nodes do not overlap with siblings if they are marked with mems_exclusive. The patch simply preserves that behavior through the time period between when the OOM killer issues a SIGKILL and the task is exiting and marked with PF_EXITING. Obviously GFP_KERNEL allocations can allocate regardless of our memory exclusivity, but the point is that a job in one exclusive cpuset should not have the ability to effect the performance (in terms of reclaim and swap), memory usage, or survival of jobs in other exclusive cpusets because it was out of memory. > Processes stuck in D state is another issue with reliability. > But it's a reality that we need to respect. It happens and when it does it has the potential to hamper other cpusets that we setup to be exclusive themselves. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 1:23 ` David Rientjes @ 2007-06-06 1:32 ` Christoph Lameter 2007-06-06 1:40 ` David Rientjes 0 siblings, 1 reply; 30+ messages in thread From: Christoph Lameter @ 2007-06-06 1:32 UTC (permalink / raw) To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, David Rientjes wrote: > Obviously GFP_KERNEL allocations can allocate regardless of our memory > exclusivity, but the point is that a job in one exclusive cpuset should > not have the ability to effect the performance (in terms of reclaim and > swap), memory usage, or survival of jobs in other exclusive cpusets > because it was out of memory. Right but the process is terminating thus only requiring limited resources to get finished. The process does not have the ability to affect the other cpusets. I.e. it cannot directly allocate outside of the cpuset. The system has that capability and the system is handling the termination of the process and should terminate the process in a clean way if possible. > > Processes stuck in D state is another issue with reliability. > But it's a reality that we need to respect. It happens and when it does > it has the potential to hamper other cpusets that we setup to be exclusive > themselves. The fact that process hang because of some software deficiency is an exceptional failure scenario. The system in general is not fully operational anymore anyways. Typically one or the other lock is held which makes certain kernel functionality inaccessible. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 1:32 ` Christoph Lameter @ 2007-06-06 1:40 ` David Rientjes 2007-06-06 1:54 ` Christoph Lameter 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2007-06-06 1:40 UTC (permalink / raw) To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Christoph Lameter wrote: > On Tue, 5 Jun 2007, David Rientjes wrote: > > > Obviously GFP_KERNEL allocations can allocate regardless of our memory > > exclusivity, but the point is that a job in one exclusive cpuset should > > not have the ability to effect the performance (in terms of reclaim and > > swap), memory usage, or survival of jobs in other exclusive cpusets > > because it was out of memory. > > Right but the process is terminating thus only requiring limited resources > to get finished. The process does not have the ability to affect the other > cpusets. I.e. it cannot directly allocate outside of the cpuset. The > system has that capability and the system is handling the termination of > the process and should terminate the process in a clean way if possible. > And those limited resources should be available in the difference between low and no watermarks as defined by each zone in the cpuset's mems_allowed. Regardless, we should not allow allocations outside of the cpuset because we have marked it TIF_MEMDIE and we don't have any explicit guarantee that it is exiting yet and not mlock'ing an excessive amount of memory that exceeds the capacity of all cpuset nodes. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 1:40 ` David Rientjes @ 2007-06-06 1:54 ` Christoph Lameter 2007-06-06 3:29 ` David Rientjes 0 siblings, 1 reply; 30+ messages in thread From: Christoph Lameter @ 2007-06-06 1:54 UTC (permalink / raw) To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, David Rientjes wrote: > mems_allowed. Regardless, we should not allow allocations outside of the > cpuset because we have marked it TIF_MEMDIE and we don't have any explicit > guarantee that it is exiting yet and not mlock'ing an excessive amount of > memory that exceeds the capacity of all cpuset nodes. Hmmmm... But we have sent it a SIGKILL. If the process is following conventions then it is exiting. Of course the process could be abusing the system and attempting to OOM the whole system as an act of revenge for being killed but isnt this a bit far fetched? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 1:54 ` Christoph Lameter @ 2007-06-06 3:29 ` David Rientjes 0 siblings, 0 replies; 30+ messages in thread From: David Rientjes @ 2007-06-06 3:29 UTC (permalink / raw) To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Tue, 5 Jun 2007, Christoph Lameter wrote: > Hmmmm... But we have sent it a SIGKILL. If the process is following > conventions then it is exiting. Of course the process could be abusing the > system and attempting to OOM the whole system as an act of revenge for > being killed but isnt this a bit far fetched? > It's not abusing the system because it was killed, it was killed because it was abusing the system and attempted to mlock more memory than allowed in its exclusive mems. So we OOM the task but it can continue mlock'ing memory and intrude on the memory of other cpusets because of the TIF_MEMDIE exception and our SIGKILL hasn't been handled yet. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes 2007-06-05 22:40 ` Christoph Lameter 2007-06-05 23:01 ` Paul Jackson @ 2007-06-06 6:20 ` Peter Zijlstra 2007-06-06 6:42 ` David Rientjes 2 siblings, 1 reply; 30+ messages in thread From: Peter Zijlstra @ 2007-06-06 6:20 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel On Tue, 2007-06-05 at 15:39 -0700, David Rientjes wrote: > Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5. > > OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access > memory outside its cpuset because it could potentially cause other > exclusive cpusets to OOM themselves. > > Cc: Andi Kleen <ak@suse.de> > Cc: Christoph Lameter <clameter@engr.sgi.com> > Cc: Paul Jackson <pj@sgi.com> > Signed-off-by: David Rientjes <rientjes@google.com> > --- > kernel/cpuset.c | 6 ------ > 1 files changed, 0 insertions(+), 6 deletions(-) > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c > --- a/kernel/cpuset.c > +++ b/kernel/cpuset.c > @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask) > might_sleep_if(!(gfp_mask & __GFP_HARDWALL)); > if (node_isset(node, current->mems_allowed)) > return 1; > - /* > - * Allow tasks that have access to memory reserves because they have > - * been OOM killed to get memory anywhere. > - */ > - if (unlikely(test_thread_flag(TIF_MEMDIE))) > - return 1; > if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */ > return 0; > This seems a little pointless, since cpuset_zone_allowed_softwall() is only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS allocations opened up by TIF_MEMDIE don't use that. Also I agree with Christoph's reasoning; this is not for the application but for the system. Hence the cpuset does not get violated for the application [ something I tried to argue before, glad Christoph now agrees with me :-) ]. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 6:20 ` Peter Zijlstra @ 2007-06-06 6:42 ` David Rientjes 2007-06-06 7:09 ` Peter Zijlstra 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2007-06-06 6:42 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel On Wed, 6 Jun 2007, Peter Zijlstra wrote: > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c > > --- a/kernel/cpuset.c > > +++ b/kernel/cpuset.c > > @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask) > > might_sleep_if(!(gfp_mask & __GFP_HARDWALL)); > > if (node_isset(node, current->mems_allowed)) > > return 1; > > - /* > > - * Allow tasks that have access to memory reserves because they have > > - * been OOM killed to get memory anywhere. > > - */ > > - if (unlikely(test_thread_flag(TIF_MEMDIE))) > > - return 1; > > if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */ > > return 0; > > > > This seems a little pointless, since cpuset_zone_allowed_softwall() is > only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS > allocations opened up by TIF_MEMDIE don't use that. > That's the change. Memory reserves on a per-zone level would now be used with respect to ALLOC_NO_WATERMARKS because TIF_MEMDIE tasks no longer have an explicit bypass for them. If the node is not in the task's mems_allowed and it's a __GFP_HARDWALL allocation or if it's neither PF_EXITING or in the nearest_exclusive_ancestor() of that task's cpuset, then we move on to the next zone in get_page_from_freelist(). The problem the patch addresses is that, as you mentioned, tasks with TIF_MEMDIE have an explicit bypass over using any memory reserves and can thus, depending on zonelist ordering, allocate first on nodes outside its mems_allowed even in the case of an exclusive cpuset before exhausting its own memory. That behavior is wrong. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 6:42 ` David Rientjes @ 2007-06-06 7:09 ` Peter Zijlstra 2007-06-06 7:18 ` David Rientjes 0 siblings, 1 reply; 30+ messages in thread From: Peter Zijlstra @ 2007-06-06 7:09 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel On Tue, 2007-06-05 at 23:42 -0700, David Rientjes wrote: > On Wed, 6 Jun 2007, Peter Zijlstra wrote: > > > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c > > > --- a/kernel/cpuset.c > > > +++ b/kernel/cpuset.c > > > @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask) > > > might_sleep_if(!(gfp_mask & __GFP_HARDWALL)); > > > if (node_isset(node, current->mems_allowed)) > > > return 1; > > > - /* > > > - * Allow tasks that have access to memory reserves because they have > > > - * been OOM killed to get memory anywhere. > > > - */ > > > - if (unlikely(test_thread_flag(TIF_MEMDIE))) > > > - return 1; > > > if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */ > > > return 0; > > > > > > > This seems a little pointless, since cpuset_zone_allowed_softwall() is > > only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS > > allocations opened up by TIF_MEMDIE don't use that. > > > > That's the change. Memory reserves on a per-zone level would now be used > with respect to ALLOC_NO_WATERMARKS because TIF_MEMDIE tasks no longer > have an explicit bypass for them. If the node is not in the task's > mems_allowed and it's a __GFP_HARDWALL allocation or if it's neither > PF_EXITING or in the nearest_exclusive_ancestor() of that task's cpuset, > then we move on to the next zone in get_page_from_freelist(). > > The problem the patch addresses is that, as you mentioned, tasks with > TIF_MEMDIE have an explicit bypass over using any memory reserves and can > thus, depending on zonelist ordering, allocate first on nodes outside its > mems_allowed even in the case of an exclusive cpuset before exhausting its > own memory. That behavior is wrong. Right, I see your point; however considering that its a system allocation, and all these constraints get violated by interrupts anyway, its more of an application container than a strict allocation container. Are you actually seeing the described behaviour, or just being pedantic (nothing wrong with that per-se)? I would actually be bold and make it worse by proposing something like this, which has the benefit of making the reserve threshold system wide. --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1eef614..870a791 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1630,6 +1630,21 @@ rebalance: && !in_interrupt()) { if (!(gfp_mask & __GFP_NOMEMALLOC)) { nofail_alloc: + /* + * break out of mempolicy boundaries + */ + zonelist = NODE_DATA(numa_node_id())->node_zonelists + + gfp_zone(gfp_mask); + + /* + * Before going bare metal, try to get a page above the + * critical threshold - ignoring CPU sets. + */ + page = get_page_from_freelist(gfp_mask, order, zonelist, + ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER); + if (page) + goto got_pg; + /* go through the zonelist yet again, ignoring mins */ page = get_page_from_freelist(gfp_mask, order, zonelist, ALLOC_NO_WATERMARKS); ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:09 ` Peter Zijlstra @ 2007-06-06 7:18 ` David Rientjes 2007-06-06 7:34 ` Paul Jackson 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2007-06-06 7:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel On Wed, 6 Jun 2007, Peter Zijlstra wrote: > Right, I see your point; however considering that its a system > allocation, and all these constraints get violated by interrupts anyway, > its more of an application container than a strict allocation container. > It is not necessarily system allocations at all. It is quite possible, as described earlier, to mlock a large quantity of memory that exceeds the capacity of all nodes in a task's mems_allowed and then when this task OOM's to continue allocating memory in get_user_pages() but with the TIF_MEMDIE exception that allows it to allocate anywhere. Then other cpusets get their memory infringed upon and they can OOM themselves even though there was no preexisting memory pressure. This happens before we return to userspace in the mlock() and thus we cannot properly handle the SIGKILL sent down from the OOM killer. > Are you actually seeing the described behaviour, or just being pedantic > (nothing wrong with that per-se)? > Yes, as described above. An exclusive cpuset containing only root, system-critical tasks was OOM'd when it was not under any memory pressure at all because a separate exclusive cpuset mlock'd a gigantic amount of memory and it could not reliably exit because the mlock continued to allocate outside its own cpuset and eventually OOM'd system-critical tasks or depleated all system memory. True story. At the least, if this patch is not agreed upon, I would suggest adding sanity checks against gfp_mask in cpuset_zone_allowed() to ensure we should allow the allocation in TIF_MEMDIE circumstances to avoid the above scenario. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:18 ` David Rientjes @ 2007-06-06 7:34 ` Paul Jackson 2007-06-06 7:39 ` Andrew Morton 2007-06-06 7:48 ` David Rientjes 0 siblings, 2 replies; 30+ messages in thread From: Paul Jackson @ 2007-06-06 7:34 UTC (permalink / raw) To: David Rientjes; +Cc: peterz, akpm, ak, clameter, linux-kernel > a separate exclusive cpuset mlock'd a gigantic amount of > memory and it could not reliably exit because the mlock continued to > allocate outside its own cpuset and eventually OOM'd system-critical tasks > or depleated all system memory. Seems like that mlock code is able then to get great globs of memory without returning to user space ... perhaps that's where the fix should be ... that code should quit chewing up memory if it's marked MEMDIE or some such? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:34 ` Paul Jackson @ 2007-06-06 7:39 ` Andrew Morton 2007-06-06 7:48 ` David Rientjes 1 sibling, 0 replies; 30+ messages in thread From: Andrew Morton @ 2007-06-06 7:39 UTC (permalink / raw) To: Paul Jackson; +Cc: David Rientjes, peterz, ak, clameter, linux-kernel On Wed, 6 Jun 2007 00:34:21 -0700 Paul Jackson <pj@sgi.com> wrote: > > a separate exclusive cpuset mlock'd a gigantic amount of > > memory and it could not reliably exit because the mlock continued to > > allocate outside its own cpuset and eventually OOM'd system-critical tasks > > or depleated all system memory. > > Seems like that mlock code is able then to get great globs of memory > without returning to user space ... perhaps that's where the fix > should be ... that code should quit chewing up memory if it's > marked MEMDIE or some such? yup. A fix for that is in the pipeline: bale from get_user_pages() if TIF_MEMDIE is set. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:34 ` Paul Jackson 2007-06-06 7:39 ` Andrew Morton @ 2007-06-06 7:48 ` David Rientjes 2007-06-06 7:56 ` Paul Jackson 2007-06-06 7:56 ` Peter Zijlstra 1 sibling, 2 replies; 30+ messages in thread From: David Rientjes @ 2007-06-06 7:48 UTC (permalink / raw) To: Paul Jackson; +Cc: peterz, akpm, ak, clameter, linux-kernel On Wed, 6 Jun 2007, Paul Jackson wrote: > Seems like that mlock code is able then to get great globs of memory > without returning to user space ... perhaps that's where the fix > should be ... that code should quit chewing up memory if it's > marked MEMDIE or some such? > That's one case. Are there others? The TIF_MEMDIE exception in cpuset_zone_allowed_softwall() allowed this problem in mlock(). If it had not been allowed to allocate anywhere based simply on the zonelist ordering, the mlock iteration would break because it could not handle the fault. Thus, at the least, we should make sure that memory is not allocated outside of a task's mems_allowed unless we do sanity checks against gfp_mask in the TIF_MEMDIE case via cpuset_zone_allowed_softwall() to make sure a rouge application doesn't cause the same trouble. That is, unless you can guarantee this type of problem will not happen again through any other means. The logic needs to be with the TIF_MEMDIE exception to grant access to memory outside the cpuset only when it is relevant to the OOM killed task's prompt exit. David ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:48 ` David Rientjes @ 2007-06-06 7:56 ` Paul Jackson 2007-06-06 8:00 ` Andrew Morton 2007-06-06 7:56 ` Peter Zijlstra 1 sibling, 1 reply; 30+ messages in thread From: Paul Jackson @ 2007-06-06 7:56 UTC (permalink / raw) To: David Rientjes; +Cc: peterz, akpm, ak, clameter, linux-kernel David wrote: > That is, unless you can guarantee this type of problem will not happen again Well, I certainly cannot guarantee that. Heck, I can't even guarantee isn't happening right now, somewhere else. But I'm no memory guru. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:56 ` Paul Jackson @ 2007-06-06 8:00 ` Andrew Morton 2007-06-06 8:03 ` Peter Zijlstra 0 siblings, 1 reply; 30+ messages in thread From: Andrew Morton @ 2007-06-06 8:00 UTC (permalink / raw) To: Paul Jackson; +Cc: David Rientjes, peterz, ak, clameter, linux-kernel On Wed, 6 Jun 2007 00:56:10 -0700 Paul Jackson <pj@sgi.com> wrote: > David wrote: > > That is, unless you can guarantee this type of problem will not happen again > > Well, I certainly cannot guarantee that. The only place I can think of where the kernel will sit there allocating huge amounts of memory like this is in readahead. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 8:00 ` Andrew Morton @ 2007-06-06 8:03 ` Peter Zijlstra 0 siblings, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2007-06-06 8:03 UTC (permalink / raw) To: Andrew Morton; +Cc: Paul Jackson, David Rientjes, ak, clameter, linux-kernel On Wed, 2007-06-06 at 01:00 -0700, Andrew Morton wrote: > On Wed, 6 Jun 2007 00:56:10 -0700 Paul Jackson <pj@sgi.com> wrote: > > > David wrote: > > > That is, unless you can guarantee this type of problem will not happen again > > > > Well, I certainly cannot guarantee that. > > The only place I can think of where the kernel will sit there allocating > huge amounts of memory like this is in readahead. We just created one more, the execve string copy. However that uses get_user_pages(), so if that gets fixed we're save there too. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally 2007-06-06 7:48 ` David Rientjes 2007-06-06 7:56 ` Paul Jackson @ 2007-06-06 7:56 ` Peter Zijlstra 1 sibling, 0 replies; 30+ messages in thread From: Peter Zijlstra @ 2007-06-06 7:56 UTC (permalink / raw) To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel On Wed, 2007-06-06 at 00:48 -0700, David Rientjes wrote: > On Wed, 6 Jun 2007, Paul Jackson wrote: > > > Seems like that mlock code is able then to get great globs of memory > > without returning to user space ... perhaps that's where the fix > > should be ... that code should quit chewing up memory if it's > > marked MEMDIE or some such? > > > > That's one case. Are there others? > > The TIF_MEMDIE exception in cpuset_zone_allowed_softwall() allowed this > problem in mlock(). If it had not been allowed to allocate anywhere > based simply on the zonelist ordering, the mlock iteration would break > because it could not handle the fault. > > Thus, at the least, we should make sure that memory is not allocated > outside of a task's mems_allowed unless we do sanity checks against > gfp_mask in the TIF_MEMDIE case via cpuset_zone_allowed_softwall() to make > sure a rouge application doesn't cause the same trouble. That is, unless > you can guarantee this type of problem will not happen again through any > other means. The logic needs to be with the TIF_MEMDIE exception to grant > access to memory outside the cpuset only when it is relevant to the OOM > killed task's prompt exit. I don't think your patch alone would have been sufficient. With it it would have depleted the local reserves and then jumped onwards to other nodes (since the ALLOC_NO_WATERMARKS allocation doesn't have ALLOC_CPUSET). Unless there was a mem-policy restricting the zonelist (not sure if cpusets and mem-policies are independent like that) But your point stands. ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2007-06-06 8:03 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes 2007-06-05 22:40 ` Christoph Lameter 2007-06-05 22:42 ` David Rientjes 2007-06-05 23:01 ` Paul Jackson 2007-06-05 23:16 ` David Rientjes 2007-06-05 23:19 ` Paul Jackson 2007-06-05 23:20 ` Christoph Lameter 2007-06-05 23:25 ` David Rientjes 2007-06-05 23:32 ` Christoph Lameter 2007-06-05 23:44 ` David Rientjes 2007-06-05 23:55 ` Paul Jackson 2007-06-06 1:17 ` David Rientjes 2007-06-06 1:20 ` Paul Jackson 2007-06-05 23:57 ` Christoph Lameter 2007-06-06 1:23 ` David Rientjes 2007-06-06 1:32 ` Christoph Lameter 2007-06-06 1:40 ` David Rientjes 2007-06-06 1:54 ` Christoph Lameter 2007-06-06 3:29 ` David Rientjes 2007-06-06 6:20 ` Peter Zijlstra 2007-06-06 6:42 ` David Rientjes 2007-06-06 7:09 ` Peter Zijlstra 2007-06-06 7:18 ` David Rientjes 2007-06-06 7:34 ` Paul Jackson 2007-06-06 7:39 ` Andrew Morton 2007-06-06 7:48 ` David Rientjes 2007-06-06 7:56 ` Paul Jackson 2007-06-06 8:00 ` Andrew Morton 2007-06-06 8:03 ` Peter Zijlstra 2007-06-06 7:56 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox