public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
@ 2007-06-05 22:39 David Rientjes
  2007-06-05 22:40 ` Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-05 22:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Christoph Lameter, Paul Jackson, linux-kernel

Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5.

OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access 
memory outside its cpuset because it could potentially cause other 
exclusive cpusets to OOM themselves.

Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 kernel/cpuset.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
 	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
 	if (node_isset(node, current->mems_allowed))
 		return 1;
-	/*
-	 * Allow tasks that have access to memory reserves because they have
-	 * been OOM killed to get memory anywhere.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE)))
-		return 1;
 	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
 		return 0;
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes
@ 2007-06-05 22:40 ` Christoph Lameter
  2007-06-05 22:42   ` David Rientjes
  2007-06-05 23:01 ` Paul Jackson
  2007-06-06  6:20 ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-06-05 22:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson,
	linux-kernel

On Tue, 5 Jun 2007, David Rientjes wrote:

> Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5.
> 
> OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access 
> memory outside its cpuset because it could potentially cause other 
> exclusive cpusets to OOM themselves.

Those terminating tasks need that memory to terminate properly right?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 22:40 ` Christoph Lameter
@ 2007-06-05 22:42   ` David Rientjes
  0 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-05 22:42 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson,
	linux-kernel

On Tue, 5 Jun 2007, Christoph Lameter wrote:

> On Tue, 5 Jun 2007, David Rientjes wrote:
> 
> > Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5.
> > 
> > OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access 
> > memory outside its cpuset because it could potentially cause other 
> > exclusive cpusets to OOM themselves.
> 
> Those terminating tasks need that memory to terminate properly right?
> 

No, they need access only to the memory reserves provided by the 
difference in their low and no watermarks on a per-cpuset basis.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes
  2007-06-05 22:40 ` Christoph Lameter
@ 2007-06-05 23:01 ` Paul Jackson
  2007-06-05 23:16   ` David Rientjes
  2007-06-06  6:20 ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2007-06-05 23:01 UTC (permalink / raw)
  To: David Rientjes; +Cc: akpm, ak, clameter, linux-kernel

> OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access 
> memory outside its cpuset because it could potentially cause other 
> exclusive cpusets to OOM themselves.

I'm a little surprised at this suggested change -- I'd have thought
that it was a good idea to let tasks marked for extinction get memory
anywhere, as they were going to use that memory to exit, and free up
lots more memory.

I'm pretty sure we have this same policy in other places in the
kernel, besides cpusets.  Did you intend to change them too?

If a MEMDIE task is taking enough memory to OOM other tasks anywhere
in the system, then doesn't that mean your entire system was in deep
yogurt, and we're just haggling over who to blame for the upcoming
crash?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:01 ` Paul Jackson
@ 2007-06-05 23:16   ` David Rientjes
  2007-06-05 23:19     ` Paul Jackson
  2007-06-05 23:20     ` Christoph Lameter
  0 siblings, 2 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-05 23:16 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Paul Jackson wrote:

> I'm a little surprised at this suggested change -- I'd have thought
> that it was a good idea to let tasks marked for extinction get memory
> anywhere, as they were going to use that memory to exit, and free up
> lots more memory.
> 

The intended purpose of TIF_MEMDIE was to allocate pages without being 
bound by the watermarks so that they have access to memory reserves on the 
per-zone level.  If the cpuset doesn't have access to a zone, whether it's 
memory reserve or not, it shouldn't allocate there.

> I'm pretty sure we have this same policy in other places in the
> kernel, besides cpusets.  Did you intend to change them too?
> 

You'd have to cite them first.

> If a MEMDIE task is taking enough memory to OOM other tasks anywhere
> in the system, then doesn't that mean your entire system was in deep
> yogurt, and we're just haggling over who to blame for the upcoming
> crash?
> 

No, it means that it can allocate anywhere based on the zonelist ordering 
and then can OOM a very small exclusive cpuset that would never have had 
any memory pressure if it wasn't violated.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:16   ` David Rientjes
@ 2007-06-05 23:19     ` Paul Jackson
  2007-06-05 23:20     ` Christoph Lameter
  1 sibling, 0 replies; 30+ messages in thread
From: Paul Jackson @ 2007-06-05 23:19 UTC (permalink / raw)
  To: David Rientjes; +Cc: akpm, ak, clameter, linux-kernel

> The intended purpose of TIF_MEMDIE was to allocate pages without being 

Ok then ... you probably right.  I'll stand down.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:16   ` David Rientjes
  2007-06-05 23:19     ` Paul Jackson
@ 2007-06-05 23:20     ` Christoph Lameter
  2007-06-05 23:25       ` David Rientjes
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-06-05 23:20 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, David Rientjes wrote:

> No, it means that it can allocate anywhere based on the zonelist ordering 
> and then can OOM a very small exclusive cpuset that would never have had 
> any memory pressure if it wasn't violated.

But the alternative is that the exiting process does not save its 
data.

What is this very small exclusive cpuset?



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:20     ` Christoph Lameter
@ 2007-06-05 23:25       ` David Rientjes
  2007-06-05 23:32         ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2007-06-05 23:25 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Christoph Lameter wrote:

> But the alternative is that the exiting process does not save its 
> data.
> 

The same condition that occurs when there is a system-wide OOM, yes.  
Exclusive cpusets cannot be violated for such allocations outside of the 
obvious GFP_ATOMIC exception.

> What is this very small exclusive cpuset?
> 

That's arbitrary.  The idea is that an exclusive cpuset should not 
encounter memory pressure because another exclusive cpuset encountered an 
OOM condition because its zones happened to be higher on the zonelist.  
Notice how, without this change, it's possible to allocate on a node 
outside our mems_allowed before we use our own memory reserves.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:25       ` David Rientjes
@ 2007-06-05 23:32         ` Christoph Lameter
  2007-06-05 23:44           ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-06-05 23:32 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, David Rientjes wrote:

> > But the alternative is that the exiting process does not save its 
> > data.
> The same condition that occurs when there is a system-wide OOM, yes.  
> Exclusive cpusets cannot be violated for such allocations outside of the 
> obvious GFP_ATOMIC exception.

But with the patch the process would be able to terminate. There is no 
global OOM situation. If there would be a global OOM situation then 
TIF_MEMDIE would not help.

> > What is this very small exclusive cpuset?
> That's arbitrary.  The idea is that an exclusive cpuset should not 
> encounter memory pressure because another exclusive cpuset encountered an 
> OOM condition because its zones happened to be higher on the zonelist.  
> Notice how, without this change, it's possible to allocate on a node 
> outside our mems_allowed before we use our own memory reserves.

So its seems that the patch is addressing an imagined situation?

I think the allocation outside of our mems_allowed is fine when it serves 
to terminate the process and thereby release resources. It is certainly 
better than having the process corrupt data by only partially writing back 
its data.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:32         ` Christoph Lameter
@ 2007-06-05 23:44           ` David Rientjes
  2007-06-05 23:55             ` Paul Jackson
  2007-06-05 23:57             ` Christoph Lameter
  0 siblings, 2 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-05 23:44 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Christoph Lameter wrote:

> But with the patch the process would be able to terminate. There is no 
> global OOM situation. If there would be a global OOM situation then 
> TIF_MEMDIE would not help.
> 

Sure it would, it would have access to memory reserves because of the 
change in watermarks through get_page_from_freelist().

If that fails, we can't allocate elsewhere because then we have taken 
exclusive memory from other applications and is contrary to the definition 
of mem_exclusive.  You need to construct your cpuset hierarchy with these 
scenarios in mind; when you ask for an exclusive cpuset, it shouldn't come 
with a disclaimer that says "if another cpuset that is also exclusive 
happens to OOM, we'll steal your memory anyway and it's not our problem if 
the dying task gets stuck in D state and doesn't exit synchronously or 
reliably because all we did was send it a SIGKILL."

> So its seems that the patch is addressing an imagined situation?
> 

No, it's returning us to the previous logic where an exclusive cpuset was 
actually exclusive.

And, again, without this change it is possible to allocate in other 
exclusive cpusets without first exhausting your own memory reserves.  
That's wrong.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:44           ` David Rientjes
@ 2007-06-05 23:55             ` Paul Jackson
  2007-06-06  1:17               ` David Rientjes
  2007-06-05 23:57             ` Christoph Lameter
  1 sibling, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2007-06-05 23:55 UTC (permalink / raw)
  To: David Rientjes; +Cc: clameter, akpm, ak, clameter, linux-kernel

> If that fails, we can't allocate elsewhere because then we have taken 
> exclusive memory from other applications and is contrary to the definition 
> of mem_exclusive. 

Well, I can't speak to the 'real' meaning of TIF_MEMDIE with authority,
but I can speak to the meaning of cpuset flags.

The mem_exclusive flag doesn't mean this.

It means that you cannot overlap the memory of a sibling cpuset.

You will, necessarily, still overlap the memory of your ancestor cpusets.

Whether or not you make any use of the mem_exclusive flag, you still
get the same (limited) guarantees of memory usage -- namely that your
memory won't be used by tasks in non-overlapping cpusets, with some
exceptions, such as:
 1) memory handed out to interrupt code,
 2) memory handed out for GFP_ATOMIC requests, and
 3) tasks marked PF_EXITING -- will soon free up memory

Tasks in cpusets ancestor to your tasks cpuset can always, easily,
use memory on the same nodes your task is on.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:44           ` David Rientjes
  2007-06-05 23:55             ` Paul Jackson
@ 2007-06-05 23:57             ` Christoph Lameter
  2007-06-06  1:23               ` David Rientjes
  1 sibling, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-06-05 23:57 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, David Rientjes wrote:

> If that fails, we can't allocate elsewhere because then we have taken 
> exclusive memory from other applications and is contrary to the definition 
> of mem_exclusive.  You need to construct your cpuset hierarchy with these 
> scenarios in mind; when you ask for an exclusive cpuset, it shouldn't come 
> with a disclaimer that says "if another cpuset that is also exclusive 
> happens to OOM, we'll steal your memory anyway and it's not our problem if 
> the dying task gets stuck in D state and doesn't exit synchronously or 
> reliably because all we did was send it a SIGKILL."

Exclusive is not as absolute as you may think. There is also the 
GFP_KERNEL exception.

Processes stuck in D state is another issue with reliability.

> > So its seems that the patch is addressing an imagined situation?
> No, it's returning us to the previous logic where an exclusive cpuset was 
> actually exclusive.
> 
> And, again, without this change it is possible to allocate in other 
> exclusive cpusets without first exhausting your own memory reserves.  
> That's wrong.

That is already occurring with GFP_KERNEL. So your patch really does not 
have the purifying effect on exclusivity that you expect. This looks all 
more like hunting for elusive idealistic cpuset behavior to me.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:55             ` Paul Jackson
@ 2007-06-06  1:17               ` David Rientjes
  2007-06-06  1:20                 ` Paul Jackson
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2007-06-06  1:17 UTC (permalink / raw)
  To: Paul Jackson; +Cc: clameter, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Paul Jackson wrote:

> Well, I can't speak to the 'real' meaning of TIF_MEMDIE with authority,
> but I can speak to the meaning of cpuset flags.
> 
> The mem_exclusive flag doesn't mean this.
> 
> It means that you cannot overlap the memory of a sibling cpuset.
> 

Which, with this patch, we will respect for tasks marked TIF_MEMDIE as 
well.

> You will, necessarily, still overlap the memory of your ancestor cpusets.
> 

And that's what nearest_exclusive_ancestor() determines later on if we're 
not requesting GFP_HARDWALL.

> Whether or not you make any use of the mem_exclusive flag, you still
> get the same (limited) guarantees of memory usage -- namely that your
> memory won't be used by tasks in non-overlapping cpusets, with some
> exceptions, such as:
>  1) memory handed out to interrupt code,
>  2) memory handed out for GFP_ATOMIC requests, and
>  3) tasks marked PF_EXITING -- will soon free up memory
> 

This is precisely the point: we already respect PF_EXITING tasks with 
their ability to allocate outside their own cpuset.  That gets set in 
do_exit() when a task is in receipt of the SIGKILL from the OOM killer 
during the exit path.  Between these time periods (the time when we issue 
the OOM SIGKILL and the time we're marked PF_EXITING in do_exit()), we 
should not allow allocations outside of our cpuset because we do not yet 
have the guarantee that they will exit synchronously or reliably.

> Tasks in cpusets ancestor to your tasks cpuset can always, easily,
> use memory on the same nodes your task is on.
> 

Sure, that behavior is unchanged.  We're relying on 
nearest_exclusive_ancestor() to determine if such nodes overlap.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  1:17               ` David Rientjes
@ 2007-06-06  1:20                 ` Paul Jackson
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Jackson @ 2007-06-06  1:20 UTC (permalink / raw)
  To: David Rientjes; +Cc: clameter, akpm, ak, clameter, linux-kernel

> Sure, that behavior is unchanged.  We're relying on 
> nearest_exclusive_ancestor() to determine if such nodes overlap.

Ok ... my points on cpusets semantics having been heard,
I stand back down on the matter of memory semantics, where
I am not the master.

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 23:57             ` Christoph Lameter
@ 2007-06-06  1:23               ` David Rientjes
  2007-06-06  1:32                 ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2007-06-06  1:23 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Christoph Lameter wrote:

> Exclusive is not as absolute as you may think. There is also the 
> GFP_KERNEL exception.
> 

Memory exclusivity with respect to cpusets should guarantee that memory 
nodes do not overlap with siblings if they are marked with mems_exclusive.  
The patch simply preserves that behavior through the time period between 
when the OOM killer issues a SIGKILL and the task is exiting and marked 
with PF_EXITING.

Obviously GFP_KERNEL allocations can allocate regardless of our memory 
exclusivity, but the point is that a job in one exclusive cpuset should 
not have the ability to effect the performance (in terms of reclaim and 
swap), memory usage, or survival of jobs in other exclusive cpusets 
because it was out of memory.

> Processes stuck in D state is another issue with reliability.
> 

But it's a reality that we need to respect.  It happens and when it does
it has the potential to hamper other cpusets that we setup to be exclusive 
themselves.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  1:23               ` David Rientjes
@ 2007-06-06  1:32                 ` Christoph Lameter
  2007-06-06  1:40                   ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-06-06  1:32 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, David Rientjes wrote:

> Obviously GFP_KERNEL allocations can allocate regardless of our memory 
> exclusivity, but the point is that a job in one exclusive cpuset should 
> not have the ability to effect the performance (in terms of reclaim and 
> swap), memory usage, or survival of jobs in other exclusive cpusets 
> because it was out of memory.

Right but the process is terminating thus only requiring limited resources
to get finished. The process does not have the ability to affect the other 
cpusets. I.e. it cannot directly allocate outside of the cpuset. The 
system has that capability and the system is handling the termination of 
the process and should terminate the process in a clean way if possible.

> > Processes stuck in D state is another issue with reliability.
> But it's a reality that we need to respect.  It happens and when it does
> it has the potential to hamper other cpusets that we setup to be exclusive 
> themselves.

The fact that process hang because of some software deficiency is an 
exceptional failure scenario. The system in general is not fully 
operational anymore anyways. Typically one or the other lock is held which 
makes certain kernel functionality inaccessible.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  1:32                 ` Christoph Lameter
@ 2007-06-06  1:40                   ` David Rientjes
  2007-06-06  1:54                     ` Christoph Lameter
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2007-06-06  1:40 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Christoph Lameter wrote:

> On Tue, 5 Jun 2007, David Rientjes wrote:
> 
> > Obviously GFP_KERNEL allocations can allocate regardless of our memory 
> > exclusivity, but the point is that a job in one exclusive cpuset should 
> > not have the ability to effect the performance (in terms of reclaim and 
> > swap), memory usage, or survival of jobs in other exclusive cpusets 
> > because it was out of memory.
> 
> Right but the process is terminating thus only requiring limited resources
> to get finished. The process does not have the ability to affect the other 
> cpusets. I.e. it cannot directly allocate outside of the cpuset. The 
> system has that capability and the system is handling the termination of 
> the process and should terminate the process in a clean way if possible.
> 

And those limited resources should be available in the difference between 
low and no watermarks as defined by each zone in the cpuset's 
mems_allowed.  Regardless, we should not allow allocations outside of the 
cpuset because we have marked it TIF_MEMDIE and we don't have any explicit 
guarantee that it is exiting yet and not mlock'ing an excessive amount of 
memory that exceeds the capacity of all cpuset nodes.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  1:40                   ` David Rientjes
@ 2007-06-06  1:54                     ` Christoph Lameter
  2007-06-06  3:29                       ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Lameter @ 2007-06-06  1:54 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, David Rientjes wrote:

> mems_allowed.  Regardless, we should not allow allocations outside of the 
> cpuset because we have marked it TIF_MEMDIE and we don't have any explicit 
> guarantee that it is exiting yet and not mlock'ing an excessive amount of 
> memory that exceeds the capacity of all cpuset nodes.

Hmmmm... But we have sent it a SIGKILL. If the process is following 
conventions then it is exiting. Of course the process could be abusing the 
system and attempting to OOM the whole system as an act of revenge for 
being killed but isnt this a bit far fetched?




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  1:54                     ` Christoph Lameter
@ 2007-06-06  3:29                       ` David Rientjes
  0 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-06  3:29 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Tue, 5 Jun 2007, Christoph Lameter wrote:

> Hmmmm... But we have sent it a SIGKILL. If the process is following 
> conventions then it is exiting. Of course the process could be abusing the 
> system and attempting to OOM the whole system as an act of revenge for 
> being killed but isnt this a bit far fetched?
> 

It's not abusing the system because it was killed, it was killed because 
it was abusing the system and attempted to mlock more memory than allowed 
in its exclusive mems.  So we OOM the task but it can continue mlock'ing 
memory and intrude on the memory of other cpusets because of the 
TIF_MEMDIE exception and our SIGKILL hasn't been handled yet.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes
  2007-06-05 22:40 ` Christoph Lameter
  2007-06-05 23:01 ` Paul Jackson
@ 2007-06-06  6:20 ` Peter Zijlstra
  2007-06-06  6:42   ` David Rientjes
  2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2007-06-06  6:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson,
	linux-kernel

On Tue, 2007-06-05 at 15:39 -0700, David Rientjes wrote:
> Reverts git commit c596d9f320aaf30d28c1d793ff3a976dee1db8f5.
> 
> OOM-killed tasks, marked as TIF_MEMDIE, should not be able to access 
> memory outside its cpuset because it could potentially cause other 
> exclusive cpusets to OOM themselves.
> 
> Cc: Andi Kleen <ak@suse.de>
> Cc: Christoph Lameter <clameter@engr.sgi.com>
> Cc: Paul Jackson <pj@sgi.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  kernel/cpuset.c |    6 ------
>  1 files changed, 0 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
>  	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
>  	if (node_isset(node, current->mems_allowed))
>  		return 1;
> -	/*
> -	 * Allow tasks that have access to memory reserves because they have
> -	 * been OOM killed to get memory anywhere.
> -	 */
> -	if (unlikely(test_thread_flag(TIF_MEMDIE)))
> -		return 1;
>  	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
>  		return 0;
>  

This seems a little pointless, since cpuset_zone_allowed_softwall() is
only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS
allocations opened up by TIF_MEMDIE don't use that.

Also I agree with Christoph's reasoning; this is not for the application
but for the system. Hence the cpuset does not get violated for the
application [ something I tried to argue before, glad Christoph now
agrees with me :-) ].


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  6:20 ` Peter Zijlstra
@ 2007-06-06  6:42   ` David Rientjes
  2007-06-06  7:09     ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2007-06-06  6:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson,
	linux-kernel

On Wed, 6 Jun 2007, Peter Zijlstra wrote:

> > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > --- a/kernel/cpuset.c
> > +++ b/kernel/cpuset.c
> > @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
> >  	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> >  	if (node_isset(node, current->mems_allowed))
> >  		return 1;
> > -	/*
> > -	 * Allow tasks that have access to memory reserves because they have
> > -	 * been OOM killed to get memory anywhere.
> > -	 */
> > -	if (unlikely(test_thread_flag(TIF_MEMDIE)))
> > -		return 1;
> >  	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
> >  		return 0;
> >  
> 
> This seems a little pointless, since cpuset_zone_allowed_softwall() is
> only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS
> allocations opened up by TIF_MEMDIE don't use that.
> 

That's the change.  Memory reserves on a per-zone level would now be used 
with respect to ALLOC_NO_WATERMARKS because TIF_MEMDIE tasks no longer 
have an explicit bypass for them.  If the node is not in the task's 
mems_allowed and it's a __GFP_HARDWALL allocation or if it's neither
PF_EXITING or in the nearest_exclusive_ancestor() of that task's cpuset, 
then we move on to the next zone in get_page_from_freelist().

The problem the patch addresses is that, as you mentioned, tasks with 
TIF_MEMDIE have an explicit bypass over using any memory reserves and can 
thus, depending on zonelist ordering, allocate first on nodes outside its 
mems_allowed even in the case of an exclusive cpuset before exhausting its 
own memory.  That behavior is wrong.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  6:42   ` David Rientjes
@ 2007-06-06  7:09     ` Peter Zijlstra
  2007-06-06  7:18       ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2007-06-06  7:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson,
	linux-kernel

On Tue, 2007-06-05 at 23:42 -0700, David Rientjes wrote:
> On Wed, 6 Jun 2007, Peter Zijlstra wrote:
> 
> > > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > > --- a/kernel/cpuset.c
> > > +++ b/kernel/cpuset.c
> > > @@ -2431,12 +2431,6 @@ int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
> > >  	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> > >  	if (node_isset(node, current->mems_allowed))
> > >  		return 1;
> > > -	/*
> > > -	 * Allow tasks that have access to memory reserves because they have
> > > -	 * been OOM killed to get memory anywhere.
> > > -	 */
> > > -	if (unlikely(test_thread_flag(TIF_MEMDIE)))
> > > -		return 1;
> > >  	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
> > >  		return 0;
> > >  
> > 
> > This seems a little pointless, since cpuset_zone_allowed_softwall() is
> > only effective with ALLOC_CPUSET, and the ALLOC_NO_WATERMARKS
> > allocations opened up by TIF_MEMDIE don't use that.
> > 
> 
> That's the change.  Memory reserves on a per-zone level would now be used 
> with respect to ALLOC_NO_WATERMARKS because TIF_MEMDIE tasks no longer 
> have an explicit bypass for them.  If the node is not in the task's 
> mems_allowed and it's a __GFP_HARDWALL allocation or if it's neither
> PF_EXITING or in the nearest_exclusive_ancestor() of that task's cpuset, 
> then we move on to the next zone in get_page_from_freelist().
> 
> The problem the patch addresses is that, as you mentioned, tasks with 
> TIF_MEMDIE have an explicit bypass over using any memory reserves and can 
> thus, depending on zonelist ordering, allocate first on nodes outside its 
> mems_allowed even in the case of an exclusive cpuset before exhausting its 
> own memory.  That behavior is wrong.

Right, I see your point; however considering that its a system
allocation, and all these constraints get violated by interrupts anyway,
its more of an application container than a strict allocation container.

Are you actually seeing the described behaviour, or just being pedantic
(nothing wrong with that per-se)?

I would actually be bold and make it worse by proposing something like
this, which has the benefit of making the reserve threshold system wide.

---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1eef614..870a791 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1630,6 +1630,21 @@ rebalance:
 			&& !in_interrupt()) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
+			/*
+			 * break out of mempolicy boundaries
+			 */
+			zonelist = NODE_DATA(numa_node_id())->node_zonelists +
+				gfp_zone(gfp_mask);
+
+			/*
+			 * Before going bare metal, try to get a page above the
+			 * critical threshold - ignoring CPU sets.
+			 */
+			page = get_page_from_freelist(gfp_mask, order, zonelist,
+					ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER);
+			if (page)
+				goto got_pg;
+
 			/* go through the zonelist yet again, ignoring mins */
 			page = get_page_from_freelist(gfp_mask, order,
 				zonelist, ALLOC_NO_WATERMARKS);



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:09     ` Peter Zijlstra
@ 2007-06-06  7:18       ` David Rientjes
  2007-06-06  7:34         ` Paul Jackson
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2007-06-06  7:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Andi Kleen, Christoph Lameter, Paul Jackson,
	linux-kernel

On Wed, 6 Jun 2007, Peter Zijlstra wrote:

> Right, I see your point; however considering that its a system
> allocation, and all these constraints get violated by interrupts anyway,
> its more of an application container than a strict allocation container.
> 

It is not necessarily system allocations at all.  It is quite possible, as 
described earlier, to mlock a large quantity of memory that exceeds the 
capacity of all nodes in a task's mems_allowed and then when this task 
OOM's to continue allocating memory in get_user_pages() but with the 
TIF_MEMDIE exception that allows it to allocate anywhere.  Then other 
cpusets get their memory infringed upon and they can OOM themselves even 
though there was no preexisting memory pressure.  This happens before we 
return to userspace in the mlock() and thus we cannot properly handle the 
SIGKILL sent down from the OOM killer.

> Are you actually seeing the described behaviour, or just being pedantic
> (nothing wrong with that per-se)?
> 

Yes, as described above.  An exclusive cpuset containing only root, 
system-critical tasks was OOM'd when it was not under any memory pressure 
at all because a separate exclusive cpuset mlock'd a gigantic amount of 
memory and it could not reliably exit because the mlock continued to 
allocate outside its own cpuset and eventually OOM'd system-critical tasks 
or depleated all system memory.  True story.

At the least, if this patch is not agreed upon, I would suggest adding 
sanity checks against gfp_mask in cpuset_zone_allowed() to ensure we 
should allow the allocation in TIF_MEMDIE circumstances to avoid the above 
scenario.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:18       ` David Rientjes
@ 2007-06-06  7:34         ` Paul Jackson
  2007-06-06  7:39           ` Andrew Morton
  2007-06-06  7:48           ` David Rientjes
  0 siblings, 2 replies; 30+ messages in thread
From: Paul Jackson @ 2007-06-06  7:34 UTC (permalink / raw)
  To: David Rientjes; +Cc: peterz, akpm, ak, clameter, linux-kernel

> a separate exclusive cpuset mlock'd a gigantic amount of 
> memory and it could not reliably exit because the mlock continued to 
> allocate outside its own cpuset and eventually OOM'd system-critical tasks 
> or depleated all system memory.

Seems like that mlock code is able then to get great globs of memory
without returning to user space ... perhaps that's where the fix
should be ... that code should quit chewing up memory if it's
marked MEMDIE or some such?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:34         ` Paul Jackson
@ 2007-06-06  7:39           ` Andrew Morton
  2007-06-06  7:48           ` David Rientjes
  1 sibling, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2007-06-06  7:39 UTC (permalink / raw)
  To: Paul Jackson; +Cc: David Rientjes, peterz, ak, clameter, linux-kernel

On Wed, 6 Jun 2007 00:34:21 -0700 Paul Jackson <pj@sgi.com> wrote:

> > a separate exclusive cpuset mlock'd a gigantic amount of 
> > memory and it could not reliably exit because the mlock continued to 
> > allocate outside its own cpuset and eventually OOM'd system-critical tasks 
> > or depleated all system memory.
> 
> Seems like that mlock code is able then to get great globs of memory
> without returning to user space ... perhaps that's where the fix
> should be ... that code should quit chewing up memory if it's
> marked MEMDIE or some such?

yup.  A fix for that is in the pipeline: bale from get_user_pages()
if TIF_MEMDIE is set.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:34         ` Paul Jackson
  2007-06-06  7:39           ` Andrew Morton
@ 2007-06-06  7:48           ` David Rientjes
  2007-06-06  7:56             ` Paul Jackson
  2007-06-06  7:56             ` Peter Zijlstra
  1 sibling, 2 replies; 30+ messages in thread
From: David Rientjes @ 2007-06-06  7:48 UTC (permalink / raw)
  To: Paul Jackson; +Cc: peterz, akpm, ak, clameter, linux-kernel

On Wed, 6 Jun 2007, Paul Jackson wrote:

> Seems like that mlock code is able then to get great globs of memory
> without returning to user space ... perhaps that's where the fix
> should be ... that code should quit chewing up memory if it's
> marked MEMDIE or some such?
> 

That's one case.  Are there others?

The TIF_MEMDIE exception in cpuset_zone_allowed_softwall() allowed this 
problem in mlock().  If it had not been allowed to allocate anywhere 
based simply on the zonelist ordering, the mlock iteration would break 
because it could not handle the fault.

Thus, at the least, we should make sure that memory is not allocated 
outside of a task's mems_allowed unless we do sanity checks against 
gfp_mask in the TIF_MEMDIE case via cpuset_zone_allowed_softwall() to make 
sure a rouge application doesn't cause the same trouble.  That is, unless 
you can guarantee this type of problem will not happen again through any 
other means.  The logic needs to be with the TIF_MEMDIE exception to grant 
access to memory outside the cpuset only when it is relevant to the OOM 
killed task's prompt exit.

		David

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:48           ` David Rientjes
@ 2007-06-06  7:56             ` Paul Jackson
  2007-06-06  8:00               ` Andrew Morton
  2007-06-06  7:56             ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2007-06-06  7:56 UTC (permalink / raw)
  To: David Rientjes; +Cc: peterz, akpm, ak, clameter, linux-kernel

David wrote:
> That is, unless you can guarantee this type of problem will not happen again

Well, I certainly cannot guarantee that.

Heck, I can't even guarantee isn't happening right now, somewhere else.

But I'm no memory guru.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:48           ` David Rientjes
  2007-06-06  7:56             ` Paul Jackson
@ 2007-06-06  7:56             ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2007-06-06  7:56 UTC (permalink / raw)
  To: David Rientjes; +Cc: Paul Jackson, akpm, ak, clameter, linux-kernel

On Wed, 2007-06-06 at 00:48 -0700, David Rientjes wrote:
> On Wed, 6 Jun 2007, Paul Jackson wrote:
> 
> > Seems like that mlock code is able then to get great globs of memory
> > without returning to user space ... perhaps that's where the fix
> > should be ... that code should quit chewing up memory if it's
> > marked MEMDIE or some such?
> > 
> 
> That's one case.  Are there others?
> 
> The TIF_MEMDIE exception in cpuset_zone_allowed_softwall() allowed this 
> problem in mlock().  If it had not been allowed to allocate anywhere 
> based simply on the zonelist ordering, the mlock iteration would break 
> because it could not handle the fault.
> 
> Thus, at the least, we should make sure that memory is not allocated 
> outside of a task's mems_allowed unless we do sanity checks against 
> gfp_mask in the TIF_MEMDIE case via cpuset_zone_allowed_softwall() to make 
> sure a rouge application doesn't cause the same trouble.  That is, unless 
> you can guarantee this type of problem will not happen again through any 
> other means.  The logic needs to be with the TIF_MEMDIE exception to grant 
> access to memory outside the cpuset only when it is relevant to the OOM 
> killed task's prompt exit.

I don't think your patch alone would have been sufficient. With it it
would have depleted the local reserves and then jumped onwards to other
nodes (since the ALLOC_NO_WATERMARKS allocation doesn't have
ALLOC_CPUSET).

Unless there was a mem-policy restricting the zonelist (not sure if
cpusets and mem-policies are independent like that)

But your point stands.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  7:56             ` Paul Jackson
@ 2007-06-06  8:00               ` Andrew Morton
  2007-06-06  8:03                 ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2007-06-06  8:00 UTC (permalink / raw)
  To: Paul Jackson; +Cc: David Rientjes, peterz, ak, clameter, linux-kernel

On Wed, 6 Jun 2007 00:56:10 -0700 Paul Jackson <pj@sgi.com> wrote:

> David wrote:
> > That is, unless you can guarantee this type of problem will not happen again
> 
> Well, I certainly cannot guarantee that.

The only place I can think of where the kernel will sit there allocating
huge amounts of memory like this is in readahead.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally
  2007-06-06  8:00               ` Andrew Morton
@ 2007-06-06  8:03                 ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2007-06-06  8:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Paul Jackson, David Rientjes, ak, clameter, linux-kernel

On Wed, 2007-06-06 at 01:00 -0700, Andrew Morton wrote:
> On Wed, 6 Jun 2007 00:56:10 -0700 Paul Jackson <pj@sgi.com> wrote:
> 
> > David wrote:
> > > That is, unless you can guarantee this type of problem will not happen again
> > 
> > Well, I certainly cannot guarantee that.
> 
> The only place I can think of where the kernel will sit there allocating
> huge amounts of memory like this is in readahead.

We just created one more, the execve string copy. However that uses
get_user_pages(), so if that gets fixed we're save there too.


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2007-06-06  8:03 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-05 22:39 [patch] cpusets: do not allow TIF_MEMDIE tasks to allocate globally David Rientjes
2007-06-05 22:40 ` Christoph Lameter
2007-06-05 22:42   ` David Rientjes
2007-06-05 23:01 ` Paul Jackson
2007-06-05 23:16   ` David Rientjes
2007-06-05 23:19     ` Paul Jackson
2007-06-05 23:20     ` Christoph Lameter
2007-06-05 23:25       ` David Rientjes
2007-06-05 23:32         ` Christoph Lameter
2007-06-05 23:44           ` David Rientjes
2007-06-05 23:55             ` Paul Jackson
2007-06-06  1:17               ` David Rientjes
2007-06-06  1:20                 ` Paul Jackson
2007-06-05 23:57             ` Christoph Lameter
2007-06-06  1:23               ` David Rientjes
2007-06-06  1:32                 ` Christoph Lameter
2007-06-06  1:40                   ` David Rientjes
2007-06-06  1:54                     ` Christoph Lameter
2007-06-06  3:29                       ` David Rientjes
2007-06-06  6:20 ` Peter Zijlstra
2007-06-06  6:42   ` David Rientjes
2007-06-06  7:09     ` Peter Zijlstra
2007-06-06  7:18       ` David Rientjes
2007-06-06  7:34         ` Paul Jackson
2007-06-06  7:39           ` Andrew Morton
2007-06-06  7:48           ` David Rientjes
2007-06-06  7:56             ` Paul Jackson
2007-06-06  8:00               ` Andrew Morton
2007-06-06  8:03                 ` Peter Zijlstra
2007-06-06  7:56             ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox