linux-mm.kvack.org archive mirror
* [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
@ 2011-11-16 21:08 David Rientjes
  2011-11-17  8:29 ` Miao Xie
  2011-11-17 22:22 ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: David Rientjes @ 2011-11-16 21:08 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Miao Xie, KOSAKI Motohiro, Paul Menage, linux-kernel, linux-mm

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
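
(For reference, the read-side pairing involved, as a rough sketch of the
3.1-era helpers rather than the exact code:)

	get_mems_allowed();	/* roughly: current->mems_allowed_change_disable++ */
	/* ... read current->mems_allowed and allocate from an allowed node ... */
	put_mems_allowed();	/* decrement; a stalled cpuset update may proceed */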

This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there
is no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 kernel/cpuset.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,7 +949,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
-	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+	bool need_loop;
 
 repeat:
 	/*
@@ -962,6 +962,14 @@ repeat:
 		return;
 
 	task_lock(tsk);
+	/*
+	 * Determine if a loop is necessary if another thread is doing
+	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * tsk does not have a mempolicy, then an empty nodemask will not be
+	 * possible when mems_allowed is larger than a word.
+	 */
+	need_loop = tsk->mempolicy ||
+			!nodes_intersects(*newmems, tsk->mems_allowed);
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
@@ -981,12 +989,9 @@ repeat:
 
 	/*
 	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.  No wait is necessary, however, if at least one
-	 * node remains unchanged and tsk has a mempolicy that could store an
-	 * empty nodemask.
+	 * for the read-side.
 	 */
-	while (masks_disjoint && tsk->mempolicy &&
-			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
 		task_unlock(tsk);
 		if (!task_curr(tsk))
 			yield();


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-16 21:08 [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask David Rientjes
@ 2011-11-17  8:29 ` Miao Xie
  2011-11-17 21:33   ` David Rientjes
  2011-11-17 22:22 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Miao Xie @ 2011-11-17  8:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 16 Nov 2011 13:08:33 -0800 (pst), David Rientjes wrote:
> c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
> nodes from changing for a thread.  This causes any update to a set of
> allowed nodes to stall until put_mems_allowed() is called.
> 
> This stall is unnecessary, however, if at least one node remains unchanged
> in the update to the set of allowed nodes.  This was addressed by
> 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
> node remains set"), but it's still possible that an empty nodemask may be
> read from a mempolicy because the old nodemask may be remapped to the new
> nodemask during rebind.  To prevent this, only avoid the stall if there
> is no mempolicy for the thread being changed.
> 
> This is a temporary solution until all reads from mempolicy nodemasks can
> be guaranteed to not be empty without the get_mems_allowed()
> synchronization.
> 
> Also moves the check for nodemask intersection inside task_lock() so that
> tsk->mems_allowed cannot change.
> 
> Reported-by: Miao Xie <miaox@cn.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>

Oh~, David

I find there is another problem; please take the following case into account:

  2-3 -> 1-2 -> 0-1

The user changes mems_allowed twice in succession, and the task may see an
empty mems_allowed.

So, it is still dangerous.

Thanks
Miao

> ---
>  kernel/cpuset.c |   17 +++++++++++------
>  1 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -949,7 +949,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
>  static void cpuset_change_task_nodemask(struct task_struct *tsk,
>  					nodemask_t *newmems)
>  {
> -	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
> +	bool need_loop;
>  
>  repeat:
>  	/*
> @@ -962,6 +962,14 @@ repeat:
>  		return;
>  
>  	task_lock(tsk);
> +	/*
> +	 * Determine if a loop is necessary if another thread is doing
> +	 * get_mems_allowed().  If at least one node remains unchanged and
> +	 * tsk does not have a mempolicy, then an empty nodemask will not be
> +	 * possible when mems_allowed is larger than a word.
> +	 */
> +	need_loop = tsk->mempolicy ||
> +			!nodes_intersects(*newmems, tsk->mems_allowed);
>  	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>  	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
>  
> @@ -981,12 +989,9 @@ repeat:
>  
>  	/*
>  	 * Allocation of memory is very fast, we needn't sleep when waiting
> -	 * for the read-side.  No wait is necessary, however, if at least one
> -	 * node remains unchanged and tsk has a mempolicy that could store an
> -	 * empty nodemask.
> +	 * for the read-side.
>  	 */
> -	while (masks_disjoint && tsk->mempolicy &&
> -			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
> +	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
>  		task_unlock(tsk);
>  		if (!task_curr(tsk))
>  			yield();
> 


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-17  8:29 ` Miao Xie
@ 2011-11-17 21:33   ` David Rientjes
  2011-11-18  9:52     ` Miao Xie
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-17 21:33 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Thu, 17 Nov 2011, Miao Xie wrote:

> Oh~, David
> 
> I find these is another problem, please take account of the following case:
> 
>   2-3 -> 1-2 -> 0-1
> 
> the user change mems_allowed twice continuously, the task may see the empty
> mems_allowed.
> 
> So, it is still dangerous.
> 

With this patch, task_lock(tsk) protects both the decision of whether we
want to take the exception, i.e. whether need_loop is false, and the
stores to tsk->mems_allowed.  So a reader would see the nodemask change at
the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state
in between or a jump directly from 2-3 to 0-1.  The only time we don't
hold task_lock(tsk) to change tsk->mems_allowed is when tsk == current,
and in that case we're not concerned about intermediate reads of its own
nodemask while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.

Thus, there's no problem here with regard to such behavior if we exclude 
mempolicies, which this patch does.
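
(To make the ordering concrete, here is a condensed sketch of the resulting
flow; the step-2 half is outside the posted hunks, so treat it as
approximate rather than verbatim:)

	task_lock(tsk);
	need_loop = tsk->mempolicy ||
			!nodes_intersects(*newmems, tsk->mems_allowed);
	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);	/* step 1: old | new */
	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);

	/*
	 * wait loop: while a reader holds get_mems_allowed(), drop
	 * task_lock(), yield and retry from the top
	 */

	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
	tsk->mems_allowed = *newmems;			/* step 2: new mask only */
	task_unlock(tsk);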


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-16 21:08 [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask David Rientjes
  2011-11-17  8:29 ` Miao Xie
@ 2011-11-17 22:22 ` Andrew Morton
  2011-11-17 23:08   ` [patch v2 " David Rientjes
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2011-11-17 22:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 16 Nov 2011 13:08:33 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
> nodes from changing for a thread.  This causes any update to a set of
> allowed nodes to stall until put_mems_allowed() is called.
> 
> This stall is unnecessary, however, if at least one node remains unchanged
> in the update to the set of allowed nodes.  This was addressed by
> 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
> node remains set"), but it's still possible that an empty nodemask may be
> read from a mempolicy because the old nodemask may be remapped to the new
> nodemask during rebind.  To prevent this, only avoid the stall if there
> is no mempolicy for the thread being changed.
> 
> This is a temporary solution until all reads from mempolicy nodemasks can
> be guaranteed to not be empty without the get_mems_allowed()
> synchronization.
> 
> Also moves the check for nodemask intersection inside task_lock() so that
> tsk->mems_allowed cannot change.

The patch doesn't actually apply, due to changes made by your earlier
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if
one node remains set").  Please recheck/redo/resend/etc?


* [patch v2 for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-17 22:22 ` Andrew Morton
@ 2011-11-17 23:08   ` David Rientjes
  2011-11-18  0:00     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-17 23:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.

This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there
is no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.  This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 kernel/cpuset.c |   16 +++++++++++-----
 1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,7 +949,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
-	bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+	bool need_loop;
 
 repeat:
 	/*
@@ -962,6 +962,14 @@ repeat:
 		return;
 
 	task_lock(tsk);
+	/*
+	 * Determine if a loop is necessary if another thread is doing
+	 * get_mems_allowed().  If at least one node remains unchanged and
+	 * tsk does not have a mempolicy, then an empty nodemask will not be
+	 * possible when mems_allowed is larger than a word.
+	 */
+	need_loop = tsk->mempolicy ||
+			!nodes_intersects(*newmems, tsk->mems_allowed);
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
 
@@ -981,11 +989,9 @@ repeat:
 
 	/*
 	 * Allocation of memory is very fast, we needn't sleep when waiting
-	 * for the read-side.  No wait is necessary, however, if at least one
-	 * node remains unchanged.
+	 * for the read-side.
 	 */
-	while (masks_disjoint &&
-			ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+	while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
 		task_unlock(tsk);
 		if (!task_curr(tsk))
 			yield();


* Re: [patch v2 for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-17 23:08   ` [patch v2 " David Rientjes
@ 2011-11-18  0:00     ` Andrew Morton
  2011-11-18 23:53       ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2011-11-18  0:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Thu, 17 Nov 2011 15:08:08 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:

> cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
> 
> c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
> nodes from changing for a thread.  This causes any update to a set of
> allowed nodes to stall until put_mems_allowed() is called.
> 
> This stall is unnecessary, however, if at least one node remains unchanged
> in the update to the set of allowed nodes.  This was addressed by
> 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
> node remains set"), but it's still possible that an empty nodemask may be
> read from a mempolicy because the old nodemask may be remapped to the new
> nodemask during rebind.  To prevent this, only avoid the stall if there
> is no mempolicy for the thread being changed.
> 
> This is a temporary solution until all reads from mempolicy nodemasks can
> be guaranteed to not be empty without the get_mems_allowed()
> synchronization.
> 
> Also moves the check for nodemask intersection inside task_lock() so that
> tsk->mems_allowed cannot change.  This ensures that nothing can set this
> tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Nothing in this changelog makes me understand why you think we need this
change in 3.2.  What are the user-visible effects of this change?


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-17 21:33   ` David Rientjes
@ 2011-11-18  9:52     ` Miao Xie
  2011-11-18 23:49       ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Miao Xie @ 2011-11-18  9:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Thu, 17 Nov 2011 13:33:14 -0800 (pst), David Rientjes wrote:
> On Thu, 17 Nov 2011, Miao Xie wrote:
> 
>> Oh~, David
>>
>> I find these is another problem, please take account of the following case:
>>
>>   2-3 -> 1-2 -> 0-1
>>
>> the user change mems_allowed twice continuously, the task may see the empty
>> mems_allowed.
>>
>> So, it is still dangerous.
>>
> 
> With this patch, we're protected by task_lock(tsk) to determine whether we 
> want to take the exception, i.e. whether need_loop is false, and the 
> setting of tsk->mems_allowed.  So this would see the nodemask change at 
> the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state 
> in between or directly from 2-3 -> 0-1.  The only time we don't hold 
> task_lock(tsk) to change tsk->mems_allowed is when tsk == current and in 
> that case we're not concerned about intermediate reads to its own nodemask 
> while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.
> 
> Thus, there's no problem here with regard to such behavior if we exclude 
> mempolicies, which this patch does.
> 

No.
When the task does memory allocation, it accesses its mems_allowed without
task_lock(tsk), and it may be blocked after it has checked bits 0-1.  Then the
user changes mems_allowed twice in succession (2-3 (initial state) -> 1-2 -> 0-1).
After that, the task is woken up and it sees an empty mems_allowed.
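
(The unlocked read described here is, roughly, the per-zone cpuset check
made from get_page_from_freelist(); a simplified sketch, not the exact
call chain:)

	/*
	 * Simplified from the cpuset softwall check: current->mems_allowed
	 * is read without task_lock(current), and the task can block
	 * between the checks of individual zones/nodes.
	 */
	if (node_isset(zone_to_nid(zone), current->mems_allowed))
		goto try_this_zone;	/* node is allowed in this cpuset */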


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-18  9:52     ` Miao Xie
@ 2011-11-18 23:49       ` David Rientjes
  2011-11-23  2:51         ` Miao Xie
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-18 23:49 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Fri, 18 Nov 2011, Miao Xie wrote:

> >> I find these is another problem, please take account of the following case:
> >>
> >>   2-3 -> 1-2 -> 0-1
> >>
> >> the user change mems_allowed twice continuously, the task may see the empty
> >> mems_allowed.
> >>
> >> So, it is still dangerous.
> >>
> > 
> > With this patch, we're protected by task_lock(tsk) to determine whether we 
> > want to take the exception, i.e. whether need_loop is false, and the 
> > setting of tsk->mems_allowed.  So this would see the nodemask change at 
> > the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state 
> > in between or directly from 2-3 -> 0-1.  The only time we don't hold 
> > task_lock(tsk) to change tsk->mems_allowed is when tsk == current and in 
> > that case we're not concerned about intermediate reads to its own nodemask 
> > while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.
> > 
> > Thus, there's no problem here with regard to such behavior if we exclude 
> > mempolicies, which this patch does.
> > 
> 
> No.
> When the task does memory allocation, it access its mems_allowed without
> task_lock(tsk), and it may be blocked after it check 0-1 bits. And then, the
> user changes mems_allowed twice continuously(2-3(initial state) -> 1-2 -> 0-1),
> After that, the task is woke up and it see the empty mems_allowed.
> 

I'm confused: on a kernel where MAX_NUMNODES > BITS_PER_LONG, you're
concerned about thread A reading a partial tsk->mems_allowed and being
preempted, meanwhile thread B changes tsk->mems_allowed by taking
cgroup_mutex and task_lock(tsk), setting the intersecting nodemask,
releasing both, taking them again, and changing the nodemask again to be
disjoint, and then thread A wakes up, finishes its read, and sees an
inconsistent nodemask because the second mask is disjoint from the part it
read first?


* Re: [patch v2 for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-18  0:00     ` Andrew Morton
@ 2011-11-18 23:53       ` David Rientjes
  0 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2011-11-18 23:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Thu, 17 Nov 2011, Andrew Morton wrote:

> Nothing in this changelog makes me understand why you think we need this
> change in 3.2.  What are the user-visible effects of this change?
> 

Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes and the two nodemasks, as a result of the
remap, are now disjoint.  This fix is a bandaid: we never take the
optimized store to tsk->mems_allowed for intersecting masks if tsk also
has an existing mempolicy, which prevents the possibility of seeing an
empty nodemask during rebind.  For 3.3, I'd like to ensure that remapping
any mempolicy's nodemask can never result in an empty nodemask.
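
(An illustration with made-up node numbers; the rebind line is my reading
of the default mempolicy remap path, so treat it as a sketch:)

	/*
	 * A policy bound to nodes 2-3 while cpuset.mems changes from 2-3 to
	 * 0-1.  The default rebind path does roughly
	 *
	 *	nodes_remap(tmp, pol->v.nodes, *oldmems, *newmems);
	 *
	 * which maps {2,3} onto {0,1}.  The result is disjoint from the
	 * policy's previous nodes, so while the two-step rebind is in
	 * flight a reader that is not under get_mems_allowed() can observe
	 * a nodemask with no bits set.
	 */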


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-18 23:49       ` David Rientjes
@ 2011-11-23  2:51         ` Miao Xie
  2011-11-23  3:32           ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Miao Xie @ 2011-11-23  2:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Fri, 18 Nov 2011 15:49:22 -0800 (pst), David Rientjes wrote:
> On Fri, 18 Nov 2011, Miao Xie wrote:
> 
>>>> I find these is another problem, please take account of the following case:
>>>>
>>>>   2-3 -> 1-2 -> 0-1
>>>>
>>>> the user change mems_allowed twice continuously, the task may see the empty
>>>> mems_allowed.
>>>>
>>>> So, it is still dangerous.
>>>>
>>>
>>> With this patch, we're protected by task_lock(tsk) to determine whether we 
>>> want to take the exception, i.e. whether need_loop is false, and the 
>>> setting of tsk->mems_allowed.  So this would see the nodemask change at 
>>> the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state 
>>> in between or directly from 2-3 -> 0-1.  The only time we don't hold 
>>> task_lock(tsk) to change tsk->mems_allowed is when tsk == current and in 
>>> that case we're not concerned about intermediate reads to its own nodemask 
>>> while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.
>>>
>>> Thus, there's no problem here with regard to such behavior if we exclude 
>>> mempolicies, which this patch does.
>>>
>>
>> No.
>> When the task does memory allocation, it access its mems_allowed without
>> task_lock(tsk), and it may be blocked after it check 0-1 bits. And then, the
>> user changes mems_allowed twice continuously(2-3(initial state) -> 1-2 -> 0-1),
>> After that, the task is woke up and it see the empty mems_allowed.
>>
> 
> I'm confused, you're concerned on a kernel where 
> MAX_NUMNODES > BITS_PER_LONG about thread A reading a partial 
> tsk->mems_allowed, being preempted, meanwhile thread B changes 
> tsk->mems_allowed by taking cgroup_mutex, taking task_lock(tsk), setting 
> the intersecting nodemask, releasing both, taking them again, changing the 
> nodemask again to be disjoint, then the thread A waking up and finishing 
> its read and seeing an intersecting nodemask because it is now disjoint 
> from the first read?
> 

(I am sorry for the late reply, I was on leave for the past few days.)

Yes, what you said is right.
But in fact, the same problem can also occur on a kernel where
MAX_NUMNODES <= BITS_PER_LONG.
	task1			task1's mems	task2
	alloc page		2-3
	  alloc on node1? NO	2-3
				2-3		change mems from 2-3 to 1-2
				1-2		rebind task1's mpol
				1-2		  set new bits
				1-2		change mems from 1-2 to 0-1
				1-2		rebind task1's mpol
				0-1		  set new bits
	  alloc on node2? NO	0-1
	  ...
	can't alloc page
	  goto oom

Thanks


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-23  2:51         ` Miao Xie
@ 2011-11-23  3:32           ` David Rientjes
  2011-11-23  4:48             ` Miao Xie
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-23  3:32 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 23 Nov 2011, Miao Xie wrote:

> Yes, what you said is right.
> But in fact, on the kernel where MAX_NUMNODES <= BITS_PER_LONG, the same problem
> can also occur.
> 	task1			task1's mems	task2
> 	alloc page		2-3
> 	  alloc on node1? NO	2-3
> 				2-3		change mems from 2-3 to 1-2
> 				1-2		rebind task1's mpol
> 				1-2		  set new bits
> 				1-2		change mems from 1-2 to 0-1
> 				1-2		rebind task1's mpol
> 				0-1		  set new bits
> 	  alloc on node2? NO	0-1
> 	  ...
> 	can't alloc page
> 	  goto oom
> 

One of the major reasons why changing cpuset.mems can take >30s is the
lengthy delays in the page allocator, which continuously loops while
trying reclaim or killing a thread and then retrying the allocation.

I think we should be able to optimize this by dropping get_mems_allowed()
when it's not required and moving it into get_page_from_freelist(), which
is the only thing that cares about cpuset_zone_allowed_softwall().

Something like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1643,17 +1643,29 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone **preferred_zone, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
 	struct zone *zone;
-	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
+	nodemask_t *allowednodes = NULL;
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	classzone_idx = zone_idx(preferred_zone);
+	get_mems_allowed();
+	/*
+	 * preferred_zone must come from an allowed node if the allocation is
+	 * constrained to either a mempolicy (nodemask != NULL) or otherwise
+	 * limited by cpusets.
+	 */
+	if (alloc_flags & ALLOC_CPUSET)
+		allowednodes = nodemask ? : &cpuset_current_mems_allowed;
+
+	first_zones_zonelist(zonelist, high_zoneidx, allowednodes, 
+				preferred_zone);
+	classzone_idx = zone_idx(*preferred_zone);
+	allowednodes = NULL;
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -1717,7 +1729,7 @@ zonelist_scan:
 		}
 
 try_this_zone:
-		page = buffered_rmqueue(preferred_zone, zone, order,
+		page = buffered_rmqueue(*preferred_zone, zone, order,
 						gfp_mask, migratetype);
 		if (page)
 			break;
@@ -1731,6 +1743,7 @@ this_zone_full:
 		zlc_active = 0;
 		goto zonelist_scan;
 	}
+	put_mems_allowed();
 	return page;
 }
 
@@ -1832,7 +1845,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone,
+	nodemask_t *nodemask, struct zone **preferred_zone,
 	int migratetype)
 {
 	struct page *page;
@@ -1885,13 +1898,13 @@ out:
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
 	int migratetype, unsigned long *did_some_progress,
 	bool sync_migration)
 {
 	struct page *page;
 
-	if (!order || compaction_deferred(preferred_zone))
+	if (!order || compaction_deferred(*preferred_zone))
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
@@ -1909,8 +1922,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags, preferred_zone,
 				migratetype);
 		if (page) {
-			preferred_zone->compact_considered = 0;
-			preferred_zone->compact_defer_shift = 0;
+			(*preferred_zone)->compact_considered = 0;
+			(*preferred_zone)->compact_defer_shift = 0;
 			count_vm_event(COMPACTSUCCESS);
 			return page;
 		}
@@ -1921,7 +1934,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		 * but not enough to satisfy watermarks.
 		 */
 		count_vm_event(COMPACTFAIL);
-		defer_compaction(preferred_zone);
+		defer_compaction(*preferred_zone);
 
 		cond_resched();
 	}
@@ -1932,7 +1945,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
 	int migratetype, unsigned long *did_some_progress,
 	bool sync_migration)
 {
@@ -1944,7 +1957,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
 	int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
@@ -2001,7 +2014,7 @@ retry:
 static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone,
+	nodemask_t *nodemask, struct zone **preferred_zone,
 	int migratetype)
 {
 	struct page *page;
@@ -2012,7 +2025,8 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+			wait_iff_congested(*preferred_zone, BLK_RW_ASYNC,
+					   HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -2075,7 +2089,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
-	nodemask_t *nodemask, struct zone *preferred_zone,
+	nodemask_t *nodemask, struct zone **preferred_zone,
 	int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -2110,7 +2124,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 restart:
 	if (!(gfp_mask & __GFP_NO_KSWAPD))
 		wake_all_kswapd(order, zonelist, high_zoneidx,
-						zone_idx(preferred_zone));
+						zone_idx(*preferred_zone));
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -2119,14 +2133,6 @@ restart:
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-	/*
-	 * Find the true preferred zone if the allocation is unconstrained by
-	 * cpusets.
-	 */
-	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
-		first_zones_zonelist(zonelist, high_zoneidx, NULL,
-					&preferred_zone);
-
 rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
@@ -2220,7 +2226,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+		wait_iff_congested(*preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	} else {
 		/*
@@ -2277,25 +2283,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	get_mems_allowed();
-	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx,
-				nodemask ? : &cpuset_current_mems_allowed,
-				&preferred_zone);
-	if (!preferred_zone) {
-		put_mems_allowed();
-		return NULL;
-	}
-
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
-			preferred_zone, migratetype);
-	if (unlikely(!page))
+			&preferred_zone, migratetype);
+	if (unlikely(!page)) {
+		if (!preferred_zone)
+			return NULL;
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
-	put_mems_allowed();
+				&preferred_zone, migratetype);
+	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 	return page;

This would significantly reduce the amount of time that it takes to write 
to cpuset.mems because we drop get_mems_allowed() between allocation 
attempts.  We really, really want to do this anyway because it's possible
that a cpuset is being expanded to a larger set of nodes, yet those new
nodes are inaccessible to concurrent memory allocations because the page
allocator is holding get_mems_allowed() while looping and trying to find
more memory.

Comments?


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-23  3:32           ` David Rientjes
@ 2011-11-23  4:48             ` Miao Xie
  2011-11-23  6:25               ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Miao Xie @ 2011-11-23  4:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Tue, 22 Nov 2011 19:32:40 -0800 (pst), David Rientjes wrote:
[SNIP]
> One of the major reasons why changing cpuset.mems can take >30s is because 
> of lengthy delays in the page allocator because it continuously loops 
> while trying reclaim or killing a thread and trying to allocate a page.
> 
> I think we should be able to optimize this by dropping it when it's not 
> required and moving it to get_page_from_freelist() which is the only thing 
> that cares about cpuset_zone_allowed_softwall().
> 
> Something like this:
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1643,17 +1643,29 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
>  static struct page *
>  get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
> -		struct zone *preferred_zone, int migratetype)
> +		struct zone **preferred_zone, int migratetype)
>  {
>  	struct zoneref *z;
>  	struct page *page = NULL;
>  	int classzone_idx;
>  	struct zone *zone;
> -	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
> +	nodemask_t *allowednodes = NULL;
>  	int zlc_active = 0;		/* set if using zonelist_cache */
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
>  
> -	classzone_idx = zone_idx(preferred_zone);
> +	get_mems_allowed();
> +	/*
> +	 * preferred_zone must come from an allowed node if the allocation is
> +	 * constrained to either a mempolicy (nodemask != NULL) or otherwise
> +	 * limited by cpusets.
> +	 */
> +	if (alloc_flags & ALLOC_CPUSET)
> +		allowednodes = nodemask ? : &cpuset_current_mems_allowed;
> +
> +	first_zones_zonelist(zonelist, high_zoneidx, allowednodes, 
> +				preferred_zone);
> +	classzone_idx = zone_idx(*preferred_zone);
> +	allowednodes = NULL;
>  zonelist_scan:
>  	/*
>  	 * Scan zonelist, looking for a zone with enough free.
> @@ -1717,7 +1729,7 @@ zonelist_scan:
>  		}
>  
>  try_this_zone:
> -		page = buffered_rmqueue(preferred_zone, zone, order,
> +		page = buffered_rmqueue(*preferred_zone, zone, order,
>  						gfp_mask, migratetype);
>  		if (page)
>  			break;
> @@ -1731,6 +1743,7 @@ this_zone_full:
>  		zlc_active = 0;
>  		goto zonelist_scan;
>  	}
> +	put_mems_allowed();
>  	return page;
>  }
>  
> @@ -1832,7 +1845,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, struct zone *preferred_zone,
> +	nodemask_t *nodemask, struct zone **preferred_zone,
>  	int migratetype)
>  {
>  	struct page *page;
> @@ -1885,13 +1898,13 @@ out:
>  static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> +	nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
>  	int migratetype, unsigned long *did_some_progress,
>  	bool sync_migration)
>  {
>  	struct page *page;
>  
> -	if (!order || compaction_deferred(preferred_zone))
> +	if (!order || compaction_deferred(*preferred_zone))
>  		return NULL;
>  
>  	current->flags |= PF_MEMALLOC;
> @@ -1909,8 +1922,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  				alloc_flags, preferred_zone,
>  				migratetype);
>  		if (page) {
> -			preferred_zone->compact_considered = 0;
> -			preferred_zone->compact_defer_shift = 0;
> +			(*preferred_zone)->compact_considered = 0;
> +			(*preferred_zone)->compact_defer_shift = 0;
>  			count_vm_event(COMPACTSUCCESS);
>  			return page;
>  		}
> @@ -1921,7 +1934,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		 * but not enough to satisfy watermarks.
>  		 */
>  		count_vm_event(COMPACTFAIL);
> -		defer_compaction(preferred_zone);
> +		defer_compaction(*preferred_zone);
>  
>  		cond_resched();
>  	}
> @@ -1932,7 +1945,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> +	nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
>  	int migratetype, unsigned long *did_some_progress,
>  	bool sync_migration)
>  {
> @@ -1944,7 +1957,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *
>  __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
> +	nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
>  	int migratetype, unsigned long *did_some_progress)
>  {
>  	struct page *page = NULL;
> @@ -2001,7 +2014,7 @@ retry:
>  static inline struct page *
>  __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, struct zone *preferred_zone,
> +	nodemask_t *nodemask, struct zone **preferred_zone,
>  	int migratetype)
>  {
>  	struct page *page;
> @@ -2012,7 +2025,8 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>  			preferred_zone, migratetype);
>  
>  		if (!page && gfp_mask & __GFP_NOFAIL)
> -			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
> +			wait_iff_congested(*preferred_zone, BLK_RW_ASYNC,
> +					   HZ/50);
>  	} while (!page && (gfp_mask & __GFP_NOFAIL));
>  
>  	return page;
> @@ -2075,7 +2089,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> -	nodemask_t *nodemask, struct zone *preferred_zone,
> +	nodemask_t *nodemask, struct zone **preferred_zone,
>  	int migratetype)
>  {
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> @@ -2110,7 +2124,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  restart:
>  	if (!(gfp_mask & __GFP_NO_KSWAPD))
>  		wake_all_kswapd(order, zonelist, high_zoneidx,
> -						zone_idx(preferred_zone));
> +						zone_idx(*preferred_zone));
>  
>  	/*
>  	 * OK, we're below the kswapd watermark and have kicked background
> @@ -2119,14 +2133,6 @@ restart:
>  	 */
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> -	/*
> -	 * Find the true preferred zone if the allocation is unconstrained by
> -	 * cpusets.
> -	 */
> -	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
> -		first_zones_zonelist(zonelist, high_zoneidx, NULL,
> -					&preferred_zone);
> -
>  rebalance:
>  	/* This is the last chance, in general, before the goto nopage. */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> @@ -2220,7 +2226,7 @@ rebalance:
>  	pages_reclaimed += did_some_progress;
>  	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
>  		/* Wait for some write requests to complete then retry */
> -		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		wait_iff_congested(*preferred_zone, BLK_RW_ASYNC, HZ/50);
>  		goto rebalance;
>  	} else {
>  		/*
> @@ -2277,25 +2283,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	if (unlikely(!zonelist->_zonerefs->zone))
>  		return NULL;
>  
> -	get_mems_allowed();
> -	/* The preferred zone is used for statistics later */
> -	first_zones_zonelist(zonelist, high_zoneidx,
> -				nodemask ? : &cpuset_current_mems_allowed,
> -				&preferred_zone);
> -	if (!preferred_zone) {
> -		put_mems_allowed();
> -		return NULL;
> -	}
> -
>  	/* First allocation attempt */
>  	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
> -			preferred_zone, migratetype);
> -	if (unlikely(!page))
> +			&preferred_zone, migratetype);
> +	if (unlikely(!page)) {
> +		if (!preferred_zone)
> +			return NULL;
>  		page = __alloc_pages_slowpath(gfp_mask, order,
>  				zonelist, high_zoneidx, nodemask,
> -				preferred_zone, migratetype);
> -	put_mems_allowed();
> +				&preferred_zone, migratetype);
> +	}
>  
>  	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
>  	return page;
> 
> This would significantly reduce the amount of time that it takes to write 
> to cpuset.mems because we drop get_mems_allowed() between allocation 
> attempts.  We really, really want to do this anyway because it's possible 
> that a cpuset is being expanded to a larger set of nodes and they are 
> inaccessible to concurrent memory allocations because the page allocator 
> is holding get_mems_allowed() while looping and trying to find more 
> memory.
> 
> Comments?

This is a good idea.  But I worry that oom will happen easily, because we
do direct reclaim and compaction based on mems_allowed.

I think we can fix it by using a sequence counter so that tasks know their
mems_allowed has changed, and the task will then do direct reclaim and
compaction again.
The other method is to make tasks ignore their mems_allowed when they do
direct reclaim and compaction the second time.
With either of these two methods, we can drop get_mems_allowed() as much
as possible.
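
(Roughly, the sequence-counter idea could look like the sketch below;
mems_allowed_seq is a hypothetical per-task field used only for
illustration, not an existing interface:)

	/* writer side (cpuset update), sketch only */
	task_lock(tsk);
	write_seqcount_begin(&tsk->mems_allowed_seq);
	tsk->mems_allowed = *newmems;
	write_seqcount_end(&tsk->mems_allowed_seq);
	task_unlock(tsk);

	/* reader side (allocator slowpath), sketch only */
	do {
		seq = read_seqcount_begin(&current->mems_allowed_seq);
		page = get_page_from_freelist(...);
	} while (!page && read_seqcount_retry(&current->mems_allowed_seq, seq));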

Thanks


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-23  4:48             ` Miao Xie
@ 2011-11-23  6:25               ` David Rientjes
  2011-11-23  7:49                 ` Miao Xie
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-23  6:25 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 23 Nov 2011, Miao Xie wrote:

> This is a good idea. But I worry that oom will happen easily, because we do
> direct reclamation and compact by mems_allowed.
> 

Memory compaction actually iterates through each zone regardless of 
whether it's allowed or not in the current context.  Recall that the 
nodemask passed into __alloc_pages_nodemask() is non-NULL only when there 
is a mempolicy that restricts the allocations by MPOL_BIND.  That nodemask 
is not protected by get_mems_allowed(), so there's no change in 
compaction's behavior with my patch.

Direct reclaim does, however, require that mems_allowed stay constant to
avoid the risk of early oom that you mentioned.  It has its own
get_mems_allowed(), though, so the mask doesn't have the opportunity to
change until returning to the page allocator.  It's possible that
mems_allowed will be different on the next call to
get_page_from_freelist(), but we
don't know anything about that context: it's entirely possible that the 
set of new mems has an abundance of free memory or are completely depleted 
as well.  So there's no strict need for consistency between the set of 
allowed nodes during reclaim and the subsequent allocation attempt.  All 
we care about is that reclaim has a consistent set of allowed nodes to 
determine whether it's making progress or not.


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-23  6:25               ` David Rientjes
@ 2011-11-23  7:49                 ` Miao Xie
  2011-11-23 22:26                   ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Miao Xie @ 2011-11-23  7:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Tue, 22 Nov 2011 22:25:46 -0800 (pst), David Rientjes wrote:
> On Wed, 23 Nov 2011, Miao Xie wrote:
> 
>> This is a good idea. But I worry that oom will happen easily, because we do
>> direct reclamation and compact by mems_allowed.
>>
> 
> Memory compaction actually iterates through each zone regardless of 
> whether it's allowed or not in the current context.  Recall that the 
> nodemask passed into __alloc_pages_nodemask() is non-NULL only when there 
> is a mempolicy that restricts the allocations by MPOL_BIND.  That nodemask 
> is not protected by get_mems_allowed(), so there's no change in 
> compaction's behavior with my patch.

That nodemask is also protected by get_mems_allowed().

> Direct reclaim does, however, require mems_allowed staying constant 
> without the risk of early oom as you mentioned.  It has its own 
> get_mems_allowed(), though, so it doesn't have the opportunity to change 
> until returning to the page allocator.  It's possible that mems_allowed 
> will be different on the next call to get_pages_from_freelist() but we 
> don't know anything about that context: it's entirely possible that the 
> set of new mems has an abundance of free memory or are completely depleted 
> as well.  So there's no strict need for consistency between the set of 
> allowed nodes during reclaim and the subsequent allocation attempt.  All 
> we care about is that reclaim has a consistent set of allowed nodes to 
> determine whether it's making progress or not.
> 

Agree.


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-23  7:49                 ` Miao Xie
@ 2011-11-23 22:26                   ` David Rientjes
  2011-11-24  1:26                     ` Miao Xie
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-23 22:26 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 23 Nov 2011, Miao Xie wrote:

> That nodemask is also protected by get_mems_allowed().
> 

So what's the problem with the patch?  Can I add your acked-by?


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-23 22:26                   ` David Rientjes
@ 2011-11-24  1:26                     ` Miao Xie
  2011-11-24  1:52                       ` David Rientjes
  0 siblings, 1 reply; 18+ messages in thread
From: Miao Xie @ 2011-11-24  1:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 23 Nov 2011 14:26:40 -0800 (pst), David Rientjes wrote:
> On Wed, 23 Nov 2011, Miao Xie wrote:
> 
>> That nodemask is also protected by get_mems_allowed().
>>
> 
> So what's the problem with the patch? 

Memory compaction will see an empty nodemask and do nothing, and then
we may fail to get several contiguous pages.


> Can I add your acked-by?

Sure, no problem.


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-24  1:26                     ` Miao Xie
@ 2011-11-24  1:52                       ` David Rientjes
  2011-11-24  2:50                         ` Miao Xie
  0 siblings, 1 reply; 18+ messages in thread
From: David Rientjes @ 2011-11-24  1:52 UTC (permalink / raw)
  To: Miao Xie
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Thu, 24 Nov 2011, Miao Xie wrote:

> Memory compaction will see an empty nodemask and do nothing, and then
> we may fail to get several contiguous pages.
> 

Where do we rely on a consistent nodemask for memory compaction?


* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  2011-11-24  1:52                       ` David Rientjes
@ 2011-11-24  2:50                         ` Miao Xie
  0 siblings, 0 replies; 18+ messages in thread
From: Miao Xie @ 2011-11-24  2:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
	linux-kernel, linux-mm

On Wed, 23 Nov 2011 17:52:08 -0800 (pst), David Rientjes wrote:
> On Thu, 24 Nov 2011, Miao Xie wrote:
> 
>> Memory compaction will see an empty nodemask and do nothing, and then
>> we may fail to get several contiguous pages.
>>
> 
> Where do we rely on a consistent nodemask for memory compaction?

Maybe I am overcautious.


Thread overview: 18+ messages
2011-11-16 21:08 [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask David Rientjes
2011-11-17  8:29 ` Miao Xie
2011-11-17 21:33   ` David Rientjes
2011-11-18  9:52     ` Miao Xie
2011-11-18 23:49       ` David Rientjes
2011-11-23  2:51         ` Miao Xie
2011-11-23  3:32           ` David Rientjes
2011-11-23  4:48             ` Miao Xie
2011-11-23  6:25               ` David Rientjes
2011-11-23  7:49                 ` Miao Xie
2011-11-23 22:26                   ` David Rientjes
2011-11-24  1:26                     ` Miao Xie
2011-11-24  1:52                       ` David Rientjes
2011-11-24  2:50                         ` Miao Xie
2011-11-17 22:22 ` Andrew Morton
2011-11-17 23:08   ` [patch v2 " David Rientjes
2011-11-18  0:00     ` Andrew Morton
2011-11-18 23:53       ` David Rientjes
