* [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-16 21:08 UTC
To: Andrew Morton, Linus Torvalds
Cc: Miao Xie, KOSAKI Motohiro, Paul Menage, linux-kernel, linux-mm
c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread. This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes. This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind. To prevent this, only avoid the stall if there
is no mempolicy for the thread being changed.
This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.
Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.
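For context, the read side that this stall waits out looks roughly like the
sketch below (illustrative only: the helper is made up, but
get_mems_allowed(), put_mems_allowed(), and node_isset() are the interfaces
involved):

#include <linux/cpuset.h>
#include <linux/nodemask.h>
#include <linux/sched.h>

/*
 * Sketch of a reader: get_mems_allowed() pins current->mems_allowed so
 * that cpuset_change_task_nodemask() must stall until the matching
 * put_mems_allowed().
 */
static bool node_currently_allowed(int nid)
{
	bool allowed;

	get_mems_allowed();
	allowed = node_isset(nid, current->mems_allowed);
	put_mems_allowed();

	return allowed;
}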
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
kernel/cpuset.c | 17 +++++++++++------
1 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,7 +949,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
static void cpuset_change_task_nodemask(struct task_struct *tsk,
nodemask_t *newmems)
{
- bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+ bool need_loop;
repeat:
/*
@@ -962,6 +962,14 @@ repeat:
return;
task_lock(tsk);
+ /*
+ * Determine if a loop is necessary if another thread is doing
+ * get_mems_allowed(). If at least one node remains unchanged and
+ * tsk does not have a mempolicy, then an empty nodemask will not be
+ * possible when mems_allowed is larger than a word.
+ */
+ need_loop = tsk->mempolicy ||
+ !nodes_intersects(*newmems, tsk->mems_allowed);
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
@@ -981,12 +989,9 @@ repeat:
/*
* Allocation of memory is very fast, we needn't sleep when waiting
- * for the read-side. No wait is necessary, however, if at least one
- * node remains unchanged and tsk has a mempolicy that could store an
- * empty nodemask.
+ * for the read-side.
*/
- while (masks_disjoint && tsk->mempolicy &&
- ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+ while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
task_unlock(tsk);
if (!task_curr(tsk))
yield();
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-17 8:29 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 16 Nov 2011 13:08:33 -0800 (pst), David Rientjes wrote:
> c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
> nodes from changing for a thread. This causes any update to a set of
> allowed nodes to stall until put_mems_allowed() is called.
>
> This stall is unnecessary, however, if at least one node remains unchanged
> in the update to the set of allowed nodes. This was addressed by
> 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
> node remains set"), but it's still possible that an empty nodemask may be
> read from a mempolicy because the old nodemask may be remapped to the new
> nodemask during rebind. To prevent this, only avoid the stall if there
> is no mempolicy for the thread being changed.
>
> This is a temporary solution until all reads from mempolicy nodemasks can
> be guaranteed to not be empty without the get_mems_allowed()
> synchronization.
>
> Also moves the check for nodemask intersection inside task_lock() so that
> tsk->mems_allowed cannot change.
>
> Reported-by: Miao Xie <miaox@cn.fujitsu.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
Oh~, David
I have found another problem; please consider the following case:
2-3 -> 1-2 -> 0-1
If the user changes mems_allowed twice in succession, the task may see an
empty mems_allowed.
So it is still dangerous.
Thanks
Miao
> [patch snipped]
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-17 21:33 UTC
To: Miao Xie
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Thu, 17 Nov 2011, Miao Xie wrote:
> Oh~, David
>
> I find these is another problem, please take account of the following case:
>
> 2-3 -> 1-2 -> 0-1
>
> the user change mems_allowed twice continuously, the task may see the empty
> mems_allowed.
>
> So, it is still dangerous.
>
With this patch, both the check that decides whether we take the exception
(i.e. whether need_loop is false) and the store to tsk->mems_allowed are
protected by task_lock(tsk). So a reader would see the nodemask change in
individual steps, from 2-3 to 1-2 to 0-1, never some inconsistent state in
between or a direct jump from 2-3 to 0-1. The only time we don't hold
task_lock(tsk) to change tsk->mems_allowed is when tsk == current, and in
that case we're not concerned about the task's intermediate reads of its own
nodemask while it stores to a mask where MAX_NUMNODES > BITS_PER_LONG.
Thus, there's no problem here with regard to such behavior if we exclude
mempolicies, which this patch does.
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Andrew Morton @ 2011-11-17 22:22 UTC
To: David Rientjes
Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 16 Nov 2011 13:08:33 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
> nodes from changing for a thread. This causes any update to a set of
> allowed nodes to stall until put_mems_allowed() is called.
>
> This stall is unnecessary, however, if at least one node remains unchanged
> in the update to the set of allowed nodes. This was addressed by
> 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
> node remains set"), but it's still possible that an empty nodemask may be
> read from a mempolicy because the old nodemask may be remapped to the new
> nodemask during rebind. To prevent this, only avoid the stall if there
> is no mempolicy for the thread being changed.
>
> This is a temporary solution until all reads from mempolicy nodemasks can
> be guaranteed to not be empty without the get_mems_allowed()
> synchronization.
>
> Also moves the check for nodemask intersection inside task_lock() so that
> tsk->mems_allowed cannot change.
The patch doesn't actually apply, due to changes made by your earlier
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if
one node remains set"). Please recheck/redo/resend/etc?
* [patch v2 for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-17 23:08 UTC
To: Andrew Morton
Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread. This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes. This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind. To prevent this, only avoid the stall if there
is no mempolicy for the thread being changed.
This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.
Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change. This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
kernel/cpuset.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,7 +949,7 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
static void cpuset_change_task_nodemask(struct task_struct *tsk,
nodemask_t *newmems)
{
- bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+ bool need_loop;
repeat:
/*
@@ -962,6 +962,14 @@ repeat:
return;
task_lock(tsk);
+ /*
+ * Determine if a loop is necessary if another thread is doing
+ * get_mems_allowed(). If at least one node remains unchanged and
+ * tsk does not have a mempolicy, then an empty nodemask will not be
+ * possible when mems_allowed is larger than a word.
+ */
+ need_loop = tsk->mempolicy ||
+ !nodes_intersects(*newmems, tsk->mems_allowed);
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
@@ -981,11 +989,9 @@ repeat:
/*
* Allocation of memory is very fast, we needn't sleep when waiting
- * for the read-side. No wait is necessary, however, if at least one
- * node remains unchanged.
+ * for the read-side.
*/
- while (masks_disjoint &&
- ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+ while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
task_unlock(tsk);
if (!task_curr(tsk))
yield();
* Re: [patch v2 for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Andrew Morton @ 2011-11-18 0:00 UTC
To: David Rientjes
Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Thu, 17 Nov 2011 15:08:08 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
>
> c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
> nodes from changing for a thread. This causes any update to a set of
> allowed nodes to stall until put_mems_allowed() is called.
>
> This stall is unnecessary, however, if at least one node remains unchanged
> in the update to the set of allowed nodes. This was addressed by
> 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
> node remains set"), but it's still possible that an empty nodemask may be
> read from a mempolicy because the old nodemask may be remapped to the new
> nodemask during rebind. To prevent this, only avoid the stall if there
> is no mempolicy for the thread being changed.
>
> This is a temporary solution until all reads from mempolicy nodemasks can
> be guaranteed to not be empty without the get_mems_allowed()
> synchronization.
>
> Also moves the check for nodemask intersection inside task_lock() so that
> tsk->mems_allowed cannot change. This ensures that nothing can set this
> tsk's mems_allowed out from under us and also protects tsk->mempolicy.
Nothing in this changelog makes me understand why you think we need this
change in 3.2. What are the user-visible effects of this change?
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-18 9:52 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Thu, 17 Nov 2011 13:33:14 -0800 (pst), David Rientjes wrote:
> On Thu, 17 Nov 2011, Miao Xie wrote:
>
>> Oh~, David
>>
>> I find these is another problem, please take account of the following case:
>>
>> 2-3 -> 1-2 -> 0-1
>>
>> the user change mems_allowed twice continuously, the task may see the empty
>> mems_allowed.
>>
>> So, it is still dangerous.
>>
>
> With this patch, we're protected by task_lock(tsk) to determine whether we
> want to take the exception, i.e. whether need_loop is false, and the
> setting of tsk->mems_allowed. So this would see the nodemask change at
> the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state
> in between or directly from 2-3 -> 0-1. The only time we don't hold
> task_lock(tsk) to change tsk->mems_allowed is when tsk == current and in
> that case we're not concerned about intermediate reads to its own nodemask
> while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.
>
> Thus, there's no problem here with regard to such behavior if we exclude
> mempolicies, which this patch does.
>
No.
When the task allocates memory, it accesses its mems_allowed without
task_lock(tsk), and it may be blocked after it has checked bits 0-1. Then the
user changes mems_allowed twice in succession (2-3 (initial state) -> 1-2 -> 0-1).
After that, the task is woken up and sees an empty mems_allowed.
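To make the torn read concrete, here is a tiny userspace toy (not kernel
code; pretend each word of the mask holds only two node bits, so nodes 2-3
live in the second word):

#include <stdio.h>

int main(void)
{
	/* toy layout: word 0 holds nodes 0-1, word 1 holds nodes 2-3 */
	unsigned long old_mask[2] = { 0x0, 0x3 };	/* nodes 2-3 */
	unsigned long new_mask[2] = { 0x3, 0x0 };	/* nodes 0-1 */

	/* the reader copies word 0 before the update ... */
	unsigned long seen0 = old_mask[0];		/* 0: no low nodes  */
	/* ... is blocked while the writer installs new_mask ... */
	/* ... and copies word 1 after the update */
	unsigned long seen1 = new_mask[1];		/* 0: no high nodes */

	printf("observed mask: %#lx %#lx (empty)\n", seen0, seen1);
	return 0;
}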
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-18 23:49 UTC
To: Miao Xie
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Fri, 18 Nov 2011, Miao Xie wrote:
> >> I find these is another problem, please take account of the following case:
> >>
> >> 2-3 -> 1-2 -> 0-1
> >>
> >> the user change mems_allowed twice continuously, the task may see the empty
> >> mems_allowed.
> >>
> >> So, it is still dangerous.
> >>
> >
> > With this patch, we're protected by task_lock(tsk) to determine whether we
> > want to take the exception, i.e. whether need_loop is false, and the
> > setting of tsk->mems_allowed. So this would see the nodemask change at
> > the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state
> > in between or directly from 2-3 -> 0-1. The only time we don't hold
> > task_lock(tsk) to change tsk->mems_allowed is when tsk == current and in
> > that case we're not concerned about intermediate reads to its own nodemask
> > while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.
> >
> > Thus, there's no problem here with regard to such behavior if we exclude
> > mempolicies, which this patch does.
> >
>
> No.
> When the task does memory allocation, it access its mems_allowed without
> task_lock(tsk), and it may be blocked after it check 0-1 bits. And then, the
> user changes mems_allowed twice continuously(2-3(initial state) -> 1-2 -> 0-1),
> After that, the task is woke up and it see the empty mems_allowed.
>
I'm confused. You're concerned that, on a kernel where
MAX_NUMNODES > BITS_PER_LONG, thread A reads a partial tsk->mems_allowed and
is preempted; meanwhile thread B changes tsk->mems_allowed by taking
cgroup_mutex and task_lock(tsk), setting an intersecting nodemask, releasing
both, then taking them again and changing the nodemask once more so that it
is disjoint; and then thread A wakes up, finishes its read, and sees an empty
nodemask because the new mask is disjoint from the bits it read first?
* Re: [patch v2 for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-18 23:53 UTC
To: Andrew Morton
Cc: Linus Torvalds, Miao Xie, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Thu, 17 Nov 2011, Andrew Morton wrote:
> Nothing in this changelog makes me understand why you think we need this
> change in 3.2. What are the user-visible effects of this change?
>
Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes such that the two nodemasks, as a result of
the remap, are now disjoint. This fix is a bandaid: if tsk also has a
mempolicy, we never take the optimization that skips the stall when the
stores to tsk->mems_allowed intersect, which prevents the possibility of
seeing an empty nodemask during rebind. For 3.3, I'd like to ensure that
remapping any mempolicy's nodemask can never result in an empty nodemask.
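As a concrete illustration of that remap (a sketch using the generic
nodemask helpers rather than the actual mpol_rebind_*() paths):

#include <linux/nodemask.h>

/* Sketch: old mempolicy nodes 2-3 remapped onto new cpuset mems 0-1. */
static void remap_example(void)
{
	nodemask_t oldmask = NODE_MASK_NONE;
	nodemask_t newmems = NODE_MASK_NONE;
	nodemask_t remapped;

	node_set(2, oldmask);
	node_set(3, oldmask);
	node_set(0, newmems);
	node_set(1, newmems);

	/* the n-th set bit of oldmask maps to the n-th set bit of newmems */
	nodes_remap(remapped, oldmask, oldmask, newmems);
	/* remapped is now 0-1, completely disjoint from oldmask (2-3) */
}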
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-23 2:51 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Fri, 18 Nov 2011 15:49:22 -0800 (pst), David Rientjes wrote:
> On Fri, 18 Nov 2011, Miao Xie wrote:
>
>>>> I find these is another problem, please take account of the following case:
>>>>
>>>> 2-3 -> 1-2 -> 0-1
>>>>
>>>> the user change mems_allowed twice continuously, the task may see the empty
>>>> mems_allowed.
>>>>
>>>> So, it is still dangerous.
>>>>
>>>
>>> With this patch, we're protected by task_lock(tsk) to determine whether we
>>> want to take the exception, i.e. whether need_loop is false, and the
>>> setting of tsk->mems_allowed. So this would see the nodemask change at
>>> the individual steps from 2-3 -> 1-2 -> 0-1, not some inconsistent state
>>> in between or directly from 2-3 -> 0-1. The only time we don't hold
>>> task_lock(tsk) to change tsk->mems_allowed is when tsk == current and in
>>> that case we're not concerned about intermediate reads to its own nodemask
>>> while storing to a mask where MAX_NUMNODES > BITS_PER_LONG.
>>>
>>> Thus, there's no problem here with regard to such behavior if we exclude
>>> mempolicies, which this patch does.
>>>
>>
>> No.
>> When the task does memory allocation, it access its mems_allowed without
>> task_lock(tsk), and it may be blocked after it check 0-1 bits. And then, the
>> user changes mems_allowed twice continuously(2-3(initial state) -> 1-2 -> 0-1),
>> After that, the task is woke up and it see the empty mems_allowed.
>>
>
> I'm confused, you're concerned on a kernel where
> MAX_NUMNODES > BITS_PER_LONG about thread A reading a partial
> tsk->mems_allowed, being preempted, meanwhile thread B changes
> tsk->mems_allowed by taking cgroup_mutex, taking task_lock(tsk), setting
> the intersecting nodemask, releasing both, taking them again, changing the
> nodemask again to be disjoint, then the thread A waking up and finishing
> its read and seeing an intersecting nodemask because it is now disjoint
> from the first read?
>
(I am sorry for the late reply, I was on leave for the past few days.)
Yes, what you said is right.
But in fact, the same problem can also occur on kernels where
MAX_NUMNODES <= BITS_PER_LONG:
task1                     task1's mems    task2
alloc page                2-3
alloc on node1? NO        2-3
                          2-3             change mems from 2-3 to 1-2
                          1-2             rebind task1's mpol
                          1-2             set new bits
                          1-2             change mems from 1-2 to 0-1
                          1-2             rebind task1's mpol
                          0-1             set new bits
alloc on node2? NO        0-1
...
can't alloc page
goto oom
Thanks
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-23 3:32 UTC
To: Miao Xie
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 23 Nov 2011, Miao Xie wrote:
> Yes, what you said is right.
> But in fact, on the kernel where MAX_NUMNODES <= BITS_PER_LONG, the same problem
> can also occur.
> task1                     task1's mems    task2
> alloc page                2-3
> alloc on node1? NO        2-3
>                           2-3             change mems from 2-3 to 1-2
>                           1-2             rebind task1's mpol
>                           1-2             set new bits
>                           1-2             change mems from 1-2 to 0-1
>                           1-2             rebind task1's mpol
>                           0-1             set new bits
> alloc on node2? NO        0-1
> ...
> can't alloc page
> goto oom
>
One of the major reasons why changing cpuset.mems can take >30s is the
lengthy delays in the page allocator: it continuously loops, trying reclaim
or killing a thread and then retrying the allocation.
I think we should be able to optimize this by dropping get_mems_allowed()
when it's not required and moving it into get_page_from_freelist(), which is
the only thing that cares about cpuset_zone_allowed_softwall().
Something like this:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1643,17 +1643,29 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
- struct zone *preferred_zone, int migratetype)
+ struct zone **preferred_zone, int migratetype)
{
struct zoneref *z;
struct page *page = NULL;
int classzone_idx;
struct zone *zone;
- nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
+ nodemask_t *allowednodes = NULL;
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
- classzone_idx = zone_idx(preferred_zone);
+ get_mems_allowed();
+ /*
+ * preferred_zone must come from an allowed node if the allocation is
+ * constrained to either a mempolicy (nodemask != NULL) or otherwise
+ * limited by cpusets.
+ */
+ if (alloc_flags & ALLOC_CPUSET)
+ allowednodes = nodemask ? : &cpuset_current_mems_allowed;
+
+ first_zones_zonelist(zonelist, high_zoneidx, allowednodes,
+ preferred_zone);
+ classzone_idx = zone_idx(*preferred_zone);
+ allowednodes = NULL;
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
@@ -1717,7 +1729,7 @@ zonelist_scan:
}
try_this_zone:
- page = buffered_rmqueue(preferred_zone, zone, order,
+ page = buffered_rmqueue(*preferred_zone, zone, order,
gfp_mask, migratetype);
if (page)
break;
@@ -1731,6 +1743,7 @@ this_zone_full:
zlc_active = 0;
goto zonelist_scan;
}
+ put_mems_allowed();
return page;
}
@@ -1832,7 +1845,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone,
+ nodemask_t *nodemask, struct zone **preferred_zone,
int migratetype)
{
struct page *page;
@@ -1885,13 +1898,13 @@ out:
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
int migratetype, unsigned long *did_some_progress,
bool sync_migration)
{
struct page *page;
- if (!order || compaction_deferred(preferred_zone))
+ if (!order || compaction_deferred(*preferred_zone))
return NULL;
current->flags |= PF_MEMALLOC;
@@ -1909,8 +1922,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
alloc_flags, preferred_zone,
migratetype);
if (page) {
- preferred_zone->compact_considered = 0;
- preferred_zone->compact_defer_shift = 0;
+ (*preferred_zone)->compact_considered = 0;
+ (*preferred_zone)->compact_defer_shift = 0;
count_vm_event(COMPACTSUCCESS);
return page;
}
@@ -1921,7 +1934,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
* but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);
- defer_compaction(preferred_zone);
+ defer_compaction(*preferred_zone);
cond_resched();
}
@@ -1932,7 +1945,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
int migratetype, unsigned long *did_some_progress,
bool sync_migration)
{
@@ -1944,7 +1957,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ nodemask_t *nodemask, int alloc_flags, struct zone **preferred_zone,
int migratetype, unsigned long *did_some_progress)
{
struct page *page = NULL;
@@ -2001,7 +2014,7 @@ retry:
static inline struct page *
__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone,
+ nodemask_t *nodemask, struct zone **preferred_zone,
int migratetype)
{
struct page *page;
@@ -2012,7 +2025,8 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
- wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+ wait_iff_congested(*preferred_zone, BLK_RW_ASYNC,
+ HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
return page;
@@ -2075,7 +2089,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone,
+ nodemask_t *nodemask, struct zone **preferred_zone,
int migratetype)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -2110,7 +2124,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
restart:
if (!(gfp_mask & __GFP_NO_KSWAPD))
wake_all_kswapd(order, zonelist, high_zoneidx,
- zone_idx(preferred_zone));
+ zone_idx(*preferred_zone));
/*
* OK, we're below the kswapd watermark and have kicked background
@@ -2119,14 +2133,6 @@ restart:
*/
alloc_flags = gfp_to_alloc_flags(gfp_mask);
- /*
- * Find the true preferred zone if the allocation is unconstrained by
- * cpusets.
- */
- if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
- first_zones_zonelist(zonelist, high_zoneidx, NULL,
- &preferred_zone);
-
rebalance:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
@@ -2220,7 +2226,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+ wait_iff_congested(*preferred_zone, BLK_RW_ASYNC, HZ/50);
goto rebalance;
} else {
/*
@@ -2277,25 +2283,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;
- get_mems_allowed();
- /* The preferred zone is used for statistics later */
- first_zones_zonelist(zonelist, high_zoneidx,
- nodemask ? : &cpuset_current_mems_allowed,
- &preferred_zone);
- if (!preferred_zone) {
- put_mems_allowed();
- return NULL;
- }
-
/* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
- preferred_zone, migratetype);
- if (unlikely(!page))
+ &preferred_zone, migratetype);
+ if (unlikely(!page)) {
+ if (!preferred_zone)
+ return NULL;
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
- preferred_zone, migratetype);
- put_mems_allowed();
+ &preferred_zone, migratetype);
+ }
trace_mm_page_alloc(page, order, gfp_mask, migratetype);
return page;
This would significantly reduce the amount of time that it takes to write
to cpuset.mems because we drop get_mems_allowed() between allocation
attempts. We really, really want to do this anyway because it's possible
that a cpuset is being expanded to a larger set of nodes and they are
inaccessible to concurrent memory allocations because the page allocator
is holding get_mems_allowed() while looping and trying to find more
memory.
Comments?
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-23 4:48 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Tue, 22 Nov 2011 19:32:40 -0800 (pst), David Rientjes wrote:
[SNIP]
> One of the major reasons why changing cpuset.mems can take >30s is because
> of lengthy delays in the page allocator because it continuously loops
> while trying reclaim or killing a thread and trying to allocate a page.
>
> I think we should be able to optimize this by dropping it when it's not
> required and moving it to get_page_from_freelist() which is the only thing
> that cares about cpuset_zone_allowed_softwall().
>
> Something like this:
>
> [patch snipped]
> This would significantly reduce the amount of time that it takes to write
> to cpuset.mems because we drop get_mems_allowed() between allocation
> attempts. We really, really want to do this anyway because it's possible
> that a cpuset is being expanded to a larger set of nodes and they are
> inaccessible to concurrent memory allocations because the page allocator
> is holding get_mems_allowed() while looping and trying to find more
> memory.
>
> Comments?
This is a good idea. But I worry that oom will happen easily, because we do
direct reclaim and compaction according to mems_allowed.
I think we can fix it by using a sequence counter to let tasks know when
their mems_allowed has changed, so that they do direct reclaim and
compaction again.
The other method is to make tasks ignore their mems_allowed when they do
direct reclaim and compaction the second time.
Either way, we can drop get_mems_allowed() as much as possible.
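A rough sketch of the sequence-counter idea (the seqcount here is
hypothetical and passed in by the caller; it is not an existing field):

#include <linux/nodemask.h>
#include <linux/sched.h>
#include <linux/seqlock.h>

/*
 * Writers (serialized by the caller, e.g. under task_lock()) bump a
 * seqcount around the nodemask update; readers snapshot the mask and
 * retry if it changed underneath them, so direct reclaim/compaction
 * can be redone with the fresh mems_allowed.
 */
static void update_mems_allowed(struct task_struct *tsk,
				const nodemask_t *newmems,
				seqcount_t *mems_seq)
{
	write_seqcount_begin(mems_seq);
	tsk->mems_allowed = *newmems;
	write_seqcount_end(mems_seq);
}

static void snapshot_mems_allowed(struct task_struct *tsk, nodemask_t *copy,
				  seqcount_t *mems_seq)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(mems_seq);
		*copy = tsk->mems_allowed;
	} while (read_seqcount_retry(mems_seq, seq));
}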
Thanks
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-23 6:25 UTC
To: Miao Xie
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 23 Nov 2011, Miao Xie wrote:
> This is a good idea. But I worry that oom will happen easily, because we do
> direct reclamation and compact by mems_allowed.
>
Memory compaction actually iterates through each zone regardless of
whether it's allowed or not in the current context. Recall that the
nodemask passed into __alloc_pages_nodemask() is non-NULL only when there
is a mempolicy that restricts the allocations by MPOL_BIND. That nodemask
is not protected by get_mems_allowed(), so there's no change in
compaction's behavior with my patch.
Direct reclaim does, however, require that mems_allowed stay constant to
avoid the risk of early oom that you mentioned. It has its own
get_mems_allowed(), though, so the mask doesn't have the opportunity to
change until reclaim returns to the page allocator. It's possible that
mems_allowed will be different on the next call to get_page_from_freelist(),
but we don't know anything about that context: it's entirely possible that
the new set of mems has an abundance of free memory or is completely
depleted. So there's no strict need for consistency between the set of
allowed nodes during reclaim and the subsequent allocation attempt. All we
care about is that reclaim has a consistent set of allowed nodes to
determine whether it's making progress or not.
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-23 7:49 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Tue, 22 Nov 2011 22:25:46 -0800 (pst), David Rientjes wrote:
> On Wed, 23 Nov 2011, Miao Xie wrote:
>
>> This is a good idea. But I worry that oom will happen easily, because we do
>> direct reclamation and compact by mems_allowed.
>>
>
> Memory compaction actually iterates through each zone regardless of
> whether it's allowed or not in the current context. Recall that the
> nodemask passed into __alloc_pages_nodemask() is non-NULL only when there
> is a mempolicy that restricts the allocations by MPOL_BIND. That nodemask
> is not protected by get_mems_allowed(), so there's no change in
> compaction's behavior with my patch.
That nodemask is also protected by get_mems_allowed().
> Direct reclaim does, however, require mems_allowed staying constant
> without the risk of early oom as you mentioned. It has its own
> get_mems_allowed(), though, so it doesn't have the opportunity to change
> until returning to the page allocator. It's possible that mems_allowed
> will be different on the next call to get_pages_from_freelist() but we
> don't know anything about that context: it's entirely possible that the
> set of new mems has an abundance of free memory or are completely depleted
> as well. So there's no strict need for consistency between the set of
> allowed nodes during reclaim and the subsequent allocation attempt. All
> we care about is that reclaim has a consistent set of allowed nodes to
> determine whether it's making progress or not.
>
Agree.
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-23 22:26 UTC
To: Miao Xie
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 23 Nov 2011, Miao Xie wrote:
> That nodemask is also protected by get_mems_allowed().
>
So what's the problem with the patch? Can I add your acked-by?
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-24 1:26 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 23 Nov 2011 14:26:40 -0800 (pst), David Rientjes wrote:
> On Wed, 23 Nov 2011, Miao Xie wrote:
>
>> That nodemask is also protected by get_mems_allowed().
>>
>
> So what's the problem with the patch?
Memory compaction will see an empty nodemask and do nothing, and then
we may fail to get several contiguous pages.
> Can I add your acked-by?
Sure, no problem.
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: David Rientjes @ 2011-11-24 1:52 UTC
To: Miao Xie
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Thu, 24 Nov 2011, Miao Xie wrote:
> Memory compaction will see an empty nodemask and do nothing, and then
> we may fail to get several contiguous pages.
>
Where do we rely on a consistent nodemask for memory compaction?
* Re: [patch for-3.2-rc3] cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
From: Miao Xie @ 2011-11-24 2:50 UTC
To: David Rientjes
Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, Paul Menage,
linux-kernel, linux-mm
On Wed, 23 Nov 2011 17:52:08 -0800 (pst), David Rientjes wrote:
> On Thu, 24 Nov 2011, Miao Xie wrote:
>
>> Memory compaction will see an empty nodemask and do nothing, and then
>> we may fail to get several contiguous pages.
>>
>
> Where do we rely on a consistent nodemask for memory compaction?
Maybe I am overcautious.