* [patch] cpusets: avoid looping when storing to mems_allowed if one node remains set
@ 2011-09-09 10:15 David Rientjes
2011-09-13 4:20 ` Miao Xie
0 siblings, 1 reply; 3+ messages in thread
From: David Rientjes @ 2011-09-09 10:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Miao Xie, KOSAKI Motohiro, Nick Piggin, Paul Menage, linux-kernel,
linux-mm
{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.
This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself. It is not atypical to see writes to cpuset.mems
take over 2 seconds to complete, for example. In low memory conditions,
this is problematic because it's one of the most important times to change
cpuset.mems in the first place!
The only way a task's set of allowable nodes may change is through
cpusets by writing to cpuset.mems and when attaching a task to a
different cpuset. This is done by setting all new nodes, ensuring
generic code is not reading the nodemask with get_mems_allowed() at the
same time, and then clearing all the old nodes. This prevents the
possibility that a reader will see an empty nodemask at the same time the
writer is storing a new nodemask.
If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes without waiting for
readers. Changing a task's nodemask is protected by cgroup_mutex, so two
threads cannot change the same task's nodemask at the same time; each
store is therefore complete before another thread changes the mask and
tests whether a node remains set.
Signed-off-by: David Rientjes <rientjes@google.com>
---
kernel/cpuset.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -949,6 +949,8 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
static void cpuset_change_task_nodemask(struct task_struct *tsk,
nodemask_t *newmems)
{
+ bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
+
repeat:
/*
* Allow tasks that have access to memory reserves because they have
@@ -963,7 +965,6 @@ repeat:
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
-
/*
* ensure checking ->mems_allowed_change_disable after setting all new
* allowed nodes.
@@ -980,9 +981,11 @@ repeat:
/*
* Allocation of memory is very fast, we needn't sleep when waiting
- * for the read-side.
+ * for the read-side. No wait is necessary, however, if at least one
+ * node remains unchanged.
*/
- while (ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
+ while (masks_disjoint &&
+ ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
task_unlock(tsk);
if (!task_curr(tsk))
yield();
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [patch] cpusets: avoid looping when storing to mems_allowed if one node remains set
2011-09-09 10:15 [patch] cpusets: avoid looping when storing to mems_allowed if one node remains set David Rientjes
@ 2011-09-13 4:20 ` Miao Xie
2011-09-13 22:06 ` David Rientjes
0 siblings, 1 reply; 3+ messages in thread
From: Miao Xie @ 2011-09-13 4:20 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, KOSAKI Motohiro, Nick Piggin, Paul Menage,
linux-kernel, linux-mm
On Fri, 9 Sep 2011 03:15:17 -0700 (PDT), David Rientjes wrote:
> {get,put}_mems_allowed() exist so that general kernel code may locklessly
> access a task's set of allowable nodes without having the chance that a
> concurrent write will cause the nodemask to be empty on configurations
> where MAX_NUMNODES > BITS_PER_LONG.
>
> This could incur a significant delay, however, especially in low memory
> conditions because the page allocator is blocking and reclaim requires
> get_mems_allowed() itself. It is not atypical to see writes to cpuset.mems
> take over 2 seconds to complete, for example. In low memory conditions,
> this is problematic because it's one of the most important times to change
> cpuset.mems in the first place!
>
> The only way a task's set of allowable nodes may change is through
> cpusets by writing to cpuset.mems and when attaching a task to a
> different cpuset. This is done by setting all new nodes, ensuring
> generic code is not reading the nodemask with get_mems_allowed() at the
> same time, and then clearing all the old nodes. This prevents the
> possibility that a reader will see an empty nodemask at the same time the
> writer is storing a new nodemask.
>
> If at least one node remains unchanged, though, it's possible to simply
> set all new nodes and then clear all the old nodes without waiting for
> readers. Changing a task's nodemask is protected by cgroup_mutex, so two
> threads cannot change the same task's nodemask at the same time; each
> store is therefore complete before another thread changes the mask and
> tests whether a node remains set.
This patch is dangerous if the task has a bind memory policy that was set
with neither MPOL_F_STATIC_NODES nor MPOL_F_RELATIVE_NODES, because the
memory policy uses node_remap() to rebind the allowed nodes, and node_remap()
may make the old mask and the new mask nonoverlapping. Under that condition,
the task may still see an empty nodemask.
Thanks
Miao
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
> kernel/cpuset.c | 9 ++++++---
> 1 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -949,6 +949,8 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
> static void cpuset_change_task_nodemask(struct task_struct *tsk,
> nodemask_t *newmems)
> {
> + bool masks_disjoint = !nodes_intersects(*newmems, tsk->mems_allowed);
> +
> repeat:
> /*
> * Allow tasks that have access to memory reserves because they have
> @@ -963,7 +965,6 @@ repeat:
> nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
>
> -
> /*
> * ensure checking ->mems_allowed_change_disable after setting all new
> * allowed nodes.
> @@ -980,9 +981,11 @@ repeat:
>
> /*
> * Allocation of memory is very fast, we needn't sleep when waiting
> - * for the read-side.
> + * for the read-side. No wait is necessary, however, if at least one
> + * node remains unchanged.
> */
> - while (ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
> + while (masks_disjoint &&
> + ACCESS_ONCE(tsk->mems_allowed_change_disable)) {
> task_unlock(tsk);
> if (!task_curr(tsk))
> yield();
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
* Re: [patch] cpusets: avoid looping when storing to mems_allowed if one node remains set
2011-09-13 4:20 ` Miao Xie
@ 2011-09-13 22:06 ` David Rientjes
0 siblings, 0 replies; 3+ messages in thread
From: David Rientjes @ 2011-09-13 22:06 UTC (permalink / raw)
To: Miao Xie
Cc: Andrew Morton, KOSAKI Motohiro, Nick Piggin, Paul Menage,
linux-kernel, linux-mm
On Tue, 13 Sep 2011, Miao Xie wrote:
> This patch is dangerous if the task has a bind memory policy that was set
> to be neither MPOL_F_STATIC_NODES nor MPOL_F_RELATIVE_NODES, because the
> memory policy use node_remap() to rebind the allowed nodes, but node_remap()
> may make the old mask and the new mask nonoverlapping. So at this condition,
> the task may also see an empty node mask.
>
The vast majority of cpuset users are not going to have mempolicies at
all; the cpuset itself is the only policy they need to take advantage of
the NUMA locality of their machine. I'd be fine with checking for
!tsk->mempolicy in this exception as well, since we already hold
task_lock(tsk), but I think the real fix would be to make sure that an
empty nodemask is never returned by mempolicies. Something like ensuring
that if the preferred node is MAX_NUMNODES (since it is determined by
using first_node() over a possibly racing empty nodemask), the first
online node is returned during the race, and that
node_states[N_HIGH_MEMORY] is returned if an MPOL_BIND or MPOL_INTERLEAVE
mask is empty. Thoughts?
--