* [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
@ 2013-06-05 9:15 ` Li Zefan
2013-06-05 9:15 ` [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}() Li Zefan
` (8 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
task_cs() will never return NULL.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 64b3f79..f0c884a 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2253,8 +2253,7 @@ void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
rcu_read_lock();
cs = task_cs(tsk);
- if (cs)
- do_set_cpus_allowed(tsk, cs->cpus_allowed);
+ do_set_cpus_allowed(tsk, cs->cpus_allowed);
rcu_read_unlock();
/*
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}()
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-05 9:15 ` [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback() Li Zefan
@ 2013-06-05 9:15 ` Li Zefan
2013-06-05 9:15 ` [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach() Li Zefan
` (7 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: LKML, Cgroups, Containers
- We never pass a NULL @cs to these functions.
- The top cpuset always has some online cpus/mems.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 29 +++++++----------------------
1 file changed, 7 insertions(+), 22 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f0c884a..d753837 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -304,53 +304,38 @@ static struct file_system_type cpuset_fs_type = {
/*
* Return in pmask the portion of a cpusets's cpus_allowed that
* are online. If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online cpus. If we get
- * all the way to the top and still haven't found any online cpus,
- * return cpu_online_mask. Or if passed a NULL cs from an exit'ing
- * task, return cpu_online_mask.
+ * until we find one that does have some online cpus. The top
+ * cpuset always has some cpus online.
*
* One way or another, we guarantee to return some non-empty subset
* of cpu_online_mask.
*
* Call with callback_mutex held.
*/
-
static void guarantee_online_cpus(const struct cpuset *cs,
struct cpumask *pmask)
{
- while (cs && !cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
+ while (!cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
cs = parent_cs(cs);
- if (cs)
- cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
- else
- cpumask_copy(pmask, cpu_online_mask);
- BUG_ON(!cpumask_intersects(pmask, cpu_online_mask));
+ cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
}
/*
* Return in *pmask the portion of a cpusets's mems_allowed that
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
- * online mems. If we get all the way to the top and still haven't
- * found any online mems, return node_states[N_MEMORY].
+ * online mems. The top cpuset always has some mems online.
*
* One way or another, we guarantee to return some non-empty subset
* of node_states[N_MEMORY].
*
* Call with callback_mutex held.
*/
-
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
- while (cs && !nodes_intersects(cs->mems_allowed,
- node_states[N_MEMORY]))
+ while (!nodes_intersects(cs->mems_allowed, node_states[N_MEMORY]))
cs = parent_cs(cs);
- if (cs)
- nodes_and(*pmask, cs->mems_allowed,
- node_states[N_MEMORY]);
- else
- *pmask = node_states[N_MEMORY];
- BUG_ON(!nodes_intersects(*pmask, node_states[N_MEMORY]));
+ nodes_and(*pmask, cs->mems_allowed, node_states[N_MEMORY]);
}
/*
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach()
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-05 9:15 ` [PATCH v2 01/10] cpuset: remove redundant check in cpuset_cpus_allowed_fallback() Li Zefan
2013-06-05 9:15 ` [PATCH v2 02/10] cpuset: cleanup guarantee_online_{cpus|mems}() Li Zefan
@ 2013-06-05 9:15 ` Li Zefan
2013-06-05 9:15 ` [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask() Li Zefan
` (6 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: LKML, Cgroups, Containers
We can just use oldcs->mems_allowed.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d753837..dbef832 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1407,8 +1407,7 @@ static cpumask_var_t cpus_attach;
static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
{
- /* static bufs protected by cpuset_mutex */
- static nodemask_t cpuset_attach_nodemask_from;
+ /* static buf protected by cpuset_mutex */
static nodemask_t cpuset_attach_nodemask_to;
struct mm_struct *mm;
struct task_struct *task;
@@ -1442,13 +1441,12 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
* Change mm, possibly for multiple threads in a threadgroup. This is
* expensive and may sleep.
*/
- cpuset_attach_nodemask_from = oldcs->mems_allowed;
cpuset_attach_nodemask_to = cs->mems_allowed;
mm = get_task_mm(leader);
if (mm) {
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
if (is_memory_migrate(cs))
- cpuset_migrate_mm(mm, &cpuset_attach_nodemask_from,
+ cpuset_migrate_mm(mm, &oldcs->mems_allowed,
&cpuset_attach_nodemask_to);
mmput(mm);
}
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask()
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (2 preceding siblings ...)
2013-06-05 9:15 ` [PATCH v2 03/10] cpuset: remove unnecessary variable in cpuset_attach() Li Zefan
@ 2013-06-05 9:15 ` Li Zefan
2013-06-05 9:15 ` [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit Li Zefan
` (5 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
The test is done in set_cpus_allowed_ptr(), so it's redundant.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 19 +------------------
1 file changed, 1 insertion(+), 18 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index dbef832..51f8e1d 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -784,23 +784,6 @@ void rebuild_sched_domains(void)
}
/**
- * cpuset_test_cpumask - test a task's cpus_allowed versus its cpuset's
- * @tsk: task to test
- * @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
- *
- * Call with cpuset_mutex held. May take callback_mutex during call.
- * Called for each task in a cgroup by cgroup_scan_tasks().
- * Return nonzero if this tasks's cpus_allowed mask should be changed (in other
- * words, if its mask is not equal to its cpuset's mask).
- */
-static int cpuset_test_cpumask(struct task_struct *tsk,
- struct cgroup_scanner *scan)
-{
- return !cpumask_equal(&tsk->cpus_allowed,
- (cgroup_cs(scan->cg))->cpus_allowed);
-}
-
-/**
* cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
* @tsk: task to test
* @scan: struct cgroup_scanner containing the cgroup of the task
@@ -835,7 +818,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
struct cgroup_scanner scan;
scan.cg = cs->css.cgroup;
- scan.test_task = cpuset_test_cpumask;
+ scan.test_task = NULL;
scan.process_task = cpuset_change_cpumask;
scan.heap = heap;
cgroup_scan_tasks(&scan);
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (3 preceding siblings ...)
2013-06-05 9:15 ` [PATCH v2 04/10] cpuset: remove cpuset_test_cpumask() Li Zefan
@ 2013-06-05 9:15 ` Li Zefan
[not found] ` <51AF01CF.1070706-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-05 9:16 ` [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset Li Zefan
` (4 subsequent siblings)
9 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
Check if cpus_allowed is to be changed before calling validate_change().
This won't change any behavior, but later it will allow us to do this:
# mkdir /cpuset/child
# echo $$ > /cpuset/child/tasks /* empty cpuset */
# echo > /cpuset/child/cpuset.cpus /* do nothing, won't fail */
Without this patch, the last operation will fail.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 51f8e1d..535dce6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -856,14 +856,15 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
return -EINVAL;
}
- retval = validate_change(cs, trialcs);
- if (retval < 0)
- return retval;
/* Nothing to do if the cpus didn't change */
if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
return 0;
+ retval = validate_change(cs, trialcs);
+ if (retval < 0)
+ return retval;
+
retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
if (retval)
return retval;
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (4 preceding siblings ...)
2013-06-05 9:15 ` [PATCH v2 05/10] cpuset: re-structure update_cpumask() a bit Li Zefan
@ 2013-06-05 9:16 ` Li Zefan
[not found] ` <51AF01E8.8080205-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-05 9:16 ` [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset() Li Zefan
` (3 subsequent siblings)
9 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:16 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
When we update a cpuset's mems_allowed and thus update tasks'
mems_allowed, it's required to pass the old mems_allowed and new
mems_allowed to cpuset_migrate_mm().
Currently we save old mems_allowed in a temp local variable before
changing cpuset->mems_allowed. This patch changes it by saving
old mems_allowed in cpuset->old_mems_allowed.
This currently won't change any behavior, but it will later allow
us to keep tasks in empty cpusets.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 62 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 36 insertions(+), 26 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 535dce6..b848505 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -87,6 +87,18 @@ struct cpuset {
cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
+ /*
+ * This is old Memory Nodes tasks took on.
+ *
+ * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
+ * - A new cpuset's old_mems_allowed is initialized when some
+ * task is moved into it.
+ * - old_mems_allowed is used in cpuset_migrate_mm() when we change
+ * cpuset.mems_allowed and have tasks' nodemask updated, and
+ * then old_mems_allowed is updated to mems_allowed.
+ */
+ nodemask_t old_mems_allowed;
+
struct fmeter fmeter; /* memory_pressure filter */
/*
@@ -976,16 +988,12 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
static void cpuset_change_nodemask(struct task_struct *p,
struct cgroup_scanner *scan)
{
+ struct cpuset *cs = cgroup_cs(scan->cg);
struct mm_struct *mm;
- struct cpuset *cs;
int migrate;
- const nodemask_t *oldmem = scan->data;
- static nodemask_t newmems; /* protected by cpuset_mutex */
-
- cs = cgroup_cs(scan->cg);
- guarantee_online_mems(cs, &newmems);
+ nodemask_t *newmems = scan->data;
- cpuset_change_task_nodemask(p, &newmems);
+ cpuset_change_task_nodemask(p, newmems);
mm = get_task_mm(p);
if (!mm)
@@ -995,7 +1003,7 @@ static void cpuset_change_nodemask(struct task_struct *p,
mpol_rebind_mm(mm, &cs->mems_allowed);
if (migrate)
- cpuset_migrate_mm(mm, oldmem, &cs->mems_allowed);
+ cpuset_migrate_mm(mm, &cs->old_mems_allowed, newmems);
mmput(mm);
}
@@ -1004,25 +1012,26 @@ static void *cpuset_being_rebound;
/**
* update_tasks_nodemask - Update the nodemasks of tasks in the cpuset.
* @cs: the cpuset in which each task's mems_allowed mask needs to be changed
- * @oldmem: old mems_allowed of cpuset cs
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
*
* Called with cpuset_mutex held
* No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
* if @heap != NULL.
*/
-static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
- struct ptr_heap *heap)
+static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
{
+ static nodemask_t newmems; /* protected by cpuset_mutex */
struct cgroup_scanner scan;
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
+ guarantee_online_mems(cs, &newmems);
+
scan.cg = cs->css.cgroup;
scan.test_task = NULL;
scan.process_task = cpuset_change_nodemask;
scan.heap = heap;
- scan.data = (nodemask_t *)oldmem;
+ scan.data = &newmems;
/*
* The mpol_rebind_mm() call takes mmap_sem, which we couldn't
@@ -1036,6 +1045,12 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
*/
cgroup_scan_tasks(&scan);
+ /*
+ * All the tasks' nodemasks have been updated, update
+ * cs->old_mems_allowed.
+ */
+ cs->old_mems_allowed = newmems;
+
/* We're done rebinding vmas to this cpuset's new mems_allowed. */
cpuset_being_rebound = NULL;
}
@@ -1056,13 +1071,9 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
const char *buf)
{
- NODEMASK_ALLOC(nodemask_t, oldmem, GFP_KERNEL);
int retval;
struct ptr_heap heap;
- if (!oldmem)
- return -ENOMEM;
-
/*
* top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
* it's read-only
@@ -1091,8 +1102,8 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;
}
}
- *oldmem = cs->mems_allowed;
- if (nodes_equal(*oldmem, trialcs->mems_allowed)) {
+
+ if (nodes_equal(cs->mems_allowed, trialcs->mems_allowed)) {
retval = 0; /* Too easy - nothing to do */
goto done;
}
@@ -1108,11 +1119,10 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
cs->mems_allowed = trialcs->mems_allowed;
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(cs, oldmem, &heap);
+ update_tasks_nodemask(cs, &heap);
heap_free(&heap);
done:
- NODEMASK_FREE(oldmem);
return retval;
}
@@ -1425,7 +1435,6 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
* Change mm, possibly for multiple threads in a threadgroup. This is
* expensive and may sleep.
*/
- cpuset_attach_nodemask_to = cs->mems_allowed;
mm = get_task_mm(leader);
if (mm) {
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
@@ -1435,6 +1444,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
mmput(mm);
}
+ cs->old_mems_allowed = cpuset_attach_nodemask_to;
+
cs->attach_in_progress--;
/*
@@ -2001,7 +2012,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
{
static cpumask_t off_cpus;
- static nodemask_t off_mems, tmp_mems;
+ static nodemask_t off_mems;
struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
bool is_empty;
@@ -2020,11 +2031,10 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
/* remove offline mems from @cs */
if (!nodes_empty(off_mems)) {
- tmp_mems = cs->mems_allowed;
mutex_lock(&callback_mutex);
nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(cs, &tmp_mems, NULL);
+ update_tasks_nodemask(cs, NULL);
}
is_empty = cpumask_empty(cs->cpus_allowed) ||
@@ -2116,11 +2126,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
/* synchronize mems_allowed to N_MEMORY */
if (mems_updated) {
- tmp_mems = top_cpuset.mems_allowed;
mutex_lock(&callback_mutex);
top_cpuset.mems_allowed = new_mems;
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(&top_cpuset, &tmp_mems, NULL);
+ update_tasks_nodemask(&top_cpuset, NULL);
}
/* if cpus or mems went down, we need to propagate to descendants */
@@ -2186,6 +2195,7 @@ void __init cpuset_init_smp(void)
{
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
top_cpuset.mems_allowed = node_states[N_MEMORY];
+ top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset()
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (5 preceding siblings ...)
2013-06-05 9:16 ` [PATCH v2 06/10] cpuset: record old_mems_allowed in struct cpuset Li Zefan
@ 2013-06-05 9:16 ` Li Zefan
2013-06-05 9:16 ` [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets Li Zefan
` (2 subsequent siblings)
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:16 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
effective_cpumask_cpuset() returns an ancestor cpuset which has
non-empty cpumask.
If a cpuset is empty and the tasks in it need to update their
cpus_allowed, they take on the ancestor cpuset's cpumask.
This currently won't change any behavior, but it will later allow us
to keep tasks in empty cpusets.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 65 insertions(+), 11 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b848505..5252f94 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -795,6 +795,45 @@ void rebuild_sched_domains(void)
mutex_unlock(&cpuset_mutex);
}
+/*
+ * effective_cpumask_cpuset - return nearest ancestor with non-empty cpus
+ * @cs: the cpuset in interest
+ *
+ * A cpuset's effective cpumask is the cpumask of the nearest ancestor
+ * with non-empty cpus. We use effective cpumask whenever:
+ * - we update tasks' cpus_allowed. (they take on the ancestor's cpumask
+ * if the cpuset they reside in has no cpus)
+ * - we want to retrieve task_cs(tsk)'s cpus_allowed.
+ *
+ * Called with cpuset_mutex held. cpuset_cpus_allowed_fallback() is an
+ * exception. See comments there.
+ */
+static struct cpuset *effective_cpumask_cpuset(struct cpuset *cs)
+{
+ while (cpumask_empty(cs->cpus_allowed))
+ cs = parent_cs(cs);
+ return cs;
+}
+
+/*
+ * effective_nodemask_cpuset - return nearest ancestor with non-empty mems
+ * @cs: the cpuset in interest
+ *
+ * A cpuset's effective nodemask is the nodemask of the nearest ancestor
+ * with non-empty memss. We use effective nodemask whenever:
+ * - we update tasks' mems_allowed. (they take on the ancestor's nodemask
+ * if the cpuset they reside in has no mems)
+ * - we want to retrieve task_cs(tsk)'s mems_allowed.
+ *
+ * Called with cpuset_mutex held.
+ */
+static struct cpuset *effective_nodemask_cpuset(struct cpuset *cs)
+{
+ while (nodes_empty(cs->mems_allowed))
+ cs = parent_cs(cs);
+ return cs;
+}
+
/**
* cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
* @tsk: task to test
@@ -809,7 +848,10 @@ void rebuild_sched_domains(void)
static void cpuset_change_cpumask(struct task_struct *tsk,
struct cgroup_scanner *scan)
{
- set_cpus_allowed_ptr(tsk, ((cgroup_cs(scan->cg))->cpus_allowed));
+ struct cpuset *cpus_cs;
+
+ cpus_cs = effective_cpumask_cpuset(cgroup_cs(scan->cg));
+ set_cpus_allowed_ptr(tsk, cpus_cs->cpus_allowed);
}
/**
@@ -924,12 +966,14 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
const nodemask_t *to)
{
struct task_struct *tsk = current;
+ struct cpuset *mems_cs;
tsk->mems_allowed = *to;
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
- guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
+ mems_cs = effective_nodemask_cpuset(task_cs(tsk));
+ guarantee_online_mems(mems_cs, &tsk->mems_allowed);
}
/*
@@ -1022,10 +1066,11 @@ static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
{
static nodemask_t newmems; /* protected by cpuset_mutex */
struct cgroup_scanner scan;
+ struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
- guarantee_online_mems(cs, &newmems);
+ guarantee_online_mems(mems_cs, &newmems);
scan.cg = cs->css.cgroup;
scan.test_task = NULL;
@@ -1409,6 +1454,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *oldcs = cgroup_cs(oldcgrp);
+ struct cpuset *cpus_cs = effective_cpumask_cpuset(cs);
+ struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
mutex_lock(&cpuset_mutex);
@@ -1416,9 +1463,9 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
if (cs == &top_cpuset)
cpumask_copy(cpus_attach, cpu_possible_mask);
else
- guarantee_online_cpus(cs, cpus_attach);
+ guarantee_online_cpus(cpus_cs, cpus_attach);
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ guarantee_online_mems(mems_cs, &cpuset_attach_nodemask_to);
cgroup_taskset_for_each(task, cgrp, tset) {
/*
@@ -1437,9 +1484,11 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
*/
mm = get_task_mm(leader);
if (mm) {
+ struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
+
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
if (is_memory_migrate(cs))
- cpuset_migrate_mm(mm, &oldcs->mems_allowed,
+ cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
&cpuset_attach_nodemask_to);
mmput(mm);
}
@@ -2217,20 +2266,23 @@ void __init cpuset_init_smp(void)
void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
{
+ struct cpuset *cpus_cs;
+
mutex_lock(&callback_mutex);
task_lock(tsk);
- guarantee_online_cpus(task_cs(tsk), pmask);
+ cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
+ guarantee_online_cpus(cpus_cs, pmask);
task_unlock(tsk);
mutex_unlock(&callback_mutex);
}
void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
{
- const struct cpuset *cs;
+ const struct cpuset *cpus_cs;
rcu_read_lock();
- cs = task_cs(tsk);
- do_set_cpus_allowed(tsk, cs->cpus_allowed);
+ cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
+ do_set_cpus_allowed(tsk, cpus_cs->cpus_allowed);
rcu_read_unlock();
/*
@@ -2269,11 +2321,13 @@ void cpuset_init_current_mems_allowed(void)
nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
{
+ struct cpuset *mems_cs;
nodemask_t mask;
mutex_lock(&callback_mutex);
task_lock(tsk);
- guarantee_online_mems(task_cs(tsk), &mask);
+ mems_cs = effective_nodemask_cpuset(task_cs(tsk));
+ guarantee_online_mems(mems_cs, &mask);
task_unlock(tsk);
mutex_unlock(&callback_mutex);
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (6 preceding siblings ...)
2013-06-05 9:16 ` [PATCH v2 07/10] cpuset: introduce effective_{cpumask|nodemask}_cpuset() Li Zefan
@ 2013-06-05 9:16 ` Li Zefan
[not found] ` <51AF020B.3030701-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
2013-06-05 9:17 ` [PATCH v2 09/10] cpuset: allow to move tasks to " Li Zefan
2013-06-05 9:17 ` [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case Li Zefan
9 siblings, 1 reply; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:16 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
To achieve this:
- We call update_tasks_cpumask/nodemask() for empty cpusets when
hotplug happens, instead of moving tasks out of them.
- When a cpuset's masks are changed by writing cpuset.cpus/mems,
we also update tasks in child cpusets which are empty.
v2:
- drop rcu_read_lock before calling update_task_nodemask() and
update_task_cpumask(), instead of using workqueue.
- add documentation in include/linux/cgroup.h
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
include/linux/cgroup.h | 4 ++
kernel/cpuset.c | 137 +++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 120 insertions(+), 21 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d0ad379..53e81a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -277,6 +277,10 @@ enum {
*
* - Remount is disallowed.
*
+ * - cpuset: tasks will be kept in empty cpusets when hotplug happens
+ * and take masks of ancestors with non-empty cpus/mems, instead of
+ * being moved to an ancestor.
+ *
* - memcg: use_hierarchy is on by default and the cgroup file for
* the flag is not created.
*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5252f94..3b93098 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -878,6 +878,49 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
cgroup_scan_tasks(&scan);
}
+/*
+ * update_tasks_cpumask_hier - Update the cpumasks of tasks in the hierarchy.
+ * @root_cs: the root cpuset of the hierarchy
+ * @update_root: update root cpuset or not?
+ * @heap: the heap used by cgroup_scan_tasks()
+ *
+ * This will update cpumasks of tasks in @root_cs and all other empty cpusets
+ * which take on cpumask of @root_cs.
+ *
+ * Called with cpuset_mutex held
+ */
+static void update_tasks_cpumask_hier(struct cpuset *root_cs,
+ bool update_root, struct ptr_heap *heap)
+{
+ struct cpuset *cp;
+ struct cgroup *pos_cgrp;
+
+ if (update_root)
+ update_tasks_cpumask(root_cs, heap);
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
+ /* skip the whole subtree if @cp have some CPU */
+ if (!cpumask_empty(cp->cpus_allowed)) {
+ pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
+ continue;
+ }
+ if (!css_tryget(&cp->css))
+ continue;
+ rcu_read_unlock();
+
+ update_tasks_cpumask(cp, heap);
+
+ rcu_read_lock();
+ /*
+ * We should acquire rcu_read_lock first, so it's still
+ * valid to access @cp after dropping the css refcnt.
+ */
+ css_put(&cp->css);
+ }
+ rcu_read_unlock();
+}
+
/**
* update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
* @cs: the cpuset to consider
@@ -929,11 +972,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
mutex_unlock(&callback_mutex);
- /*
- * Scan tasks in the cpuset, and update the cpumasks of any
- * that need an update.
- */
- update_tasks_cpumask(cs, &heap);
+ update_tasks_cpumask_hier(cs, true, &heap);
heap_free(&heap);
@@ -1101,6 +1140,49 @@ static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
}
/*
+ * update_tasks_nodemask_hier - Update the nodemasks of tasks in the hierarchy.
+ * @cs: the root cpuset of the hierarchy
+ * @update_root: update the root cpuset or not?
+ * @heap: the heap used by cgroup_scan_tasks()
+ *
+ * This will update nodemasks of tasks in @root_cs and all other empty cpusets
+ * which take on nodemask of @root_cs.
+ *
+ * Called with cpuset_mutex held
+ */
+static void update_tasks_nodemask_hier(struct cpuset *root_cs,
+ bool update_root, struct ptr_heap *heap)
+{
+ struct cpuset *cp;
+ struct cgroup *pos_cgrp;
+
+ if (update_root)
+ update_tasks_nodemask(root_cs, heap);
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
+ /* skip the whole subtree if @cp have some CPU */
+ if (!nodes_empty(cp->mems_allowed)) {
+ pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
+ continue;
+ }
+ if (!css_tryget(&cp->css))
+ continue;
+ rcu_read_unlock();
+
+ update_tasks_nodemask(cp, heap);
+
+ rcu_read_lock();
+ /*
+ * We should acquire rcu_read_lock first, so it's still
+ * valid to access @cp after dropping the css refcnt.
+ */
+ css_put(&cp->css);
+ }
+ rcu_read_unlock();
+}
+
+/*
* Handle user request to change the 'mems' memory placement
* of a cpuset. Needs to validate the request, update the
* cpusets mems_allowed, and for each task in the cpuset,
@@ -1164,7 +1246,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
cs->mems_allowed = trialcs->mems_allowed;
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(cs, &heap);
+ update_tasks_nodemask_hier(cs, true, &heap);
heap_free(&heap);
done:
@@ -2064,27 +2146,36 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
static nodemask_t off_mems;
struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
bool is_empty;
+ bool sane = cgroup_sane_behavior(cs->css.cgroup);
mutex_lock(&cpuset_mutex);
cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
- /* remove offline cpus from @cs */
- if (!cpumask_empty(&off_cpus)) {
- mutex_lock(&callback_mutex);
- cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
- mutex_unlock(&callback_mutex);
+ mutex_lock(&callback_mutex);
+ cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+ mutex_unlock(&callback_mutex);
+
+ /*
+ * If sane_behavior flag is set, we need to update tasks' cpumask
+ * for empty cpuset to take on ancestor's cpumask
+ */
+ if ((sane && cpumask_empty(cs->cpus_allowed)) ||
+ !cpumask_empty(&off_cpus))
update_tasks_cpumask(cs, NULL);
- }
- /* remove offline mems from @cs */
- if (!nodes_empty(off_mems)) {
- mutex_lock(&callback_mutex);
- nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
- mutex_unlock(&callback_mutex);
+ mutex_lock(&callback_mutex);
+ nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
+ mutex_unlock(&callback_mutex);
+
+ /*
+ * If sane_behavior flag is set, we need to update tasks' nodemask
+ * for empty cpuset to take on ancestor's nodemask
+ */
+ if ((sane && nodes_empty(cs->mems_allowed)) ||
+ !nodes_empty(off_mems))
update_tasks_nodemask(cs, NULL);
- }
is_empty = cpumask_empty(cs->cpus_allowed) ||
nodes_empty(cs->mems_allowed);
@@ -2092,11 +2183,13 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
mutex_unlock(&cpuset_mutex);
/*
- * If @cs became empty, move tasks to the nearest ancestor with
- * execution resources. This is full cgroup operation which will
+ * If sane_behavior flag is set, we'll keep tasks in empty cpusets.
+ *
+ * Otherwise move tasks to the nearest ancestor with execution
+ * resources. This is full cgroup operation which will
* also call back into cpuset. Should be done outside any lock.
*/
- if (is_empty)
+ if (!sane && is_empty)
remove_tasks_in_empty_cpuset(cs);
/* the following may free @cs, should be the last operation */
@@ -2171,6 +2264,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
mutex_unlock(&callback_mutex);
/* we don't mess with cpumasks of tasks in top_cpuset */
+ update_tasks_cpumask_hier(&top_cpuset, false, NULL);
}
/* synchronize mems_allowed to N_MEMORY */
@@ -2179,6 +2273,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
top_cpuset.mems_allowed = new_mems;
mutex_unlock(&callback_mutex);
update_tasks_nodemask(&top_cpuset, NULL);
+ update_tasks_nodemask_hier(&top_cpuset, false, NULL);
}
/* if cpus or mems went down, we need to propagate to descendants */
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 09/10] cpuset: allow to move tasks to empty cpusets
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (7 preceding siblings ...)
2013-06-05 9:16 ` [PATCH v2 08/10] cpuset: allow to keep tasks in empty cpusets Li Zefan
@ 2013-06-05 9:17 ` Li Zefan
2013-06-05 9:17 ` [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case Li Zefan
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:17 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.
Now with this patchset if cpuset is mounted with sane_behavior option,
it behaves differently:
- Tasks will be kept in empty cpusets when hotplug happens and take
masks of ancestors with non-empty cpus/mems, instead of being moved to
an ancestor.
- A task can be moved into an empty cpuset, and again it takes masks of
ancestors, so the user can drop a task into a newly created cgroup without
having to do anything for it.
As tasks can reside in empy cpusets, here're some rules:
- They can be moved to another cpuset, regardless it's empty or not.
- Though it takes masks from ancestors, it takes other configs from the
empty cpuset.
- If the ancestors' masks are changed, those tasks will also be updated
to take new masks.
v2: add documentation in include/linux/cgroup.h
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
include/linux/cgroup.h | 3 +++
kernel/cpuset.c | 12 +++++++++---
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 53e81a6..74e8b8e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -281,6 +281,9 @@ enum {
* and take masks of ancestors with non-empty cpus/mems, instead of
* being moved to an ancestor.
*
+ * - cpuset: a task can be moved into an empty cpuset, and again it
+ * takes masks of ancestors.
+ *
* - memcg: use_hierarchy is on by default and the cgroup file for
* the flag is not created.
*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3b93098..9bb6a47 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -483,7 +483,7 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
*/
ret = -ENOSPC;
if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
- (cpumask_empty(trial->cpus_allowed) ||
+ (cpumask_empty(trial->cpus_allowed) &&
nodes_empty(trial->mems_allowed)))
goto out;
@@ -1478,8 +1478,13 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
+ /*
+ * We allow to move tasks into an empty cpuset if sane_behavior
+ * flag is set.
+ */
ret = -ENOSPC;
- if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+ if (!cgroup_sane_behavior(cgrp) &&
+ (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;
cgroup_taskset_for_each(task, cgrp, tset) {
@@ -1584,7 +1589,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
* propagation if @cs doesn't have any CPU or memory. It will move
* the newly added tasks to the nearest parent which can execute.
*/
- if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+ if (!cgroup_sane_behavior(cgrp) &&
+ (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
schedule_cpuset_propagate_hotplug(cs);
mutex_unlock(&cpuset_mutex);
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread* [PATCH v2 10/10] cpuset: fix to migrate mm correctly in a corner case
[not found] ` <51AF0183.8070602-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
` (8 preceding siblings ...)
2013-06-05 9:17 ` [PATCH v2 09/10] cpuset: allow to move tasks to " Li Zefan
@ 2013-06-05 9:17 ` Li Zefan
9 siblings, 0 replies; 17+ messages in thread
From: Li Zefan @ 2013-06-05 9:17 UTC (permalink / raw)
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
Before moving tasks out of empty cpusets, update_tasks_nodemask()
is called, which calls do_migrate_pages(xx, from, to). Then those
tasks are moved to an ancestor, and do_migrate_pages() is called
again.
The first time: from = node_to_be_offlined, to = empty.
The second time: from = empty, to = ancestor's nodemask.
so looks like no pages will be migrated.
Fix this by:
- Don't call update_tasks_nodemask() on empty cpusets.
- Pass cs->old_mems_allowed to do_migrate_pages().
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9bb6a47..de7f6c1 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1574,9 +1574,16 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
- if (is_memory_migrate(cs))
- cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
+ if (is_memory_migrate(cs)) {
+ /*
+ * old_mems_allowed is the same with mems_allowed,
+ * except if this task is being moved automatically
+ * due to hotplug, and in this case mems_allowed is
+ * empty and old_mems_allowed is the offflined node.
+ */
+ cpuset_migrate_mm(mm, &mems_oldcs->old_mems_allowed,
&cpuset_attach_nodemask_to);
+ }
mmput(mm);
}
@@ -2168,7 +2175,7 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
* for empty cpuset to take on ancestor's cpumask
*/
if ((sane && cpumask_empty(cs->cpus_allowed)) ||
- !cpumask_empty(&off_cpus))
+ (!cpumask_empty(&off_cpus) && !cpumask_empty(cs->cpus_allowed)))
update_tasks_cpumask(cs, NULL);
mutex_lock(&callback_mutex);
@@ -2180,7 +2187,7 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
* for empty cpuset to take on ancestor's nodemask
*/
if ((sane && nodes_empty(cs->mems_allowed)) ||
- !nodes_empty(off_mems))
+ (!nodes_empty(off_mems) && !nodes_empty(cs->mems_allowed)))
update_tasks_nodemask(cs, NULL);
is_empty = cpumask_empty(cs->cpus_allowed) ||
--
1.8.0.2
^ permalink raw reply related [flat|nested] 17+ messages in thread