* [PATCH v3 0/7] cpuset: implement sane hierarchy behaviors
From: Li Zefan @ 2013-06-09 9:14 UTC
To: Tejun Heo; +Cc: Cgroups, LKML, Containers
v2 -> v3:
- remove async hotplug propagation work
- do propagation work in one place for both hotplug and unplug
v1 -> v2:
- add documentation in include/linux/cgroup.h
- drop rcu_read_lock() before calling update_tasks_nodemask() when iterating
the cpuset hierarchy
======================================
Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.
Now with this patchset if cpuset is mounted with sane_behavior option, it
behaves differently:
- Tasks will be kept in empty cpusets when hotplug happens and take masks
of ancestors with non-empty cpus/mems, instead of being moved to an ancestor.
- A task can be moved into an empty cpuset, and again it takes masks of
ancestors, so the user can drop a task into a newly created cgroup without
having to do anything for it (a userspace sketch of this is given below).
As tasks can reside in empty cpusets, here are some rules:
- They can be moved to another cpuset, regardless of whether it's empty or not.
- Though such a task takes masks from its ancestors, it takes other configs
from the empty cpuset itself.
- If the ancestors' masks are changed, those tasks will also be updated
to take the new masks.
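To illustrate the intended userspace-visible behavior, here is a minimal
sketch (the mount point, the sane_behavior mount option and the file
names below are assumptions for illustration, not mandated by these
patches):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[16];
		int fd;

		/* A freshly created cpuset starts with empty cpus/mems. */
		mkdir("/sys/fs/cgroup/cpuset/newgroup", 0755);

		/* Attach ourselves without configuring cpus/mems first. */
		fd = open("/sys/fs/cgroup/cpuset/newgroup/tasks", O_WRONLY);
		if (fd < 0)
			return 1;
		snprintf(buf, sizeof(buf), "%d", getpid());

		/* With sane_behavior this succeeds and the task takes the
		   masks of its ancestors; without it, it fails (-ENOSPC). */
		if (write(fd, buf, strlen(buf)) < 0)
			perror("write");
		close(fd);
		return 0;
	}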
Li Zefan (7):
cpuset: let hotplug propagation work wait for task attaching
cpuset: remove async hotplug propagation work
cpuset: record old_mems_allowed in struct cpuset
cpuset: introduce effective_{cpumask|nodemask}_cpuset()
cpuset: allow to keep tasks in empty cpusets
cpuset: allow to move tasks to empty cpusets
cpuset: fix to migrate mm correctly in a corner case
include/linux/cgroup.h | 7 +
kernel/cpuset.c | 390 ++++++++++++++++++++++++++++++++-----------------
2 files changed, 264 insertions(+), 133 deletions(-)
--
1.8.0.2
* [PATCH v3 1/7] cpuset: let hotplug propagation work wait for task attaching
From: Li Zefan @ 2013-06-09 9:14 UTC
To: Tejun Heo; +Cc: Cgroups, LKML, Containers
Instead of triggering propagation work in cpuset_attach(), we make
hotplug propagation work wait until there's no task attaching in
progress.
IMO this is more robust: we won't see empty masks in cpuset_attach().
It is also a preparation for removing the propagation work entirely:
without asynchronous propagation we can't call
move_tasks_in_empty_cpuset() from cpuset_attach(), because we would
deadlock on cgroup_mutex.
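For illustration, here is the handshake reduced to a userspace analogue
(a sketch only, using a pthread condition variable in place of the
kernel wait queue; in the kernel the wait runs outside cpuset_mutex,
which is why the patch below needs a retry loop after re-taking the
mutex):

	#include <pthread.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t attach_wq = PTHREAD_COND_INITIALIZER;
	static int attach_in_progress;

	/* cpuset_attach() tail: the last attacher wakes all waiters */
	void attach_done(void)
	{
		pthread_mutex_lock(&lock);
		if (--attach_in_progress == 0)
			pthread_cond_broadcast(&attach_wq);
		pthread_mutex_unlock(&lock);
	}

	/* hotplug side: never update tasks while an attach is in flight */
	void hotplug_update(void)
	{
		pthread_mutex_lock(&lock);
		while (attach_in_progress)
			pthread_cond_wait(&attach_wq, &lock);
		/* ... safe to update tasks' cpus/mems here ... */
		pthread_mutex_unlock(&lock);
	}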
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 535dce6..5ffcc3b 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -59,6 +59,7 @@
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <linux/cgroup.h>
+#include <linux/wait.h>
/*
* Tracks how many cpusets are currently defined in system.
@@ -275,6 +276,8 @@ static void schedule_cpuset_propagate_hotplug(struct cpuset *cs);
static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);
+static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
+
/*
* This is ugly, but preserves the userspace API for existing cpuset
* users. If someone tries to mount the "cpuset" filesystem, we
@@ -1436,14 +1439,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
}
cs->attach_in_progress--;
-
- /*
- * We may have raced with CPU/memory hotunplug. Trigger hotplug
- * propagation if @cs doesn't have any CPU or memory. It will move
- * the newly added tasks to the nearest parent which can execute.
- */
- if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
- schedule_cpuset_propagate_hotplug(cs);
+ if (!cs->attach_in_progress)
+ wake_up(&cpuset_attach_wq);
mutex_unlock(&cpuset_mutex);
}
@@ -1555,10 +1552,6 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
* resources, wait for the previously scheduled operations before
* proceeding, so that we don't end up keep removing tasks added
* after execution capability is restored.
- *
- * Flushing cpuset_hotplug_work is enough to synchronize against
- * hotplug hanlding; however, cpuset_attach() may schedule
- * propagation work directly. Flush the workqueue too.
*/
flush_work(&cpuset_hotplug_work);
flush_workqueue(cpuset_propagate_hotplug_wq);
@@ -2005,8 +1998,20 @@ static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
bool is_empty;
+retry:
+ wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
+
mutex_lock(&cpuset_mutex);
+ /*
+ * We have raced with task attaching. We wait until attaching
+ * is finished, so we won't attach a task to an empty cpuset.
+ */
+ if (cs->attach_in_progress) {
+ mutex_unlock(&cpuset_mutex);
+ goto retry;
+ }
+
cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
--
1.8.0.2
* [PATCH v3 2/7] cpuset: remove async hotplug propagation work
From: Li Zefan @ 2013-06-09 9:14 UTC
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
As we can drop the rcu read lock while iterating the cgroup hierarchy
(pinning the current cpuset with css_tryget() keeps it valid across
the unlocked section), we don't have to do propagation asynchronously
via a workqueue.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 69 +++++++++++++--------------------------------------------
1 file changed, 16 insertions(+), 53 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 5ffcc3b..2d68d5c 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -101,8 +101,6 @@ struct cpuset {
/* for custom sched domain */
int relax_domain_level;
-
- struct work_struct hotplug_work;
};
/* Retrieve the cpuset for a cgroup */
@@ -268,12 +266,7 @@ static DEFINE_MUTEX(callback_mutex);
/*
* CPU / memory hotplug is handled asynchronously.
*/
-static struct workqueue_struct *cpuset_propagate_hotplug_wq;
-
static void cpuset_hotplug_workfn(struct work_struct *work);
-static void cpuset_propagate_hotplug_workfn(struct work_struct *work);
-static void schedule_cpuset_propagate_hotplug(struct cpuset *cs);
-
static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);
static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
@@ -1554,7 +1547,6 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
* after execution capability is restored.
*/
flush_work(&cpuset_hotplug_work);
- flush_workqueue(cpuset_propagate_hotplug_wq);
mutex_lock(&cpuset_mutex);
if (!is_cpuset_online(cs))
@@ -1821,7 +1813,6 @@ static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cont)
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
fmeter_init(&cs->fmeter);
- INIT_WORK(&cs->hotplug_work, cpuset_propagate_hotplug_workfn);
cs->relax_domain_level = -1;
return &cs->css;
@@ -1984,18 +1975,17 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
}
/**
- * cpuset_propagate_hotplug_workfn - propagate CPU/memory hotplug to a cpuset
+ * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
* @cs: cpuset in interest
*
* Compare @cs's cpu and mem masks against top_cpuset and if some have gone
* offline, update @cs accordingly. If @cs ends up with no CPU or memory,
* all its tasks are moved to the nearest ancestor with both resources.
*/
-static void cpuset_propagate_hotplug_workfn(struct work_struct *work)
+static void cpuset_hotplug_update_tasks(struct cpuset *cs)
{
static cpumask_t off_cpus;
static nodemask_t off_mems, tmp_mems;
- struct cpuset *cs = container_of(work, struct cpuset, hotplug_work);
bool is_empty;
retry:
@@ -2044,34 +2034,6 @@ retry:
*/
if (is_empty)
remove_tasks_in_empty_cpuset(cs);
-
- /* the following may free @cs, should be the last operation */
- css_put(&cs->css);
-}
-
-/**
- * schedule_cpuset_propagate_hotplug - schedule hotplug propagation to a cpuset
- * @cs: cpuset of interest
- *
- * Schedule cpuset_propagate_hotplug_workfn() which will update CPU and
- * memory masks according to top_cpuset.
- */
-static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
-{
- /*
- * Pin @cs. The refcnt will be released when the work item
- * finishes executing.
- */
- if (!css_tryget(&cs->css))
- return;
-
- /*
- * Queue @cs->hotplug_work. If already pending, lose the css ref.
- * cpuset_propagate_hotplug_wq is ordered and propagation will
- * happen in the order this function is called.
- */
- if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
- css_put(&cs->css);
}
/**
@@ -2084,8 +2046,8 @@ static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
* actively using CPU hotplug but making no active use of cpusets.
*
* Non-root cpusets are only affected by offlining. If any CPUs or memory
- * nodes have been taken down, cpuset_propagate_hotplug() is invoked on all
- * descendants.
+ * nodes have been taken down, cpuset_hotplug_update_tasks() is invoked on
+ * all descendants.
*
* Note that CPU offlining during suspend is ignored. We don't modify
* cpusets across suspend/resume cycles at all.
@@ -2128,21 +2090,26 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
update_tasks_nodemask(&top_cpuset, &tmp_mems, NULL);
}
+ mutex_unlock(&cpuset_mutex);
+
/* if cpus or mems went down, we need to propagate to descendants */
if (cpus_offlined || mems_offlined) {
struct cpuset *cs;
struct cgroup *pos_cgrp;
rcu_read_lock();
- cpuset_for_each_descendant_pre(cs, pos_cgrp, &top_cpuset)
- schedule_cpuset_propagate_hotplug(cs);
- rcu_read_unlock();
- }
+ cpuset_for_each_descendant_pre(cs, pos_cgrp, &top_cpuset) {
+ if (!css_tryget(&cs->css))
+ continue;
+ rcu_read_unlock();
- mutex_unlock(&cpuset_mutex);
+ cpuset_hotplug_update_tasks(cs);
- /* wait for propagations to finish */
- flush_workqueue(cpuset_propagate_hotplug_wq);
+ rcu_read_lock();
+ css_put(&cs->css);
+ }
+ rcu_read_unlock();
+ }
/* rebuild sched domains if cpus_allowed has changed */
if (cpus_updated)
@@ -2193,10 +2160,6 @@ void __init cpuset_init_smp(void)
top_cpuset.mems_allowed = node_states[N_MEMORY];
register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
-
- cpuset_propagate_hotplug_wq =
- alloc_ordered_workqueue("cpuset_hotplug", 0);
- BUG_ON(!cpuset_propagate_hotplug_wq);
}
/**
--
1.8.0.2
* [PATCH v3 3/7] cpuset: record old_mems_allowed in struct cpuset
From: Li Zefan @ 2013-06-09 9:15 UTC
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
When we update a cpuset's mems_allowed and thus update tasks'
mems_allowed, it's required to pass the old mems_allowed and new
mems_allowed to cpuset_migrate_mm().
Currently we save old mems_allowed in a temp local variable before
changing cpuset->mems_allowed. This patch changes it by saving
old mems_allowed in cpuset->old_mems_allowed.
This currently won't change any behavior, but it will later allow
us to keep tasks in empty cpusets.
v3: restored "cpuset_attach_nodemask_to = cs->mems_allowed"
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 61 ++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 36 insertions(+), 25 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 2d68d5c..a766b87 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -88,6 +88,18 @@ struct cpuset {
cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
+ /*
+ * This records the old Memory Nodes the tasks took on.
+ *
+ * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
+ * - A new cpuset's old_mems_allowed is initialized when some
+ * task is moved into it.
+ * - old_mems_allowed is used in cpuset_migrate_mm() when we change
+ * cpuset.mems_allowed and have tasks' nodemask updated, and
+ * then old_mems_allowed is updated to mems_allowed.
+ */
+ nodemask_t old_mems_allowed;
+
struct fmeter fmeter; /* memory_pressure filter */
/*
@@ -972,16 +984,12 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
static void cpuset_change_nodemask(struct task_struct *p,
struct cgroup_scanner *scan)
{
+ struct cpuset *cs = cgroup_cs(scan->cg);
struct mm_struct *mm;
- struct cpuset *cs;
int migrate;
- const nodemask_t *oldmem = scan->data;
- static nodemask_t newmems; /* protected by cpuset_mutex */
-
- cs = cgroup_cs(scan->cg);
- guarantee_online_mems(cs, &newmems);
+ nodemask_t *newmems = scan->data;
- cpuset_change_task_nodemask(p, &newmems);
+ cpuset_change_task_nodemask(p, newmems);
mm = get_task_mm(p);
if (!mm)
@@ -991,7 +999,7 @@ static void cpuset_change_nodemask(struct task_struct *p,
mpol_rebind_mm(mm, &cs->mems_allowed);
if (migrate)
- cpuset_migrate_mm(mm, oldmem, &cs->mems_allowed);
+ cpuset_migrate_mm(mm, &cs->old_mems_allowed, newmems);
mmput(mm);
}
@@ -1000,25 +1008,26 @@ static void *cpuset_being_rebound;
/**
* update_tasks_nodemask - Update the nodemasks of tasks in the cpuset.
* @cs: the cpuset in which each task's mems_allowed mask needs to be changed
- * @oldmem: old mems_allowed of cpuset cs
* @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
*
* Called with cpuset_mutex held
* No return value. It's guaranteed that cgroup_scan_tasks() always returns 0
* if @heap != NULL.
*/
-static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
- struct ptr_heap *heap)
+static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
{
+ static nodemask_t newmems; /* protected by cpuset_mutex */
struct cgroup_scanner scan;
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
+ guarantee_online_mems(cs, &newmems);
+
scan.cg = cs->css.cgroup;
scan.test_task = NULL;
scan.process_task = cpuset_change_nodemask;
scan.heap = heap;
- scan.data = (nodemask_t *)oldmem;
+ scan.data = &newmems;
/*
* The mpol_rebind_mm() call takes mmap_sem, which we couldn't
@@ -1032,6 +1041,12 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
*/
cgroup_scan_tasks(&scan);
+ /*
+ * All the tasks' nodemasks have been updated, update
+ * cs->old_mems_allowed.
+ */
+ cs->old_mems_allowed = newmems;
+
/* We're done rebinding vmas to this cpuset's new mems_allowed. */
cpuset_being_rebound = NULL;
}
@@ -1052,13 +1067,9 @@ static void update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem,
static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
const char *buf)
{
- NODEMASK_ALLOC(nodemask_t, oldmem, GFP_KERNEL);
int retval;
struct ptr_heap heap;
- if (!oldmem)
- return -ENOMEM;
-
/*
* top_cpuset.mems_allowed tracks node_stats[N_MEMORY];
* it's read-only
@@ -1087,8 +1098,8 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
goto done;
}
}
- *oldmem = cs->mems_allowed;
- if (nodes_equal(*oldmem, trialcs->mems_allowed)) {
+
+ if (nodes_equal(cs->mems_allowed, trialcs->mems_allowed)) {
retval = 0; /* Too easy - nothing to do */
goto done;
}
@@ -1104,11 +1115,10 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
cs->mems_allowed = trialcs->mems_allowed;
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(cs, oldmem, &heap);
+ update_tasks_nodemask(cs, &heap);
heap_free(&heap);
done:
- NODEMASK_FREE(oldmem);
return retval;
}
@@ -1431,6 +1441,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
mmput(mm);
}
+ cs->old_mems_allowed = cpuset_attach_nodemask_to;
+
cs->attach_in_progress--;
if (!cs->attach_in_progress)
wake_up(&cpuset_attach_wq);
@@ -1985,7 +1997,7 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
static void cpuset_hotplug_update_tasks(struct cpuset *cs)
{
static cpumask_t off_cpus;
- static nodemask_t off_mems, tmp_mems;
+ static nodemask_t off_mems;
bool is_empty;
retry:
@@ -2015,11 +2027,10 @@ retry:
/* remove offline mems from @cs */
if (!nodes_empty(off_mems)) {
- tmp_mems = cs->mems_allowed;
mutex_lock(&callback_mutex);
nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(cs, &tmp_mems, NULL);
+ update_tasks_nodemask(cs, NULL);
}
is_empty = cpumask_empty(cs->cpus_allowed) ||
@@ -2083,11 +2094,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
/* synchronize mems_allowed to N_MEMORY */
if (mems_updated) {
- tmp_mems = top_cpuset.mems_allowed;
mutex_lock(&callback_mutex);
top_cpuset.mems_allowed = new_mems;
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(&top_cpuset, &tmp_mems, NULL);
+ update_tasks_nodemask(&top_cpuset, NULL);
}
mutex_unlock(&cpuset_mutex);
@@ -2158,6 +2168,7 @@ void __init cpuset_init_smp(void)
{
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
top_cpuset.mems_allowed = node_states[N_MEMORY];
+ top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;
register_hotmemory_notifier(&cpuset_track_online_nodes_nb);
}
--
1.8.0.2
* [PATCH v3 4/7] cpuset: introduce effective_{cpumask|nodemask}_cpuset()
From: Li Zefan @ 2013-06-09 9:15 UTC
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
effective_cpumask_cpuset() returns the nearest ancestor cpuset which has
a non-empty cpumask.
If a cpuset is empty and the tasks in it need to update their
cpus_allowed, they take on that ancestor's cpumask. For example, if B
has no cpus and its parent A has cpus 0-3, tasks in B run on cpus 0-3.
This currently won't change any behavior, but it will later allow us
to keep tasks in empty cpusets.
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 65 insertions(+), 11 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index a766b87..aab014f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -791,6 +791,45 @@ void rebuild_sched_domains(void)
mutex_unlock(&cpuset_mutex);
}
+/*
+ * effective_cpumask_cpuset - return nearest ancestor with non-empty cpus
+ * @cs: the cpuset of interest
+ *
+ * A cpuset's effective cpumask is the cpumask of the nearest ancestor
+ * with non-empty cpus. We use effective cpumask whenever:
+ * - we update tasks' cpus_allowed. (they take on the ancestor's cpumask
+ * if the cpuset they reside in has no cpus)
+ * - we want to retrieve task_cs(tsk)'s cpus_allowed.
+ *
+ * Called with cpuset_mutex held. cpuset_cpus_allowed_fallback() is an
+ * exception. See comments there.
+ */
+static struct cpuset *effective_cpumask_cpuset(struct cpuset *cs)
+{
+ while (cpumask_empty(cs->cpus_allowed))
+ cs = parent_cs(cs);
+ return cs;
+}
+
+/*
+ * effective_nodemask_cpuset - return nearest ancestor with non-empty mems
+ * @cs: the cpuset of interest
+ *
+ * A cpuset's effective nodemask is the nodemask of the nearest ancestor
+ * with non-empty mems. We use effective nodemask whenever:
+ * - we update tasks' mems_allowed. (they take on the ancestor's nodemask
+ * if the cpuset they reside in has no mems)
+ * - we want to retrieve task_cs(tsk)'s mems_allowed.
+ *
+ * Called with cpuset_mutex held.
+ */
+static struct cpuset *effective_nodemask_cpuset(struct cpuset *cs)
+{
+ while (nodes_empty(cs->mems_allowed))
+ cs = parent_cs(cs);
+ return cs;
+}
+
/**
* cpuset_change_cpumask - make a task's cpus_allowed the same as its cpuset's
* @tsk: task to test
@@ -805,7 +844,10 @@ void rebuild_sched_domains(void)
static void cpuset_change_cpumask(struct task_struct *tsk,
struct cgroup_scanner *scan)
{
- set_cpus_allowed_ptr(tsk, ((cgroup_cs(scan->cg))->cpus_allowed));
+ struct cpuset *cpus_cs;
+
+ cpus_cs = effective_cpumask_cpuset(cgroup_cs(scan->cg));
+ set_cpus_allowed_ptr(tsk, cpus_cs->cpus_allowed);
}
/**
@@ -920,12 +962,14 @@ static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
const nodemask_t *to)
{
struct task_struct *tsk = current;
+ struct cpuset *mems_cs;
tsk->mems_allowed = *to;
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
- guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
+ mems_cs = effective_nodemask_cpuset(task_cs(tsk));
+ guarantee_online_mems(mems_cs, &tsk->mems_allowed);
}
/*
@@ -1018,10 +1062,11 @@ static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
{
static nodemask_t newmems; /* protected by cpuset_mutex */
struct cgroup_scanner scan;
+ struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
- guarantee_online_mems(cs, &newmems);
+ guarantee_online_mems(mems_cs, &newmems);
scan.cg = cs->css.cgroup;
scan.test_task = NULL;
@@ -1405,6 +1450,8 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
struct cgroup *oldcgrp = cgroup_taskset_cur_cgroup(tset);
struct cpuset *cs = cgroup_cs(cgrp);
struct cpuset *oldcs = cgroup_cs(oldcgrp);
+ struct cpuset *cpus_cs = effective_cpumask_cpuset(cs);
+ struct cpuset *mems_cs = effective_nodemask_cpuset(cs);
mutex_lock(&cpuset_mutex);
@@ -1412,9 +1459,9 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
if (cs == &top_cpuset)
cpumask_copy(cpus_attach, cpu_possible_mask);
else
- guarantee_online_cpus(cs, cpus_attach);
+ guarantee_online_cpus(cpus_cs, cpus_attach);
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
+ guarantee_online_mems(mems_cs, &cpuset_attach_nodemask_to);
cgroup_taskset_for_each(task, cgrp, tset) {
/*
@@ -1434,9 +1481,11 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
cpuset_attach_nodemask_to = cs->mems_allowed;
mm = get_task_mm(leader);
if (mm) {
+ struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
+
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
if (is_memory_migrate(cs))
- cpuset_migrate_mm(mm, &oldcs->mems_allowed,
+ cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
&cpuset_attach_nodemask_to);
mmput(mm);
}
@@ -2186,20 +2235,23 @@ void __init cpuset_init_smp(void)
void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
{
+ struct cpuset *cpus_cs;
+
mutex_lock(&callback_mutex);
task_lock(tsk);
- guarantee_online_cpus(task_cs(tsk), pmask);
+ cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
+ guarantee_online_cpus(cpus_cs, pmask);
task_unlock(tsk);
mutex_unlock(&callback_mutex);
}
void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
{
- const struct cpuset *cs;
+ const struct cpuset *cpus_cs;
rcu_read_lock();
- cs = task_cs(tsk);
- do_set_cpus_allowed(tsk, cs->cpus_allowed);
+ cpus_cs = effective_cpumask_cpuset(task_cs(tsk));
+ do_set_cpus_allowed(tsk, cpus_cs->cpus_allowed);
rcu_read_unlock();
/*
@@ -2238,11 +2290,13 @@ void cpuset_init_current_mems_allowed(void)
nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
{
+ struct cpuset *mems_cs;
nodemask_t mask;
mutex_lock(&callback_mutex);
task_lock(tsk);
- guarantee_online_mems(task_cs(tsk), &mask);
+ mems_cs = effective_nodemask_cpuset(task_cs(tsk));
+ guarantee_online_mems(mems_cs, &mask);
task_unlock(tsk);
mutex_unlock(&callback_mutex);
--
1.8.0.2
* [PATCH v3 5/7] cpuset: allow to keep tasks in empty cpusets
From: Li Zefan @ 2013-06-09 9:16 UTC
To: Tejun Heo; +Cc: Cgroups, LKML, Containers
To achieve this:
- We call update_tasks_cpumask/nodemask() for empty cpusets when
hotplug happens, instead of moving tasks out of them.
- When a cpuset's masks are changed by writing cpuset.cpus/mems,
we also update tasks in child cpusets which are empty, since they
run with the ancestor's masks (a toy sketch follows).
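A toy model of that propagation rule (plain userspace C; the structure
and names are assumptions for illustration, not the kernel's): update
the root, then walk the descendants pre-order, skipping any subtree
whose root has a mask of its own, since its tasks don't take the
changed ancestor's mask.

	#include <stdint.h>
	#include <stdio.h>

	struct cs {
		const char *name;
		uint64_t cpus;			/* 0 == empty cpuset */
		struct cs *child, *sibling;
	};

	static void update_hier(struct cs *root)
	{
		struct cs *c;

		printf("update tasks in %s\n", root->name);
		for (c = root->child; c; c = c->sibling) {
			if (c->cpus)		/* has own cpus: skip subtree */
				continue;
			update_hier(c);		/* empty: takes ancestor's mask */
		}
	}

	int main(void)
	{
		struct cs b = { "B", 0, NULL, NULL };	/* empty child */
		struct cs a = { "A", 0x3, &b, NULL };	/* cpus 0-1 */

		update_hier(&a);	/* updates tasks in A, then in B */
		return 0;
	}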
v3:
- do propagation work in one place for both hotplug and unplug
v2:
- drop rcu_read_lock before calling update_tasks_nodemask() and
update_tasks_cpumask(), instead of using a workqueue.
- add documentation in include/linux/cgroup.h
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
include/linux/cgroup.h | 4 ++
kernel/cpuset.c | 141 ++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 114 insertions(+), 31 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d0ad379..53e81a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -277,6 +277,10 @@ enum {
*
* - Remount is disallowed.
*
+ * - cpuset: tasks will be kept in empty cpusets when hotplug happens
+ * and take masks of ancestors with non-empty cpus/mems, instead of
+ * being moved to an ancestor.
+ *
* - memcg: use_hierarchy is on by default and the cgroup file for
* the flag is not created.
*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index aab014f..fd7572f 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -874,6 +874,45 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
cgroup_scan_tasks(&scan);
}
+/*
+ * update_tasks_cpumask_hier - Update the cpumasks of tasks in the hierarchy.
+ * @root_cs: the root cpuset of the hierarchy
+ * @update_root: update root cpuset or not?
+ * @heap: the heap used by cgroup_scan_tasks()
+ *
+ * This will update cpumasks of tasks in @root_cs and all other empty cpusets
+ * which take on cpumask of @root_cs.
+ *
+ * Called with cpuset_mutex held
+ */
+static void update_tasks_cpumask_hier(struct cpuset *root_cs,
+ bool update_root, struct ptr_heap *heap)
+{
+ struct cpuset *cp;
+ struct cgroup *pos_cgrp;
+
+ if (update_root)
+ update_tasks_cpumask(root_cs, heap);
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
+ /* skip the whole subtree if @cp has some CPUs */
+ if (!cpumask_empty(cp->cpus_allowed)) {
+ pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
+ continue;
+ }
+ if (!css_tryget(&cp->css))
+ continue;
+ rcu_read_unlock();
+
+ update_tasks_cpumask(cp, heap);
+
+ rcu_read_lock();
+ css_put(&cp->css);
+ }
+ rcu_read_unlock();
+}
+
/**
* update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
* @cs: the cpuset to consider
@@ -925,11 +964,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
mutex_unlock(&callback_mutex);
- /*
- * Scan tasks in the cpuset, and update the cpumasks of any
- * that need an update.
- */
- update_tasks_cpumask(cs, &heap);
+ update_tasks_cpumask_hier(cs, true, &heap);
heap_free(&heap);
@@ -1097,6 +1132,45 @@ static void update_tasks_nodemask(struct cpuset *cs, struct ptr_heap *heap)
}
/*
+ * update_tasks_nodemask_hier - Update the nodemasks of tasks in the hierarchy.
+ * @root_cs: the root cpuset of the hierarchy
+ * @update_root: update the root cpuset or not?
+ * @heap: the heap used by cgroup_scan_tasks()
+ *
+ * This will update nodemasks of tasks in @root_cs and all other empty cpusets
+ * which take on nodemask of @root_cs.
+ *
+ * Called with cpuset_mutex held
+ */
+static void update_tasks_nodemask_hier(struct cpuset *root_cs,
+ bool update_root, struct ptr_heap *heap)
+{
+ struct cpuset *cp;
+ struct cgroup *pos_cgrp;
+
+ if (update_root)
+ update_tasks_nodemask(root_cs, heap);
+
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
+ /* skip the whole subtree if @cp has some memory nodes */
+ if (!nodes_empty(cp->mems_allowed)) {
+ pos_cgrp = cgroup_rightmost_descendant(pos_cgrp);
+ continue;
+ }
+ if (!css_tryget(&cp->css))
+ continue;
+ rcu_read_unlock();
+
+ update_tasks_nodemask(cp, heap);
+
+ rcu_read_lock();
+ css_put(&cp->css);
+ }
+ rcu_read_unlock();
+}
+
+/*
* Handle user request to change the 'mems' memory placement
* of a cpuset. Needs to validate the request, update the
* cpusets mems_allowed, and for each task in the cpuset,
@@ -1160,7 +1234,7 @@ static int update_nodemask(struct cpuset *cs, struct cpuset *trialcs,
cs->mems_allowed = trialcs->mems_allowed;
mutex_unlock(&callback_mutex);
- update_tasks_nodemask(cs, &heap);
+ update_tasks_nodemask_hier(cs, true, &heap);
heap_free(&heap);
done:
@@ -2048,6 +2122,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
static cpumask_t off_cpus;
static nodemask_t off_mems;
bool is_empty;
+ bool sane = cgroup_sane_behavior(cs->css.cgroup);
retry:
wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
@@ -2066,21 +2141,29 @@ retry:
cpumask_andnot(&off_cpus, cs->cpus_allowed, top_cpuset.cpus_allowed);
nodes_andnot(off_mems, cs->mems_allowed, top_cpuset.mems_allowed);
- /* remove offline cpus from @cs */
- if (!cpumask_empty(&off_cpus)) {
- mutex_lock(&callback_mutex);
- cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
- mutex_unlock(&callback_mutex);
+ mutex_lock(&callback_mutex);
+ cpumask_andnot(cs->cpus_allowed, cs->cpus_allowed, &off_cpus);
+ mutex_unlock(&callback_mutex);
+
+ /*
+ * If sane_behavior flag is set, we need to update tasks' cpumask
+ * for empty cpuset to take on ancestor's cpumask.
+ */
+ if ((sane && cpumask_empty(cs->cpus_allowed)) ||
+ !cpumask_empty(&off_cpus))
update_tasks_cpumask(cs, NULL);
- }
- /* remove offline mems from @cs */
- if (!nodes_empty(off_mems)) {
- mutex_lock(&callback_mutex);
- nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
- mutex_unlock(&callback_mutex);
+ mutex_lock(&callback_mutex);
+ nodes_andnot(cs->mems_allowed, cs->mems_allowed, off_mems);
+ mutex_unlock(&callback_mutex);
+
+ /*
+ * If sane_behavior flag is set, we need to update tasks' nodemask
+ * for empty cpuset to take on ancestor's nodemask.
+ */
+ if ((sane && nodes_empty(cs->mems_allowed)) ||
+ !nodes_empty(off_mems))
update_tasks_nodemask(cs, NULL);
- }
is_empty = cpumask_empty(cs->cpus_allowed) ||
nodes_empty(cs->mems_allowed);
@@ -2088,11 +2171,13 @@ retry:
mutex_unlock(&cpuset_mutex);
/*
- * If @cs became empty, move tasks to the nearest ancestor with
- * execution resources. This is full cgroup operation which will
+ * If sane_behavior flag is set, we'll keep tasks in empty cpusets.
+ *
+ * Otherwise move tasks to the nearest ancestor with execution
+ * resources. This is full cgroup operation which will
* also call back into cpuset. Should be done outside any lock.
*/
- if (is_empty)
+ if (!sane && is_empty)
remove_tasks_in_empty_cpuset(cs);
}
@@ -2114,10 +2199,9 @@ retry:
*/
static void cpuset_hotplug_workfn(struct work_struct *work)
{
- static cpumask_t new_cpus, tmp_cpus;
- static nodemask_t new_mems, tmp_mems;
+ static cpumask_t new_cpus;
+ static nodemask_t new_mems;
bool cpus_updated, mems_updated;
- bool cpus_offlined, mems_offlined;
mutex_lock(&cpuset_mutex);
@@ -2126,12 +2210,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
new_mems = node_states[N_MEMORY];
cpus_updated = !cpumask_equal(top_cpuset.cpus_allowed, &new_cpus);
- cpus_offlined = cpumask_andnot(&tmp_cpus, top_cpuset.cpus_allowed,
- &new_cpus);
-
mems_updated = !nodes_equal(top_cpuset.mems_allowed, new_mems);
- nodes_andnot(tmp_mems, top_cpuset.mems_allowed, new_mems);
- mems_offlined = !nodes_empty(tmp_mems);
/* synchronize cpus_allowed to cpu_active_mask */
if (cpus_updated) {
@@ -2151,8 +2230,8 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
mutex_unlock(&cpuset_mutex);
- /* if cpus or mems went down, we need to propagate to descendants */
- if (cpus_offlined || mems_offlined) {
+ /* if cpus or mems changed, we need to propagate to descendants */
+ if (cpus_updated || mems_updated) {
struct cpuset *cs;
struct cgroup *pos_cgrp;
--
1.8.0.2
* [PATCH v3 6/7] cpuset: allow to move tasks to empty cpusets
From: Li Zefan @ 2013-06-09 9:16 UTC
To: Tejun Heo; +Cc: Cgroups, LKML, Containers
Currently some cpuset behaviors are not friendly when cpuset is co-mounted
with other cgroup controllers.
Now with this patchset if cpuset is mounted with sane_behavior option,
it behaves differently:
- Tasks will be kept in empty cpusets when hotplug happens and take
masks of ancestors with non-empty cpus/mems, instead of being moved to
an ancestor.
- A task can be moved into an empty cpuset, and again it takes masks of
ancestors, so the user can drop a task into a newly created cgroup without
having to do anything for it.
As tasks can reside in empty cpusets, here are some rules:
- They can be moved to another cpuset, regardless of whether it's empty or not.
- Though such a task takes masks from its ancestors, it takes other configs
from the empty cpuset itself.
- If the ancestors' masks are changed, those tasks will also be updated
to take the new masks.
v2: add documentation in include/linux/cgroup.h
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
include/linux/cgroup.h | 3 +++
kernel/cpuset.c | 9 +++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 53e81a6..74e8b8e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -281,6 +281,9 @@ enum {
* and take masks of ancestors with non-empty cpus/mems, instead of
* being moved to an ancestor.
*
+ * - cpuset: a task can be moved into an empty cpuset, and again it
+ * takes masks of ancestors.
+ *
* - memcg: use_hierarchy is on by default and the cgroup file for
* the flag is not created.
*
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index fd7572f..d1d6782 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -479,7 +479,7 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
*/
ret = -ENOSPC;
if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
- (cpumask_empty(trial->cpus_allowed) ||
+ (cpumask_empty(trial->cpus_allowed) &&
nodes_empty(trial->mems_allowed)))
goto out;
@@ -1466,8 +1466,13 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
mutex_lock(&cpuset_mutex);
+ /*
+ * We allow moving tasks into an empty cpuset if the sane_behavior
+ * flag is set.
+ */
ret = -ENOSPC;
- if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
+ if (!cgroup_sane_behavior(cgrp) &&
+ (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;
cgroup_taskset_for_each(task, cgrp, tset) {
--
1.8.0.2
* [PATCH v3 7/7] cpuset: fix to migrate mm correctly in a corner case
From: Li Zefan @ 2013-06-09 9:17 UTC
To: Tejun Heo; +Cc: Cgroups, LKML, Containers
Before moving tasks out of empty cpusets, update_tasks_nodemask()
is called, which calls do_migrate_pages(xx, from, to). Then those
tasks are moved to an ancestor, and do_migrate_pages() is called
again.
The first time: from = node_to_be_offlined, to = empty.
The second time: from = empty, to = ancestor's nodemask.
Migrating to or from an empty nodemask moves nothing, so no pages
end up being migrated off the node being offlined.
Fix this by:
- Don't call update_tasks_nodemask() on empty cpusets.
- Pass cs->old_mems_allowed to do_migrate_pages().
Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
kernel/cpuset.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index d1d6782..3cd68b8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1563,9 +1563,16 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
- if (is_memory_migrate(cs))
- cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
+ if (is_memory_migrate(cs)) {
+ /*
+ * old_mems_allowed is the same as mems_allowed here,
+ * except if this task is being moved automatically due
+ * to hotplug, and in this case mems_allowed is empty
+ * and old_mems_allowed is the offlined node.
+ */
+ cpuset_migrate_mm(mm, &mems_oldcs->old_mems_allowed,
&cpuset_attach_nodemask_to);
+ }
mmput(mm);
}
@@ -2155,7 +2162,7 @@ retry:
* for empty cpuset to take on ancestor's cpumask.
*/
if ((sane && cpumask_empty(cs->cpus_allowed)) ||
- !cpumask_empty(&off_cpus))
+ (!cpumask_empty(&off_cpus) && !cpumask_empty(cs->cpus_allowed)))
update_tasks_cpumask(cs, NULL);
mutex_lock(&callback_mutex);
@@ -2167,7 +2174,7 @@ retry:
* for empty cpuset to take on ancestor's nodemask.
*/
if ((sane && nodes_empty(cs->mems_allowed)) ||
- !nodes_empty(off_mems))
+ (!nodes_empty(off_mems) && !nodes_empty(cs->mems_allowed)))
update_tasks_nodemask(cs, NULL);
is_empty = cpumask_empty(cs->cpus_allowed) ||
--
1.8.0.2
* Re: [PATCH v3 2/7] cpuset: remove async hotplug propagation work
From: Tejun Heo @ 2013-06-09 15:47 UTC
To: Li Zefan; +Cc: Cgroups, Containers, LKML
On Sun, Jun 09, 2013 at 05:14:47PM +0800, Li Zefan wrote:
> As we can drop rcu read lock while iterating cgroup hierarchy,
> we don't have to do propagation asynchronously via workqueue.
>
> Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Applied 1-2 to cgroup/for-3.11-cpuset. Thanks!
--
tejun
* Re: [PATCH v3 7/7] cpuset: fix to migrate mm correctly in a corner case
From: Tejun Heo @ 2013-06-09 15:49 UTC
To: Li Zefan; +Cc: Cgroups, Containers, LKML
On Sun, Jun 09, 2013 at 05:17:03PM +0800, Li Zefan wrote:
> Before moving tasks out of empty cpusets, update_tasks_nodemask()
> is called, which calls do_migrate_pages(xx, from, to). Then those
> tasks are moved to an ancestor, and do_migrate_pages() is called
> again.
>
> The first time: from = node_to_be_offlined, to = empty.
> The second time: from = empty, to = ancestor's nodemask.
>
> so looks like no pages will be migrated.
>
> Fix this by:
>
> - Don't call update_tasks_nodemask() on empty cpusets.
> - Pass cs->old_mems_allowed to do_migrate_pages().
>
> Signed-off-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> ---
> kernel/cpuset.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index d1d6782..3cd68b8 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1563,9 +1563,16 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
> struct cpuset *mems_oldcs = effective_nodemask_cpuset(oldcs);
>
> mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
> - if (is_memory_migrate(cs))
> - cpuset_migrate_mm(mm, &mems_oldcs->mems_allowed,
> + if (is_memory_migrate(cs)) {
> + /*
> + * old_mems_allowed is the same as mems_allowed here,
> + * except if this task is being moved automatically due
> + * to hotplug, and in this case mems_allowed is empty
> + * and old_mems_allowed is the offlined node.
> + */
Wouldn't it be better if the above explained why rather than what?
> + cpuset_migrate_mm(mm, &mems_oldcs->old_mems_allowed,
> &cpuset_attach_nodemask_to);
> + }
> mmput(mm);
> }
>
> @@ -2155,7 +2162,7 @@ retry:
> * for empty cpuset to take on ancestor's cpumask.
> */
> if ((sane && cpumask_empty(cs->cpus_allowed)) ||
> - !cpumask_empty(&off_cpus))
> + (!cpumask_empty(&off_cpus) && !cpumask_empty(cs->cpus_allowed)))
> update_tasks_cpumask(cs, NULL);
No biggie but maybe we want to briefly explain the conditions being tested?
Thanks.
--
tejun
* Re: [PATCH v3 0/7] cpuset: implement sane hierarchy behaviors
From: Tejun Heo @ 2013-06-09 16:03 UTC
To: Li Zefan; +Cc: Cgroups, LKML, Containers
Hello, Li.
On Sun, Jun 09, 2013 at 05:14:02PM +0800, Li Zefan wrote:
> v2 -> v3:
> Currently some cpuset behaviors are not friendly when cpuset is co-mounted
> with other cgroup controllers.
>
> Now with this patchset if cpuset is mounted with sane_behavior option, it
> behaves differently:
>
> - Tasks will be kept in empty cpusets when hotplug happens and take masks
> of ancestors with non-empty cpus/mems, instead of being moved to an ancestor.
>
> - A task can be moved into an empty cpuset, and again it takes masks of
> ancestors, so the user can drop a task into a newly created cgroup without
> having to do anything for it.
I applied 1-2 and the rest of the series also look correct to me and
seem like a step in the right direction; however, I'm not quite sure
this is the final interface we want.
* cpus/mems_allowed changing as CPUs go up and down is nasty. There
should be separation between the configured CPUs and currently
available CPUs. The current behavior makes sense when coupled with
the irreversible task migration and all. If we're allowing tasks to
remain in empty cpusets, it only makes sense to retain and re-apply
configuration as CPUs come back online.
I find the original behavior of changing configurations as system
state changes pretty weird especially because it's happening without
any notification making it pretty difficult to use in any sort of
automated way - anything which wants to wrap cpuset would have to
track the configuration and CPU/nodes up/down states separately on
its own, which is a very easy way to introduce incoherencies.
* validate_change() rejecting updates to config if any of its
descendants are using some is weird. The config change should be
enforced in a hierarchical manner too. If the parent drops some CPUs,
it should simply drop those CPUs from the children. The same in the
other direction, children having configs which aren't fully
contained inside their parents is fine as long as the effective
masks are correct.
IOW, validate_change() doesn't really make sense if we're keeping
tasks in empty cgroups. As CPUs go down and up, we'd keep the
organization but lose the configuration, which is just weird.
I think what we want is expanding on this patchset so that we have
separate "configured" and "effective" masks, which are preferably
exposed to userland and just let the config propagation deal with
computing the effective masks as CPUs/nodes go down/up and config
changes. The code actually could be simpler that way although
there'll be complications due to the old behaviors.
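To make that concrete, here is a toy model of the split (plain C; every
field and function name is an assumption, not a committed interface):
userland writes the "configured" mask, and the kernel recomputes the
"effective" masks top-down whenever the config changes or CPUs/nodes go
up or down, falling back to the parent's effective mask when the
intersection is empty.

	#include <stdint.h>

	struct cs {
		uint64_t conf_cpus;	/* what userland configured */
		uint64_t eff_cpus;	/* what tasks actually get */
		struct cs *parent, *child, *sibling;
	};

	/* the root's eff_cpus is assumed to track the online CPUs */
	static void propagate(struct cs *cs)
	{
		struct cs *c;

		if (cs->parent) {
			cs->eff_cpus = cs->conf_cpus & cs->parent->eff_cpus;
			if (!cs->eff_cpus)	/* empty: inherit */
				cs->eff_cpus = cs->parent->eff_cpus;
		}
		for (c = cs->child; c; c = c->sibling)
			propagate(c);
	}

The configuration itself survives hotplug untouched; only the effective
masks change, and validate_change()-style rejection becomes unnecessary
since a child's config needn't be contained in its parent's.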
What do you think? If you agree, how should we proceed? We can apply
these patches and build on top if you prefer.
Thanks.
--
tejun
* Re: [PATCH v3 0/7] cpuset: implement sane hierarchy behaviors
From: Li Zefan @ 2013-06-13 7:04 UTC
To: Tejun Heo; +Cc: Cgroups, Containers, LKML
On 2013/6/10 0:03, Tejun Heo wrote:
> Hello, Li.
>
> On Sun, Jun 09, 2013 at 05:14:02PM +0800, Li Zefan wrote:
>> v2 -> v3:
>> Currently some cpuset behaviors are not friendly when cpuset is co-mounted
>> with other cgroup controllers.
>>
>> Now with this patchset if cpuset is mounted with sane_behavior option, it
>> behaves differently:
>>
>> - Tasks will be kept in empty cpusets when hotplug happens and take masks
>> of ancestors with non-empty cpus/mems, instead of being moved to an ancestor.
>>
>> - A task can be moved into an empty cpuset, and again it takes masks of
>> ancestors, so the user can drop a task into a newly created cgroup without
>> having to do anything for it.
>
> I applied 1-2 and the rest of the series also look correct to me and
> seem like a step in the right direction; however, I'm not quite sure
> this is the final interface we want.
>
> * cpus/mems_allowed changing as CPUs go up and down is nasty. There
> should be separation between the configured CPUs and currently
> available CPUs. The current behavior makes sense when coupled with
> the irreversible task migration and all. If we're allowing tasks to
> remain in empty cpusets, it only makes sense to retain and re-apply
> configuration as CPUs come back online.
>
> I find the original behavior of changing configurations as system
> state changes pretty weird especially because it's happening without
> any notification making it pretty difficult to use in any sort of
> automated way - anything which wants to wrap cpuset would have to
> track the configuration and CPU/nodes up/down states separately on
> its own, which is a very easy way to introduce incoherencies.
>
> * validate_change() rejecting updates to config if any of its
> descendants are using some is weird. The config change should be
> enforced in hierarchical manner too. If the parent drops some CPUs,
> it should simply drop those CPUs from the children. The same in the
> other direction, children having configs which aren't fully
> contained inside their parents is fine as long as the effective
> masks are correct.
>
I've just checked other cgroup controllers, and they do behave the
way you described. So yeah, it makes sense that cpuset behaves
coherently.
> IOW, validate_change() doesn't really make sense if we're keeping
> tasks in empty cgroups. As CPUs go down and up, we'd keep the
> organization but lose the configuration, which is just weird.
>
> I think what we want is expanding on this patchset so that we have
> separate "configured" and "effective" masks, which are preferably
> exposed to userland and just let the config propagation deal with
> computing the effective masks as CPUs/nodes go down/up and config
> changes. The code actually could be simpler that way although
> there'll be complications due to the old behaviors.
>
> What do you think? If you agree, how should we proceed? We can apply
> these patches and build on top if you prefer.
>
I would prefer that those patches be applied first, as the new changes can
be based on this patchset, the changes should be quite straightforward,
and I won't have to rebase those patches again.
* Re: [PATCH v3 0/7] cpuset: implement sane hierarchy behaviors
From: Tejun Heo @ 2013-06-13 17:49 UTC
To: Li Zefan; +Cc: Cgroups, LKML, Containers
On Thu, Jun 13, 2013 at 03:04:36PM +0800, Li Zefan wrote:
> I would prefer those patches are applied first, as the new changes can
> be based on this patchset, and the changes should be quite straightforward,
> and also I don't have to rebase those patches again.
Aye aye, applied the rest of the series to cgroup/for-3.11-cpuset.
Thanks!
--
tejun