* [RFC] cpuset update_cgroup_cpus_allowed
@ 2007-10-15 7:11 Paul Jackson
2007-10-15 18:49 ` David Rientjes
2007-10-15 21:24 ` Paul Menage
0 siblings, 2 replies; 15+ messages in thread
From: Paul Jackson @ 2007-10-15 7:11 UTC (permalink / raw)
To: Paul Menage, David Rientjes
Cc: nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg, ebiederm,
containers, serue, svaidy, Paul Jackson, Andrew Morton, xemul
Paul M, David R, others -- how does this look?
From: Paul Jackson <pj@sgi.com>
Update the per-task cpus_allowed of each task in a cgroup
whenever it has a cpuset whose 'cpus' mask changes.
The change to basing cpusets on the cgroup (aka container)
infrastructure broke an essential cpuset hack. The old cpuset
code had used the act of reattaching a task to its own cpuset
(writing its pid back into the same 'tasks' file it was already
in) to trigger the code that updates the cpus_allowed cpumask
in the task struct to the cpus_allowed cpumask dictated by that
tasks cpuset.
This was a hack to avoid having code in the main scheduler
code path that checked for changes in the cpus_allowed by each
tasks cpuset, which would have unacceptable performance impact
on the scheduler.
The cgroup code avoids calling the update callout if a task
is reattached to the cgroup it is already attached to do.
This turned reattaching a task to its own cpuset into a no-op,
making it impossible to change a tasks CPU placement by changing
the cpus_allowed of the cpuset containing that task.
The right thing to do would be to have the code that updates a
cpusets cpus_allowed walk through each task currently in that
cpuset and update the cpus_allowed in that tasks task_struct.
This change does that, adding code called from cpuset
update_cpumask() that updates the task_struct cpus_allowed of
each task in a cgroup whenever it has a cpuset whose 'cpus'
is changed.
Signed-off-by: Paul Jackson <pj@sgi.com>
---
This patch applies anywhere after:
cpusets-decrustify-cpuset-mask-update-code.patch
Documentation/cpusets.txt | 23 +++++----------
kernel/cpuset.c | 68 +++++++++++++++++++++++++++++++++++++++++-----
kernel/sched.c | 3 ++
mm/pdflush.c | 3 ++
4 files changed, 76 insertions(+), 21 deletions(-)
--- 2.6.23-mm1.orig/kernel/cpuset.c 2007-10-14 22:24:56.268309633 -0700
+++ 2.6.23-mm1/kernel/cpuset.c 2007-10-14 22:34:52.645364388 -0700
@@ -677,6 +677,64 @@ done:
}
/*
+ * update_cgroup_cpus_allowed(cont, cpus)
+ *
+ * Keep looping over the tasks in cgroup 'cont', up to 'ntasks'
+ * tasks at a time, setting each task->cpus_allowed to 'cpus',
+ * until all tasks in the cgroup have that cpus_allowed setting.
+ *
+ * The 'set_cpus_allowed()' call cannot be made while holding the
+ * css_set_lock lock embedded in the cgroup_iter_* calls, so we stash
+ * some task pointers, in the tasks[] array on the stack, then drop
+ * that lock (cgroup_iter_end) before looping over the stashed tasks
+ * to update their cpus_allowed fields.
+ *
+ * Making the const 'ntasks' larger would use more stack space (bad),
+ * and reduce the number of cgroup_iter_start/cgroup_iter_end calls
+ * (good). But perhaps more importantly, it could allow any bugs
+ * lurking in the 'need_repeat' looping logic to remain hidden longer.
+ * So keep ntasks rather small, to ensure any bugs in this loop logic
+ * are exposed quickly.
+ */
+static void update_cgroup_cpus_allowed(struct cgroup *cont, cpumask_t *cpus)
+{
+ int need_repeat = true;
+
+ while (need_repeat) {
+ struct cgroup_iter it;
+ const int ntasks = 10;
+ struct task_struct *tasks[ntasks];
+ struct task_struct **p, **q;
+
+ need_repeat = false;
+ p = tasks;
+
+ cgroup_iter_start(cont, &it);
+ while (1) {
+ struct task_struct *t;
+
+ t = cgroup_iter_next(cont, &it);
+ if (!t)
+ break;
+ if (cpus_equal(*cpus, t->cpus_allowed))
+ continue;
+ if (p == tasks + ntasks) {
+ need_repeat = true;
+ break;
+ }
+ get_task_struct(t);
+ *p++ = t;
+ }
+ cgroup_iter_end(cont, &it);
+
+ for (q = tasks; q < p; q++) {
+ set_cpus_allowed(*q, *cpus);
+ put_task_struct(*q);
+ }
+ }
+}
+
+/*
* Call with manage_mutex held. May take callback_mutex during call.
*/
@@ -684,7 +742,6 @@ static int update_cpumask(struct cpuset
{
struct cpuset trialcs;
int retval;
- int cpus_changed, is_load_balanced;
/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
if (cs == &top_cpuset)
@@ -713,16 +770,15 @@ static int update_cpumask(struct cpuset
if (retval < 0)
return retval;
- cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
- is_load_balanced = is_sched_load_balance(&trialcs);
+ if (cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed))
+ return 0;
mutex_lock(&callback_mutex);
cs->cpus_allowed = trialcs.cpus_allowed;
mutex_unlock(&callback_mutex);
- if (cpus_changed && is_load_balanced)
- rebuild_sched_domains();
-
+ update_cgroup_cpus_allowed(cs->css.cgroup, &cs->cpus_allowed);
+ rebuild_sched_domains();
return 0;
}
--- 2.6.23-mm1.orig/Documentation/cpusets.txt 2007-10-14 22:24:56.236309148 -0700
+++ 2.6.23-mm1/Documentation/cpusets.txt 2007-10-14 22:25:59.953276792 -0700
@@ -523,21 +523,14 @@ from one cpuset to another, then the ker
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.
-If a cpuset has its CPUs modified, then each task using that
-cpuset does _not_ change its behavior automatically. In order to
-minimize the impact on the critical scheduling code in the kernel,
-tasks will continue to use their prior CPU placement until they
-are rebound to their cpuset, by rewriting their pid to the 'tasks'
-file of their cpuset. If a task had been bound to some subset of its
-cpuset using the sched_setaffinity() call, and if any of that subset
-is still allowed in its new cpuset settings, then the task will be
-restricted to the intersection of the CPUs it was allowed on before,
-and its new cpuset CPU placement. If, on the other hand, there is
-no overlap between a tasks prior placement and its new cpuset CPU
-placement, then the task will be allowed to run on any CPU allowed
-in its new cpuset. If a task is moved from one cpuset to another,
-its CPU placement is updated in the same way as if the tasks pid is
-rewritten to the 'tasks' file of its current cpuset.
+If a cpuset has its 'cpus' modified, then each task in that cpuset
+will have its allowed CPU placement changed immediately. Similarly,
+if a tasks pid is written to a cpusets 'tasks' file, in either its
+current cpuset or another cpuset, then its allowed CPU placement is
+changed immediately. If such a task had been bound to some subset
+of its cpuset using the sched_setaffinity() call, the task will be
+allowed to run on any CPU allowed in its new cpuset, negating the
+affect of the prior sched_setaffinity() call.
In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
--- 2.6.23-mm1.orig/kernel/sched.c 2007-10-14 22:24:56.340310725 -0700
+++ 2.6.23-mm1/kernel/sched.c 2007-10-14 22:25:59.973277096 -0700
@@ -51,6 +51,7 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
+#include <linux/cgroup.h>
#include <linux/percpu.h>
#include <linux/cpu_acct.h>
#include <linux/kthread.h>
@@ -4335,9 +4336,11 @@ long sched_setaffinity(pid_t pid, cpumas
if (retval)
goto out_unlock;
+ cgroup_lock();
cpus_allowed = cpuset_cpus_allowed(p);
cpus_and(new_mask, new_mask, cpus_allowed);
retval = set_cpus_allowed(p, new_mask);
+ cgroup_unlock();
out_unlock:
put_task_struct(p);
--- 2.6.23-mm1.orig/mm/pdflush.c 2007-10-14 22:23:28.710981177 -0700
+++ 2.6.23-mm1/mm/pdflush.c 2007-10-14 22:25:59.989277340 -0700
@@ -21,6 +21,7 @@
#include <linux/writeback.h> // Prototypes pdflush_operation()
#include <linux/kthread.h>
#include <linux/cpuset.h>
+#include <linux/cgroup.h>
#include <linux/freezer.h>
@@ -187,8 +188,10 @@ static int pdflush(void *dummy)
* This is needed as pdflush's are dynamically created and destroyed.
* The boottime pdflush's are easily placed w/o these 2 lines.
*/
+ cgroup_lock();
cpus_allowed = cpuset_cpus_allowed(current);
set_cpus_allowed(current, cpus_allowed);
+ cgroup_unlock();
return __pdflush(&my_work);
}
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-15 7:11 [RFC] cpuset update_cgroup_cpus_allowed Paul Jackson
@ 2007-10-15 18:49 ` David Rientjes
2007-10-16 2:32 ` Paul Jackson
2007-10-16 6:07 ` Paul Jackson
2007-10-15 21:24 ` Paul Menage
1 sibling, 2 replies; 15+ messages in thread
From: David Rientjes @ 2007-10-15 18:49 UTC (permalink / raw)
To: Paul Jackson
Cc: Paul Menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, Andrew Morton, xemul
On Mon, 15 Oct 2007, Paul Jackson wrote:
> --- 2.6.23-mm1.orig/kernel/cpuset.c 2007-10-14 22:24:56.268309633 -0700
> +++ 2.6.23-mm1/kernel/cpuset.c 2007-10-14 22:34:52.645364388 -0700
> @@ -677,6 +677,64 @@ done:
> }
>
> /*
> + * update_cgroup_cpus_allowed(cont, cpus)
> + *
> + * Keep looping over the tasks in cgroup 'cont', up to 'ntasks'
> + * tasks at a time, setting each task->cpus_allowed to 'cpus',
> + * until all tasks in the cgroup have that cpus_allowed setting.
> + *
> + * The 'set_cpus_allowed()' call cannot be made while holding the
> + * css_set_lock lock embedded in the cgroup_iter_* calls, so we stash
> + * some task pointers, in the tasks[] array on the stack, then drop
> + * that lock (cgroup_iter_end) before looping over the stashed tasks
> + * to update their cpus_allowed fields.
> + *
> + * Making the const 'ntasks' larger would use more stack space (bad),
> + * and reduce the number of cgroup_iter_start/cgroup_iter_end calls
> + * (good). But perhaps more importantly, it could allow any bugs
> + * lurking in the 'need_repeat' looping logic to remain hidden longer.
> + * So keep ntasks rather small, to ensure any bugs in this loop logic
> + * are exposed quickly.
> + */
> +static void update_cgroup_cpus_allowed(struct cgroup *cont, cpumask_t *cpus)
> +{
> + int need_repeat = true;
> +
> + while (need_repeat) {
> + struct cgroup_iter it;
> + const int ntasks = 10;
> + struct task_struct *tasks[ntasks];
> + struct task_struct **p, **q;
> +
> + need_repeat = false;
> + p = tasks;
> +
> + cgroup_iter_start(cont, &it);
> + while (1) {
> + struct task_struct *t;
> +
> + t = cgroup_iter_next(cont, &it);
> + if (!t)
> + break;
> + if (cpus_equal(*cpus, t->cpus_allowed))
> + continue;
By making this cpus_equal() and not cpus_intersects(), you're trying to
make sure that t->cpus_allowed is always equal to *cpus for each task in
the iterator.
> + if (p == tasks + ntasks) {
> + need_repeat = true;
> + break;
> + }
> + get_task_struct(t);
> + *p++ = t;
> + }
> + cgroup_iter_end(cont, &it);
> +
> + for (q = tasks; q < p; q++) {
> + set_cpus_allowed(*q, *cpus);
> + put_task_struct(*q);
> + }
> + }
> +}
Yet by not doing any locking here to prevent a cpu from being
hot-unplugged, you can race and allow the hot-unplug event to happen
before calling set_cpus_allowed(). That makes this entire function a
no-op with set_cpus_allowed() returning -EINVAL for every call, which
isn't caught, and no error is reported to userspace.
Now all the tasks in the cpuset have an inconsistent state with respect to
their p->cpuset->cpus_allowed, because that was already updated in
update_cpumask(). When userspace checks that value via the 'cpus' file,
this is the value returned which is actually not true at all for any of
the tasks in 'tasks'.
David
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-15 18:49 ` David Rientjes
@ 2007-10-16 2:32 ` Paul Jackson
2007-10-16 6:07 ` Paul Jackson
1 sibling, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2007-10-16 2:32 UTC (permalink / raw)
To: David Rientjes
Cc: menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
> Yet by not doing any locking here to prevent a cpu from being
> hot-unplugged, you can race and allow the hot-unplug event to happen
> before calling set_cpus_allowed(). That makes this entire function a
> no-op with set_cpus_allowed() returning -EINVAL for every call, which
> isn't caught, and no error is reported to userspace.
Good point ... hmmm ...
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-15 18:49 ` David Rientjes
2007-10-16 2:32 ` Paul Jackson
@ 2007-10-16 6:07 ` Paul Jackson
2007-10-16 6:21 ` David Rientjes
1 sibling, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2007-10-16 6:07 UTC (permalink / raw)
To: David Rientjes
Cc: menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
> > + if (cpus_equal(*cpus, t->cpus_allowed))
> > + continue;
> > ...
> > + for (q = tasks; q < p; q++) {
> > + set_cpus_allowed(*q, *cpus);
> > + put_task_struct(*q);
> > + }
> > + }
> > +}
>
> Yet by not doing any locking here to prevent a cpu from being
> hot-unplugged, you can race and allow the hot-unplug event to happen
> before calling set_cpus_allowed(). That makes this entire function a
> no-op with set_cpus_allowed() returning -EINVAL for every call, which
> isn't caught, and no error is reported to userspace.
>
> Now all the tasks in the cpuset have an inconsistent state with respect to
> their p->cpuset->cpus_allowed, because that was already updated in
> update_cpumask().
My solution may be worse than that. Because set_cpus_allowed() will
fail if asked to set a non-overlapping cpumask, my solution could never
terminate. If asked to set a cpusets cpus to something that went off
line right then, this I'd guess this code could keep looping forever,
looking for cpumasks that didn't match, and then not noticing that it
was failing to set them so as they would match.
... it needs work ... or the alternative solution from Paul M.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-16 6:07 ` Paul Jackson
@ 2007-10-16 6:21 ` David Rientjes
2007-10-16 9:16 ` Paul Jackson
0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2007-10-16 6:21 UTC (permalink / raw)
To: Paul Jackson
Cc: menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
On Mon, 15 Oct 2007, Paul Jackson wrote:
> My solution may be worse than that. Because set_cpus_allowed() will
> fail if asked to set a non-overlapping cpumask, my solution could never
> terminate. If asked to set a cpusets cpus to something that went off
> line right then, this I'd guess this code could keep looping forever,
> looking for cpumasks that didn't match, and then not noticing that it
> was failing to set them so as they would match.
>
Why can't you just add a helper function to sched.c:
void set_hotcpus_allowed(struct task_struct *task,
cpumask_t cpumask)
{
mutex_lock(&sched_hotcpu_mutex);
set_cpus_allowed(task, cpumask);
mutex_unlock(&sched_hotcpu_mutex);
}
And then change each task's cpus_allowed via that function instead of
set_cpus_allowed() directly?
You don't need to worry about making the task->cpuset->cpus_allowed
assignment a critical section because common_cpu_mem_hotplug_unplug() will
remove any hot-unplugged cpus from each cpuset's cpus_allowed in the
hierarchy.
Your loop will still need to be reworked so that cgroup_iter_{start,end}()
are not reinvoked unnecessarily and you rely only on cgroup_iter_next()
returning NULL to determine when you've gone through the entire list.
There's no need to go back and check the cpus_allowed of tasks you've
already called set_cpus_allowed() on either directly or indirectly via my
helper function above.
David
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-16 6:21 ` David Rientjes
@ 2007-10-16 9:16 ` Paul Jackson
2007-10-16 18:27 ` David Rientjes
0 siblings, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2007-10-16 9:16 UTC (permalink / raw)
To: David Rientjes
Cc: menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
David wrote:
> Why can't you just add a helper function to sched.c:
>
> void set_hotcpus_allowed(struct task_struct *task,
> cpumask_t cpumask)
> {
> mutex_lock(&sched_hotcpu_mutex);
> set_cpus_allowed(task, cpumask);
> mutex_unlock(&sched_hotcpu_mutex);
> }
>
> And then change each task's cpus_allowed via that function instead of
> set_cpus_allowed() directly?
I guess this would avoid race conditions within the set_cpus_allowed()
routine, between its code to read the cpu_online_map and set the tasks
cpus_allowed ... though if that's useful, don't we really need to add
locking/unlocking on sched_hotcpu_mutex right inside the
set_cpus_allowed() routine, for all users of set_cpus_allowed ??
But I don't see where the above code helps at all deal with the
races I considered in my previous message:
> My solution may be worse than that. Because set_cpus_allowed() will
> fail if asked to set a non-overlapping cpumask, my solution could never
> terminate. If asked to set a cpusets cpus to something that went off
> line right then, this I'd guess this code could keep looping forever,
> looking for cpumasks that didn't match, and then not noticing that it
> was failing to set them so as they would match.
These races involve reading the tasks cpuset cpus_allowed mask, reading
the online map, and both reading and writing the tasks task_struct
cpus_allowed. Unless one holds the relevant lock for the entire
interval surrounding the critical accesses to these values, it won't do
any good that I can see. Just briefly holding a lock around each
separate access is useless.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-16 9:16 ` Paul Jackson
@ 2007-10-16 18:27 ` David Rientjes
2007-10-16 23:14 ` Paul Jackson
0 siblings, 1 reply; 15+ messages in thread
From: David Rientjes @ 2007-10-16 18:27 UTC (permalink / raw)
To: Paul Jackson
Cc: menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
On Tue, 16 Oct 2007, Paul Jackson wrote:
> David wrote:
> > Why can't you just add a helper function to sched.c:
> >
> > void set_hotcpus_allowed(struct task_struct *task,
> > cpumask_t cpumask)
> > {
> > mutex_lock(&sched_hotcpu_mutex);
> > set_cpus_allowed(task, cpumask);
> > mutex_unlock(&sched_hotcpu_mutex);
> > }
> >
> > And then change each task's cpus_allowed via that function instead of
> > set_cpus_allowed() directly?
>
> I guess this would avoid race conditions within the set_cpus_allowed()
> routine, between its code to read the cpu_online_map and set the tasks
> cpus_allowed ... though if that's useful, don't we really need to add
> locking/unlocking on sched_hotcpu_mutex right inside the
> set_cpus_allowed() routine, for all users of set_cpus_allowed ??
>
Not necessarily because migration only occurs to any online cpu in the
mask, it won't attempt to migrate it to some cpu that has been downed.
> > My solution may be worse than that. Because set_cpus_allowed() will
> > fail if asked to set a non-overlapping cpumask, my solution could never
> > terminate. If asked to set a cpusets cpus to something that went off
> > line right then, this I'd guess this code could keep looping forever,
> > looking for cpumasks that didn't match, and then not noticing that it
> > was failing to set them so as they would match.
>
> These races involve reading the tasks cpuset cpus_allowed mask, reading
> the online map, and both reading and writing the tasks task_struct
> cpus_allowed. Unless one holds the relevant lock for the entire
> interval surrounding the critical accesses to these values, it won't do
> any good that I can see. Just briefly holding a lock around each
> separate access is useless.
>
It's not useless since a cpu hot-unplug event automatically updates the
cpus_allowed for all cpusets in the hierarchy by removing it.
You could protect the entire reassignment with the same mutex since it's
so lightly contended, but it's unnecessary. All it would guarantee is
that cpuset->cpus_allowed is consistent with task->cpus_allowed for all
tasks in cpuset.
Something like the following (untested) patch, but adding a
set_hotcpus_allowed() helper would also work if it's protected by
sched_hotcpu_mutex.
---
include/linux/sched.h | 8 +++++
kernel/cpuset.c | 79 ++++++++++++++++++++++++++++--------------------
kernel/sched.c | 13 ++++++--
mm/pdflush.c | 3 --
4 files changed, 64 insertions(+), 39 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1391,6 +1391,8 @@ static inline void put_task_struct(struct task_struct *t)
#ifdef CONFIG_SMP
extern int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask);
+extern void sched_hotcpu_lock(void);
+extern void sched_hotcpu_unlock(void);
#else
static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
{
@@ -1398,6 +1400,12 @@ static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
return -EINVAL;
return 0;
}
+static inline void sched_hotcpu_lock(void)
+{
+}
+static inline void sched_hotcpu_unlock(void)
+{
+}
#endif
extern unsigned long long sched_clock(void);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -706,40 +706,51 @@ done:
*/
static void update_cgroup_cpus_allowed(struct cgroup *cont, cpumask_t *cpus)
{
- int need_repeat = true;
-
- while (need_repeat) {
- struct cgroup_iter it;
- const int ntasks = 10;
- struct task_struct *tasks[ntasks];
- struct task_struct **p, **q;
-
- need_repeat = false;
- p = tasks;
-
- cgroup_iter_start(cont, &it);
- while (1) {
- struct task_struct *t;
-
- t = cgroup_iter_next(cont, &it);
- if (!t)
- break;
- if (cpus_equal(*cpus, t->cpus_allowed))
- continue;
- if (p == tasks + ntasks) {
- need_repeat = true;
- break;
- }
- get_task_struct(t);
- *p++ = t;
- }
- cgroup_iter_end(cont, &it);
+ struct cgroup_iter it;
+ struct task_struct *p, **tasks;
+ const int max_tasks = 10;
+ int nr_tasks;
+ int i = 0;
+
+ cgroup_iter_start(cont, &it);
+repeat:
+ for (nr_tasks = max_tasks - 1;
+ (p = cgroup_iter_next(cont, &it)); nr_tasks--) {
+ /*
+ * If the cpumask is already equal for this task, there's no
+ * reason to call set_cpus_allowed() because no change is
+ * needed and no migration is required.
+ */
+ if (cpus_equal(p->cpus_allowed, *cpus))
+ continue;
- for (q = tasks; q < p; q++) {
- set_cpus_allowed(*q, *cpus);
- put_task_struct(*q);
- }
+ /*
+ * Save a reference to the task structure so it doesn't exit
+ * prematurely.
+ */
+ get_task_struct(p);
+ tasks[i++] = p;
+ if (!nr_tasks)
+ goto set_allowed;
+ }
+ cgroup_iter_end(cont, &it);
+
+set_allowed:
+ while (--i >= 0) {
+ /*
+ * Update the cpus_allowed for this task and migrate it if
+ * necessary. Then, decrement the task's usage counter.
+ */
+ set_cpus_allowed(tasks[i], *cpus);
+ put_task_struct(tasks[i]);
}
+
+ /*
+ * We may need to continue iterating through more tasks if we exhausted
+ * our stack from the previous set.
+ */
+ if (!nr_tasks)
+ goto repeat;
}
/*
@@ -779,11 +790,13 @@ static int update_cpumask(struct cpuset *cs, char *buf)
if (cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed))
return 0;
+ sched_hotcpu_lock();
mutex_lock(&callback_mutex);
cs->cpus_allowed = trialcs.cpus_allowed;
mutex_unlock(&callback_mutex);
-
update_cgroup_cpus_allowed(cs->css.cgroup, &cs->cpus_allowed);
+ sched_hotcpu_unlock();
+
rebuild_sched_domains();
return 0;
}
diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -51,7 +51,6 @@
#include <linux/rcupdate.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
-#include <linux/cgroup.h>
#include <linux/percpu.h>
#include <linux/cpu_acct.h>
#include <linux/kthread.h>
@@ -363,6 +362,16 @@ struct rq {
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
static DEFINE_MUTEX(sched_hotcpu_mutex);
+void sched_hotcpu_lock(void)
+{
+ mutex_lock(&sched_hotcpu_mutex);
+}
+
+void sched_hotcpu_unlock(void)
+{
+ mutex_unlock(&sched_hotcpu_mutex);
+}
+
static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
{
rq->curr->sched_class->check_preempt_curr(rq, p);
@@ -4365,11 +4374,9 @@ long sched_setaffinity(pid_t pid, cpumask_t new_mask)
if (retval)
goto out_unlock;
- cgroup_lock();
cpus_allowed = cpuset_cpus_allowed(p);
cpus_and(new_mask, new_mask, cpus_allowed);
retval = set_cpus_allowed(p, new_mask);
- cgroup_unlock();
out_unlock:
put_task_struct(p);
diff --git a/mm/pdflush.c b/mm/pdflush.c
--- a/mm/pdflush.c
+++ b/mm/pdflush.c
@@ -21,7 +21,6 @@
#include <linux/writeback.h> // Prototypes pdflush_operation()
#include <linux/kthread.h>
#include <linux/cpuset.h>
-#include <linux/cgroup.h>
#include <linux/freezer.h>
@@ -188,10 +187,8 @@ static int pdflush(void *dummy)
* This is needed as pdflush's are dynamically created and destroyed.
* The boottime pdflush's are easily placed w/o these 2 lines.
*/
- cgroup_lock();
cpus_allowed = cpuset_cpus_allowed(current);
set_cpus_allowed(current, cpus_allowed);
- cgroup_unlock();
return __pdflush(&my_work);
}
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-16 18:27 ` David Rientjes
@ 2007-10-16 23:14 ` Paul Jackson
0 siblings, 0 replies; 15+ messages in thread
From: Paul Jackson @ 2007-10-16 23:14 UTC (permalink / raw)
To: David Rientjes
Cc: menage, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
David wrote:
> Not necessarily because migration only occurs to any online cpu in the
> mask, it won't attempt to migrate it to some cpu that has been downed.
> ...
... one of David or I is insane ... I can't tell which one yet,
perhaps both of us ;).
I'm going to reply to David without all the CC list above.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-15 7:11 [RFC] cpuset update_cgroup_cpus_allowed Paul Jackson
2007-10-15 18:49 ` David Rientjes
@ 2007-10-15 21:24 ` Paul Menage
2007-10-16 0:16 ` Paul Jackson
1 sibling, 1 reply; 15+ messages in thread
From: Paul Menage @ 2007-10-15 21:24 UTC (permalink / raw)
To: Paul Jackson
Cc: David Rientjes, nickpiggin, a.p.zijlstra, balbir, linux-kernel,
clg, ebiederm, containers, serue, svaidy, Andrew Morton, xemul
Paul Jackson wrote:
> Paul M, David R, others -- how does this look?
>
Looks plausible, although as David comments I don't think it handles a concurrent CPU hotplug/unplug. Also I don't like the idea of doing a cgroup_lock() across sched_setaffinity() - cgroup_lock() can be held for relatively long periods of time.
Here's an alternative for consideration, below. The main differences are:
- currently against an older kernel with pre-cgroup cpusets, so it uses tasklist_lock and do_each_thread(); a cgroup version would use cgroup iterators as yours does
- solves the race between sched_setaffinity() and update_cpumask() by having sched_setaffinity() check for changes to cpuset_cpus_allowed() after doing set_cpus_allowed()
- guarantees to only act on each process once (so guarantees forward progress, in the absence of fork bombs. (And could be adapted to handle fork bombs too)
- uses a priority heap to pick the processes to act on, based on start time
- uses lock_cpu_hotplug() to avoid races with CPU hotplug; sadly I think this is gone in more recent kernels, so some other synchronization would be needed
Paul
> From: Paul Jackson <pj@sgi.com>
>
> Update the per-task cpus_allowed of each task in a cgroup
> whenever it has a cpuset whose 'cpus' mask changes.
>
> The change to basing cpusets on the cgroup (aka container)
> infrastructure broke an essential cpuset hack. The old cpuset
> code had used the act of reattaching a task to its own cpuset
> (writing its pid back into the same 'tasks' file it was already
> in) to trigger the code that updates the cpus_allowed cpumask
> in the task struct to the cpus_allowed cpumask dictated by that
> tasks cpuset.
>
> This was a hack to avoid having code in the main scheduler
> code path that checked for changes in the cpus_allowed by each
> tasks cpuset, which would have unacceptable performance impact
> on the scheduler.
>
> The cgroup code avoids calling the update callout if a task
> is reattached to the cgroup it is already attached to do.
> This turned reattaching a task to its own cpuset into a no-op,
> making it impossible to change a tasks CPU placement by changing
> the cpus_allowed of the cpuset containing that task.
>
> The right thing to do would be to have the code that updates a
> cpusets cpus_allowed walk through each task currently in that
> cpuset and update the cpus_allowed in that tasks task_struct.
>
> This change does that, adding code called from cpuset
> update_cpumask() that updates the task_struct cpus_allowed of
> each task in a cgroup whenever it has a cpuset whose 'cpus'
> is changed.
>
> Signed-off-by: Paul Jackson <pj@sgi.com>
>
> ---
>
> This patch applies anywhere after:
> cpusets-decrustify-cpuset-mask-update-code.patch
>
> Documentation/cpusets.txt | 23 +++++----------
> kernel/cpuset.c | 68 +++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched.c | 3 ++
> mm/pdflush.c | 3 ++
> 4 files changed, 76 insertions(+), 21 deletions(-)
>
> --- 2.6.23-mm1.orig/kernel/cpuset.c 2007-10-14 22:24:56.268309633 -0700
> +++ 2.6.23-mm1/kernel/cpuset.c 2007-10-14 22:34:52.645364388 -0700
> @@ -677,6 +677,64 @@ done:
> }
>
> /*
> + * update_cgroup_cpus_allowed(cont, cpus)
> + *
> + * Keep looping over the tasks in cgroup 'cont', up to 'ntasks'
> + * tasks at a time, setting each task->cpus_allowed to 'cpus',
> + * until all tasks in the cgroup have that cpus_allowed setting.
> + *
> + * The 'set_cpus_allowed()' call cannot be made while holding the
> + * css_set_lock lock embedded in the cgroup_iter_* calls, so we stash
> + * some task pointers, in the tasks[] array on the stack, then drop
> + * that lock (cgroup_iter_end) before looping over the stashed tasks
> + * to update their cpus_allowed fields.
> + *
> + * Making the const 'ntasks' larger would use more stack space (bad),
> + * and reduce the number of cgroup_iter_start/cgroup_iter_end calls
> + * (good). But perhaps more importantly, it could allow any bugs
> + * lurking in the 'need_repeat' looping logic to remain hidden longer.
> + * So keep ntasks rather small, to ensure any bugs in this loop logic
> + * are exposed quickly.
> + */
> +static void update_cgroup_cpus_allowed(struct cgroup *cont, cpumask_t *cpus)
> +{
> + int need_repeat = true;
> +
> + while (need_repeat) {
> + struct cgroup_iter it;
> + const int ntasks = 10;
> + struct task_struct *tasks[ntasks];
> + struct task_struct **p, **q;
> +
> + need_repeat = false;
> + p = tasks;
> +
> + cgroup_iter_start(cont, &it);
> + while (1) {
> + struct task_struct *t;
> +
> + t = cgroup_iter_next(cont, &it);
> + if (!t)
> + break;
> + if (cpus_equal(*cpus, t->cpus_allowed))
> + continue;
> + if (p == tasks + ntasks) {
> + need_repeat = true;
> + break;
> + }
> + get_task_struct(t);
> + *p++ = t;
> + }
> + cgroup_iter_end(cont, &it);
> +
> + for (q = tasks; q < p; q++) {
> + set_cpus_allowed(*q, *cpus);
> + put_task_struct(*q);
> + }
> + }
> +}
> +
> +/*
> * Call with manage_mutex held. May take callback_mutex during call.
> */
>
> @@ -684,7 +742,6 @@ static int update_cpumask(struct cpuset
> {
> struct cpuset trialcs;
> int retval;
> - int cpus_changed, is_load_balanced;
>
> /* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
> if (cs == &top_cpuset)
> @@ -713,16 +770,15 @@ static int update_cpumask(struct cpuset
> if (retval < 0)
> return retval;
>
> - cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
> - is_load_balanced = is_sched_load_balance(&trialcs);
> + if (cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed))
> + return 0;
>
> mutex_lock(&callback_mutex);
> cs->cpus_allowed = trialcs.cpus_allowed;
> mutex_unlock(&callback_mutex);
>
> - if (cpus_changed && is_load_balanced)
> - rebuild_sched_domains();
> -
> + update_cgroup_cpus_allowed(cs->css.cgroup, &cs->cpus_allowed);
> + rebuild_sched_domains();
> return 0;
> }
>
> --- 2.6.23-mm1.orig/Documentation/cpusets.txt 2007-10-14 22:24:56.236309148 -0700
> +++ 2.6.23-mm1/Documentation/cpusets.txt 2007-10-14 22:25:59.953276792 -0700
> @@ -523,21 +523,14 @@ from one cpuset to another, then the ker
> memory placement, as above, the next time that the kernel attempts
> to allocate a page of memory for that task.
>
> -If a cpuset has its CPUs modified, then each task using that
> -cpuset does _not_ change its behavior automatically. In order to
> -minimize the impact on the critical scheduling code in the kernel,
> -tasks will continue to use their prior CPU placement until they
> -are rebound to their cpuset, by rewriting their pid to the 'tasks'
> -file of their cpuset. If a task had been bound to some subset of its
> -cpuset using the sched_setaffinity() call, and if any of that subset
> -is still allowed in its new cpuset settings, then the task will be
> -restricted to the intersection of the CPUs it was allowed on before,
> -and its new cpuset CPU placement. If, on the other hand, there is
> -no overlap between a tasks prior placement and its new cpuset CPU
> -placement, then the task will be allowed to run on any CPU allowed
> -in its new cpuset. If a task is moved from one cpuset to another,
> -its CPU placement is updated in the same way as if the tasks pid is
> -rewritten to the 'tasks' file of its current cpuset.
> +If a cpuset has its 'cpus' modified, then each task in that cpuset
> +will have its allowed CPU placement changed immediately. Similarly,
> +if a tasks pid is written to a cpusets 'tasks' file, in either its#12 - /usr/local/google/home/menage/kernel9/linux/kernel/cpuset.c ====
# action=edit type=text
> +current cpuset or another cpuset, then its allowed CPU placement is
> +changed immediately. If such a task had been bound to some subset
> +of its cpuset using the sched_setaffinity() call, the task will be
> +allowed to run on any CPU allowed in its new cpuset, negating the
> +affect of the prior sched_setaffinity() call.
>
> In summary, the memory placement of a task whose cpuset is changed is
> updated by the kernel, on the next allocation of a page for that task,
> --- 2.6.23-mm1.orig/kernel/sched.c 2007-10-14 22:24:56.340310725 -0700
> +++ 2.6.23-mm1/kernel/sched.c 2007-10-14 22:25:59.973277096 -0700
> @@ -51,6 +51,7 @@
> #include <linux/rcupdate.h>
> #include <linux/cpu.h>
> #include <linux/cpuset.h>
> +#include <linux/cgroup.h>
> #include <linux/percpu.h>
> #include <linux/cpu_acct.h>
> #include <linux/kthread.h>
> @@ -4335,9 +4336,11 @@ long sched_setaffinity(pid_t pid, cpumas
> if (retval)
> goto out_unlock;
>
> + cgroup_lock();
> cpus_allowed = cpuset_cpus_allowed(p);
> cpus_and(new_mask, new_mask, cpus_allowed);
> retval = set_cpus_allowed(p, new_mask);
> + cgroup_unlock();
>
> out_unlock:
> put_task_struct(p);
> --- 2.6.23-mm1.orig/mm/pdflush.c 2007-10-14 22:23:28.710981177 -0700
> +++ 2.6.23-mm1/mm/pdflush.c 2007-10-14 22:25:59.989277340 -0700
> @@ -21,6 +21,7 @@
> #include <linux/writeback.h> // Prototypes pdflush_operation()
> #include <linux/kthread.h>
> #include <linux/cpuset.h>
> +#include <linux/cgroup.h>
> #include <linux/freezer.h>
>
>
> @@ -187,8 +188,10 @@ static int pdflush(void *dummy)
> * This is needed as pdflush's are dynamically created and destroyed.
> * The boottime pdflush's are easily placed w/o these 2 lines.
> */
> + cgroup_lock();
> cpus_allowed = cpuset_cpus_allowed(current);
> set_cpus_allowed(current, cpus_allowed);
> + cgroup_unlock();
>
> return __pdflush(&my_work);
> }
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [RFC] cpuset update_cgroup_cpus_allowed
2007-10-15 21:24 ` Paul Menage
@ 2007-10-16 0:16 ` Paul Jackson
[not found] ` <20071015171636.6213bf43.pj-sJ/iWh9BUns@public.gmane.org>
0 siblings, 1 reply; 15+ messages in thread
From: Paul Jackson @ 2007-10-16 0:16 UTC (permalink / raw)
To: Paul Menage
Cc: rientjes, nickpiggin, a.p.zijlstra, balbir, linux-kernel, clg,
ebiederm, containers, serue, svaidy, akpm, xemul
Paul M wrote:
> Here's an alternative for consideration, below.
I don't see the alternative -- I just see my patch, with the added
blurbage:
#12 - /usr/local/google/home/menage/kernel9/linux/kernel/cpuset.c ====
# action=edit type=text
Should I be increasing my caffeine intake?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2007-10-16 23:14 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-15 7:11 [RFC] cpuset update_cgroup_cpus_allowed Paul Jackson
2007-10-15 18:49 ` David Rientjes
2007-10-16 2:32 ` Paul Jackson
2007-10-16 6:07 ` Paul Jackson
2007-10-16 6:21 ` David Rientjes
2007-10-16 9:16 ` Paul Jackson
2007-10-16 18:27 ` David Rientjes
2007-10-16 23:14 ` Paul Jackson
2007-10-15 21:24 ` Paul Menage
2007-10-16 0:16 ` Paul Jackson
[not found] ` <20071015171636.6213bf43.pj-sJ/iWh9BUns@public.gmane.org>
2007-10-16 0:20 ` Paul Menage
2007-10-16 2:34 ` Paul Jackson
2007-10-16 5:12 ` Paul Menage
2007-10-16 5:20 ` Paul Jackson
2007-10-16 10:07 ` Paul Menage
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox