* [PATCH] task containersv11 add tasks file interface fix for cpusets
@ 2007-10-03 8:42 Paul Jackson
2007-10-03 15:51 ` Paul Menage
2007-10-03 20:56 ` Paul Jackson
0 siblings, 2 replies; 26+ messages in thread
From: Paul Jackson @ 2007-10-03 8:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, Peter Zijlstra, Serge E. Hallyn, Cedric Le Goater,
Paul Menage, linux-kernel, Eric W. Biederman, David Rientjes,
Vaidyanathan Srinivasan, Pavel Emelianov, containers,
Paul Jackson, balbir
From: Paul Jackson <pj@sgi.com>
The code in kernel/cgroup.c attach_task() which skips the
attachment of a task to the group it is already in has to be
removed. Cpusets depends on reattaching a task to its current
cpuset, in order to trigger updating the cpus_allowed mask in the
task struct.
The dependency of cpusets on this is a hack, granted, but an
important one. It lets us avoid checking for a changed cpuset
'cpus' setting in critical scheduler code paths.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
---
Andrew - this patch applies directly following the patch:
task-containersv11-add-tasks-file-interface.patch
kernel/cgroup.c | 3 ---
1 file changed, 3 deletions(-)
--- 2.6.23-rc8-mm1.orig/kernel/cgroup.c 2007-10-02 20:24:11.078925442 -0700
+++ 2.6.23-rc8-mm1/kernel/cgroup.c 2007-10-02 20:25:41.352279374 -0700
@@ -739,10 +739,7 @@ static int attach_task(struct cgroup *co
get_first_subsys(cont, NULL, &subsys_id);
- /* Nothing to do if the task is already in that cgroup */
oldcont = task_cgroup(tsk, subsys_id);
- if (cont == oldcont)
- return 0;
for_each_subsys(root, ss) {
if (ss->can_attach) {
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Menage @ 2007-10-03 15:51 UTC (permalink / raw)
To: Paul Jackson
Cc: Andrew Morton, Nick Piggin, Peter Zijlstra, Serge E. Hallyn,
    Cedric Le Goater, linux-kernel, Eric W. Biederman, David Rientjes,
    Vaidyanathan Srinivasan, Pavel Emelianov, containers, balbir

On 10/3/07, Paul Jackson <pj@sgi.com> wrote:
> From: Paul Jackson <pj@sgi.com>
>
> The code in kernel/cgroup.c attach_task() which skips the
> attachment of a task to the group it is already in has to be
> removed. Cpusets depends on reattaching a task to its current
> cpuset, in order to trigger updating the cpus_allowed mask in the
> task struct.

Can you explain a bit more about why this is needed? (i.e. specific
scenarios where cpusets will break without this change.)

What triggers the reattach in normal use? Something from userspace?

Paul
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-03 17:58 UTC (permalink / raw)
To: Paul Menage

Paul M wrote:
> > The code in kernel/cgroup.c attach_task() which skips the
> > attachment of a task to the group it is already in has to be
> > removed. Cpusets depends on reattaching a task to its current
> > cpuset, in order to trigger updating the cpus_allowed mask in the
> > task struct.
>
> Can you explain a bit more about why this is needed? (i.e. specific
> scenarios where cpusets will break without this change.)
>
> What triggers the reattach in normal use? Something from userspace?

Yes, something in user space has to do it. It's part of the
kernel-user cpuset API. If you change a cpuset's 'cpus', then you
have to rewrite each pid in its 'tasks' file back to that 'tasks'
file in order to get that 'cpus' change to be applied to the task
struct cpus_allowed of each task, and thereby visible to the
scheduler.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Menage @ 2007-10-03 18:10 UTC (permalink / raw)
To: Paul Jackson

On 10/3/07, Paul Jackson <pj@sgi.com> wrote:
>
> Yes, something in user space has to do it. It's part of the
> kernel-user cpuset API. If you change a cpuset's 'cpus', then
> you have to rewrite each pid in its 'tasks' file back to that
> 'tasks' file in order to get that 'cpus' change to be applied
> to the task struct cpus_allowed of each task, and thereby visible
> to the scheduler.

What's the rationale for this? Given that later in cpusets.txt you say:

> If hotplug functionality is used
> to remove all the CPUs that are currently assigned to a cpuset,
> then the kernel will automatically update the cpus_allowed of all
> tasks attached to CPUs in that cpuset to allow all CPUs

why can't the same thing be done when changing the 'cpus' file
manually?

What's wrong with, in update_cpumask(), doing a loop across all
members of the cgroup and updating their cpus_allowed fields?

The existing cpusets API is broken, since a new child can always be
forked between reading the tasks file and doing the writes.

Paul
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Menage @ 2007-10-03 18:25 UTC (permalink / raw)
To: Paul Jackson

On 10/3/07, Paul Menage <menage@google.com> wrote:
>
> What's wrong with, in update_cpumask(), doing a loop across all
> members of the cgroup and updating their cpus_allowed fields?

i.e. something like:

	struct cgroup_iter it;
	struct task_struct *p;
	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
		set_cpus_allowed(p, cs->cpus_allowed);
	}
	cgroup_iter_end(cs->css.cgroup, &it);

Paul
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-03 20:16 UTC (permalink / raw)
To: Paul Menage

Paul M wrote:
> Given that later in cpusets.txt you say:
>
> > If hotplug functionality is used
> > to remove all the CPUs that are currently assigned to a cpuset,
> > then the kernel will automatically update the cpus_allowed of all
> > tasks attached to CPUs in that cpuset to allow all CPUs
>
> why can't the same thing be done when changing the 'cpus' file manually?
>
> What's wrong with, in update_cpumask(), doing a loop across all
> members of the cgroup and updating their cpus_allowed fields?
>
> The existing cpusets API is broken, since a new child can always be
> forked between reading the tasks file and doing the writes.

Hmmm ... interesting.

Yes - that race, between fork and rewriting the tasks file, is there,
by design, and has been there for years. So far as I know, it is a
benign race. In practice, no one that I know has tripped over it.
This doesn't imply that we should not look for opportunities to fix
it, but it does mean it is not an urgent fix, in my experience.

But there may be more issues here than that:

 1) The above cpusets.txt text might be both incorrect (at first
    glance now, I suspect that it is the cpuset cpus_allowed that is
    updated on hotplug events, not the tasks' cpus_allowed) and
    perhaps stale as well (given some recent cpuset hotplug patches
    being worked on by Cliff Wickman.)

 2) It is not immediately clear to me how the hotplug code can work at
    present - if it updates just the cpuset's cpus_allowed but doesn't
    get to the tasks' cpus_allowed.

 3) So far as I can recall, the only kernel code path to update a
    task's cpus_allowed goes via the "set_cpus_allowed()" routine,
    which the kernel/cpuset.c code only calls from one place, that
    being the code that handles writing a pid to a cpuset 'tasks'
    file. This provides more evidence that (2) above is a problem --
    no hotplug code paths in the cpuset code seem to invoke this key
    "set_cpus_allowed()" routine.

Vedy vedy interesting. Like lifting up a rock and seeing a bunch of
weird creepy crawlies under it.

I'm wondering how to proceed from here, in two regards:

 1) The cc list on this thread is long and not well focused for the
    detailed discussion that it may take to unravel this.

 2) I am not optimistic that we can unravel this in time for the
    rapidly approaching start of 2.6.24.

At present, the only one of the bugs or potential bugs noted here that
is actually biting me is the bug this patch addressed; the current
(yes, knowingly broken) cpuset code requires the attach code to be run
even when reattaching a task to the same cgroup/cpuset to which it is
already attached. If the 2.6.24 kernel releases with this bug, such
that reattaching tasks to their current cpuset becomes a no-op, that
will hurt.

Does my patch (the one opening this thread) cause any problems that
you know of, Paul M? So far as I can tell, this patch just removes a
minor optimization, and other than its allowing a hack in the cpuset
code to persist (a side effect of what logically should be a no-op),
this patch should be unnoticeable to user land.

If that's so, can we allow this patch to proceed, likely for the
2.6.24 kernel series, while we work on killing off all the creepy
crawlies that your good questions have exposed?

And I'm tempted to continue this discussion on a new thread, as it
seems that it could be an involved discussion not well focused on the
interests of many on the current cc list of this message. There may be
more hotplug interest, and less container interest, than currently
seen on this thread's cc list.

Any further suggestions, or embarrassing (;) questions?

Thanks!

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Menage @ 2007-10-03 20:31 UTC (permalink / raw)
To: Paul Jackson

On 10/3/07, Paul Jackson <pj@sgi.com> wrote:
>
> So far as I can tell, this patch just removes a minor optimization,

It's not minor for any subsystem that has a non-trivial attach()
operation.

>
> Any further suggestions, or embarrassing (;) questions?

What was wrong with my suggestion from a couple of emails back?
Adding the following in cpuset_attach():

	struct cgroup_iter it;
	struct task_struct *p;
	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
		set_cpus_allowed(p, cs->cpus_allowed);
	}
	cgroup_iter_end(cs->css.cgroup, &it);

Paul
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-03 20:52 UTC (permalink / raw)
To: Paul Menage

> What was wrong with my suggestion from a couple of emails back?
> Adding the following in cpuset_attach():
>
>	struct cgroup_iter it;
>	struct task_struct *p;
>	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
>		set_cpus_allowed(p, cs->cpus_allowed);
>	}
>	cgroup_iter_end(cs->css.cgroup, &it);

Hmmm ... that just might work.

And this brings to light the reason (justification, excuse, whatever
you call it) that I probably didn't do this earlier. In the dark ages
before cgroups (aka containers) we did not have an efficient way to
walk the tasks in a cpuset. One had to walk the entire task list,
comparing task struct cpuset pointers. On big honking NUMA iron, one
should avoid task list walks as much as one can get away with, even if
it means sneaking in a slightly racey API. Since some updates of a
cpuset's 'cpus' mask don't need it (you happen to know that all tasks
in that cpuset are paused anyway), I might have made the tradeoff to
make this task list walk an explicitly invoked user action, to be done
only when needed.

But now (correct me if I'm wrong here) cgroups has a per-cgroup task
list, and the above loop has cost linear in the number of tasks
actually in the cgroup, plus (unfortunate but necessary and tolerable)
the cost of taking a global css_set_lock, right?

And I take it the above code snippet is missing the
cgroup_iter_start(), correct?

I'll ask Andrew to kill this patch of mine, and I will test out your
suggestion this evening.

This still leaves the other creepy crawlies involving cpusets and hot
plug that I glimpsed slithering by in my last message. I guess I'll
start a separate discussion with Cliff Wickman and whomever else I
think might want to be involved on those issues.

Nice work - thanks.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Menage @ 2007-10-03 20:58 UTC (permalink / raw)
To: Paul Jackson

On 10/3/07, Paul Jackson <pj@sgi.com> wrote:
>
> But now (correct me if I'm wrong here) cgroups has a per-cgroup task
> list, and the above loop has cost linear in the number of tasks
> actually in the cgroup, plus (unfortunate but necessary and tolerable)
> the cost of taking a global css_set_lock, right?

Yes.

>
> And I take it the above code snippet is missing the cgroup_iter_start(),
> correct?

Oops, yes.

Paul
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-06 8:24 UTC (permalink / raw)
To: Paul Menage

Paul Menage wrote:
> What was wrong with my suggestion from a couple of emails back?
> Adding the following in cpuset_attach():
>
>	struct cgroup_iter it;
>	struct task_struct *p;
>	cgroup_iter_start(cs->css.cgroup, &it);
>	while ((p = cgroup_iter_next(cs->css.cgroup, &it)))
>		set_cpus_allowed(p, cs->cpus_allowed);
>	cgroup_iter_end(cs->css.cgroup, &it);

This isn't working for me.

The key kernel routine for updating a task's cpus_allowed cannot be
called while holding a spinlock.

But the above loop holds a spinlock, css_set_lock, between the
cgroup_iter_start and the cgroup_iter_end.

I end up generating complaints of:

	BUG: scheduling while atomic

when I invoke the set_cpus_allowed() above.

Should css_set_lock be a mutex? Locking changes like that can be
risky.

Or perhaps there should be another callback, called only by attach()
requests back to the same group. Likely cpusets would be the only
subsystem interested in plugging in that callback.

That, or my original patch, which calls the attach routine even if
re-attaching to the current cgroup ...

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: David Rientjes @ 2007-10-06 17:54 UTC (permalink / raw)
To: Paul Jackson

On Sat, 6 Oct 2007, Paul Jackson wrote:

> This isn't working for me.
>
> The key kernel routine for updating a task's cpus_allowed
> cannot be called while holding a spinlock.
>
> But the above loop holds a spinlock, css_set_lock, between
> the cgroup_iter_start and the cgroup_iter_end.
>
> I end up generating complaints of:
>	BUG: scheduling while atomic
> when I invoke the set_cpus_allowed() above.
>
> Should css_set_lock be a mutex? Locking changes like that
> can be risky.

It would probably be better to just save references to the tasks.

	struct cgroup_iter it;
	struct task_struct *p, **tasks;
	int i = 0;

	cgroup_iter_start(cs->css.cgroup, &it);
	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
		get_task_struct(p);
		tasks[i++] = p;
	}
	cgroup_iter_end(cs->css.cgroup, &it);

	while (--i >= 0) {
		set_cpus_allowed(tasks[i], cs->cpus_allowed);
		put_task_struct(tasks[i]);
	}

The getting and putting of the tasks will prevent them from exiting or
being deallocated prematurely. But this is also a critical section
that will need to be protected by some mutex so it doesn't race with
other set_cpus_allowed() calls.

David
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-06 19:59 UTC (permalink / raw)
To: David Rientjes

David wrote:
> It would probably be better to just save references to the tasks.
>
>	struct cgroup_iter it;
>	struct task_struct *p, **tasks;
>	int i = 0;
>
>	cgroup_iter_start(cs->css.cgroup, &it);
>	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
>		get_task_struct(p);
>		tasks[i++] = p;
>	}
>	cgroup_iter_end(cs->css.cgroup, &it);

Hmmm ... guess I'd have to loop over the cgroup twice, once to count
the tasks (the 'count' field is not claimed to be accurate) and then
again, after I've kmalloc'd the tasks[] array, filling in the tasks[]
array.

On a big cgroup on a big system, this could easily be thousands of
loop iterations.

And I'd have to drop the css_set_lock spinlock between the two loops,
since I can't hold a spinlock while calling kmalloc. So then I'd have
to be prepared for the possibility that the second loop found more
tasks on the list than what I counted in the first loop.

This is doable ... indeed I've done such before, in the code that is
now known as kernel/cgroup.c:cgroup_tasks_open(). Look for how
pidarray[] is set up. And note that that code doesn't deal with the
case that more tasks showed up after they were counted. When
supporting the reading of the 'tasks' file by user code, this is ok -
it's inherently racey anyway - so not worth trying too hard just to
close the window part way.

If I need to close the window all the way, completely solving the race
condition, then I have the code in kernel/cpuset.c:update_nodemask(),
which builds an mmarray[] using two loops and some retries if newly
forked tasks are showing up too rapidly at the same time. The first of
the two loops is hidden in the cgroup_task_count() call.

That's a bunch of code, mate.

If some other solution were adequate (no worse than the current
situation, which forces user space to rewrite every pid in the tasks
file back to itself if they want a 'cpus' change to actually be
applied) but took much less code, then I'd have to give it serious
consideration, as I did before.

I don't mind a bunch of code, but kernel text has to earn its keep.
I'm not yet convinced that the above page or two of somewhat fussy
code (see again the code in kernel/cpuset.c:update_nodemask() ...)
has sufficient real user value per byte of kernel text space to
justify its existence.

... by the way ... tell me again why css_set_lock is a spinlock?
I didn't think it was such a good idea to hold a spinlock while
iterating over a major list, doing lord knows what (the loops over
cgroup_iter_next() run user provided code, as in this case.)
Shouldn't that be a mutex?

Or, if there is a good reason that it must remain a spinlock, then the
smallest amount of new code, and the easiest code to write, would
perhaps be adding another cgroup callback, called only by cgroup
attach() requests back to the same group. Then code that wants to do
something odd, such as cpusets, for what seems like a no-op, can do
so.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Menage @ 2007-10-06 21:09 UTC (permalink / raw)
To: Paul Jackson

On 10/6/07, Paul Jackson <pj@sgi.com> wrote:
> David wrote:
> > It would probably be better to just save references to the tasks.
>
> Hmmm ... guess I'd have to loop over the cgroup twice, once to count
> them (the 'count' field is not claimed to be accurate) and then again,
> after I've kmalloc'd the tasks[] array, filling in the tasks[] array.
>
> On a big cgroup on a big system, this could easily be thousands of
> iteration loops.

But if userspace has to do it, the effect will be far more expensive.

>
> If I need to close the window all the way, completely solving the race
> condition, then I have the code in kernel/cpuset.c:update_nodemask(),
> which builds an mmarray[] using two loops and some retries if newly
> forked tasks are showing up too rapidly at the same time. The first of
> the two loops is hidden in the cgroup_task_count() call.

In general, the loop inside cgroup_task_count() will only have a
single iteration. (It's iterating across the shared css_set objects,
not across member tasks.)

What's wrong with:

	allocate a page of task_struct pointers
 again:
	need_repeat = false;
	cgroup_iter_start();
	while (cgroup_iter_next()) {
		if (p->cpus_allowed != new_cpumask) {
			store p;
			if (page is full) {
				need_repeat = true;
				break;
			}
		}
	}
	for each saved task p {
		set_cpus_allowed(p, new_cpumask);
		release p;
	}
	if (need_repeat)
		goto again;

Advantages:

- no vmalloc needed

- just one iteration in the case where a cgroup has fewer than 512
  members

- additional iterations only need to deal with tasks that don't have
  the right cpu mask

- automatically handles fork races

Another option would be to have a cpuset fork callback that forces
p->cpus_allowed to its cpuset's cpus_allowed if another thread is in
the middle of update_cpumask(). Then you don't need to worry about
fork races at all, since all new threads will get the right cpumask.

> Or, if there is a good reason that must remain a spinlock, then the
> smallest amount of new code, and the easiest code to write, would
> perhaps be adding another cgroup callback, called only by cgroup attach
> () requests back to the same group. Then code that wants to do
> something odd, such as cpusets, for what seems like a no-op, can do so.

I'd much rather not perpetuate that broken API requirement. The fact
that cpusets wants this odd behaviour is based on a nasty hack.

Paul
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-06 21:41 UTC (permalink / raw)
To: Paul Menage

Paul M wrote:
>
> What's wrong with:
>
>	allocate a page of task_struct pointers
> again:
>	need_repeat = false;
>	cgroup_iter_start();
>	while (cgroup_iter_next()) {
>		if (p->cpus_allowed != new_cpumask) {
>			store p;
>			if (page is full) {
>				need_repeat = true;
>				break;
>			}
>		}
>	}
>	for each saved task p {
>		set_cpus_allowed(p, new_cpumask);
>		release p;
>	}
>	if (need_repeat)
>		goto again;

That might work ... nice idea there, comparing the two masks, so one
only needs to store the ones not yet fixed.

Unfortunately, I need to put this aside until I return, in four days,
from a short trip. Given that cgroups is targeted for 2.6.24, and that
cpusets is broken without this, I'll have to do something soon. But
this, or some such, should work, soon enough.

Thanks.

> I'd much rather not perpetuate that broken API requirement. The fact
> that cpusets wants this odd behaviour is based on a nasty hack.

Well ... yeah ... it's a bit of an ugly child. But I'm its daddy.
The kid looks fine in my eyes. <grin>

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-11 22:03 UTC (permalink / raw)
To: Paul Menage

Several days ago, Paul M replied to Paul J:
> > Hmmm ... guess I'd have to loop over the cgroup twice, once to count
> > them (the 'count' field is not claimed to be accurate) and then again,
> > after I've kmalloc'd the tasks[] array, filling in the tasks[] array.
> >
> > On a big cgroup on a big system, this could easily be thousands of
> > iteration loops.
>
> But if userspace has to do it, the effect will be far more expensive.

Ah - but I'm not trying to optimize this particular operation
(changing a cpuset's 'cpus'). It's not at all performance critical.

I'm trying to minimize the amount of special purpose code in the
kernel. The maintenance costs of a line of kernel code are quite a bit
higher than for a line of user code. I work hard to have most of my
lines of kernel code be on well traveled code paths, of general
usefulness, even if this means that some infrequent operations require
yet more user source code lines and user CPU cycles, in order to be
refactored as the combination of multiple system call primitives.

... all within reasonable limits, of course.

Corner case, special situation, non-trivial chunks of kernel code are
very expensive. They don't get very good testing coverage in the real
world, and end up harboring latent bugs for months or years, by which
time it can be expensive to deal with them.

Be that as it may, I've just started digesting the actual code
suggestions posted by yourself and David (thanks!) this last week.
I just couldn't resist a bit of philosophizing ... sorry.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Eric W. Biederman @ 2007-10-11 23:20 UTC (permalink / raw)
To: Paul Jackson

Stupid question. Would it help at all if we structured this as:

- Take the control group off of the cpus and runqueues.
- Modify the tasks in the control group.
- Place the control group back on the runqueues.

Essentially this is what ptrace does (except for one task at a time).

Since we know that the tasks are not running, and that we have
exclusive access to the tasks in the control group, we can take action
as if we were the actual tasks themselves. Which should simplify
locking.

So I think with just a touch of care we can significantly simplify the
locking by being much more coarse grained, at the expense of this kind
of control group operation being more expensive. Plus we would have an
operation that was reusable.

I think the stop-everything, modify-everything, start-everything
approach could also be used instead of an array of task structures.

Eric
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
From: Paul Jackson @ 2007-10-12 1:23 UTC (permalink / raw)
To: Eric W. Biederman

> Since we know that the tasks are not running, and that we have
> exclusive access to the tasks in the control group we can take action
> as if we were the actual tasks themselves. Which should simplify
> locking.

The Big Kernel Lock (BKL), born again, as a Medium Sized Cgroup Lock?

This only simplifies things if it enables us to remove finer grain
locking, but some finer grain locking is essential for performance on
higher processor count systems (which, if Intel and AMD have their
way, will be just about any system bigger than a cell phone.)

There is no escaping actually having to think about these things, and
clearly understand and document what locks what.

Locks don't just guard code sections, and they don't just guard
particular data items. Rather, in my view, they ensure certain
invariants on your data. Non-atomic code sections that depend on, or
modify, data subject to such invariants must be done while holding the
appropriate lock. One is guaranteed not to see the invariant violated
while holding the lock, nor to expose others to temporary violations
of the invariant done while locked.

In this case, we have multiple copies of cpumasks in task structs and
cpusets that must honor the invariant that they are equal to, or
subsets of, cpu_online_map. Changes to the set of online CPUs must
hold some lock, apparently sched_hotcpu_mutex, until all those
cpumasks are adjusted to once again honor this invariant.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-06 19:59                     ` Paul Jackson
  2007-10-06 21:09                       ` Paul Menage
@ 2007-10-07  6:13                       ` David Rientjes
  1 sibling, 0 replies; 26+ messages in thread
From: David Rientjes @ 2007-10-07  6:13 UTC (permalink / raw)
  To: Paul Jackson
  Cc: menage, akpm, nickpiggin, a.p.zijlstra, serue, clg, linux-kernel,
      ebiederm, svaidy, xemul, containers, balbir

On Sat, 6 Oct 2007, Paul Jackson wrote:

> > 	struct cgroup_iter it;
> > 	struct task_struct *p, **tasks;
> > 	int i = 0;
> >
> > 	cgroup_iter_start(cs->css.cgroup, &it);
> > 	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
> > 		get_task_struct(p);
> > 		tasks[i++] = p;
> > 	}
> > 	cgroup_iter_end(cs->css.cgroup, &it);
>
> Hmmm ... guess I'd have to loop over the cgroup twice, once to count
> them (the 'count' field is not claimed to be accurate) and then again,
> after I've kmalloc'd the tasks[] array, filling in the tasks[] array.
>

Use the struct task_struct **tasks above.  Any call here should only be
about 300 bytes into the stack so, with 8K stacks on x86_64, you should
be able to store pointers to a little under 1,000 tasks before
corrupting it.

Or, if you think there's going to be more than 1,000 tasks in this
cgroup, just do an arbitrary number of them and then reuse your stack
before calling set_cpus_allowed() and put_task_struct() on all tasks
you've collected so far; then finally call cgroup_iter_end() when
you're done with all of them.

Use the task_struct usage counter as your locking so that tasks don't
prematurely exit or deallocate before you migrate them.

Keep in mind that simply iterating over the tasks and using
set_cpus_allowed() isn't good enough not only because you're holding a
spinlock but also because the task can exit during the call, unless
you're using some sort of cgroups locking I'm not aware of that
prevents that.  My solution prevents that with {get,put}_task_struct().

The current update_cpu_domains(), or -mm's equivalent of it with this
patchset, cannot call lock_cpu_hotplug() with callback_mutex held, so
it'll need to be unlocked after all the tasks have been put.  If you're
preempted here before update_cpu_domains(), attach_task() will take
care of the cpus_allowed reassignment.

		David

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-06 17:54                   ` David Rientjes
  2007-10-06 19:59                     ` Paul Jackson
@ 2007-10-06 21:11                     ` Paul Menage
  2007-10-07  6:15                       ` David Rientjes
  1 sibling, 1 reply; 26+ messages in thread
From: Paul Menage @ 2007-10-06 21:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Jackson, Andrew Morton, nickpiggin, a.p.zijlstra, serue, clg,
      linux-kernel, ebiederm, svaidy, xemul, containers, balbir

On 10/6/07, David Rientjes <rientjes@google.com> wrote:
> The getting and putting of the tasks will prevent them from exiting or
> being deallocated prematurely.  But this is also a critical section that
> will need to be protected by some mutex so it doesn't race with other
> set_cpus_allowed().

Is that necessary?  If some other process calls set_cpus_allowed()
concurrently with a cpuset cpus update, it's not clear that there's
any defined serialization semantics that have to be achieved, as long
as the end result is that the task's cpus_allowed are within the
cpuset's cpus_allowed.

Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-06 21:11                     ` Paul Menage
@ 2007-10-07  6:15                       ` David Rientjes
  2007-10-10 20:46                         ` Paul Menage
  0 siblings, 1 reply; 26+ messages in thread
From: David Rientjes @ 2007-10-07  6:15 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, Andrew Morton, nickpiggin, a.p.zijlstra, serue, clg,
      linux-kernel, ebiederm, svaidy, xemul, containers, balbir

On Sat, 6 Oct 2007, Paul Menage wrote:

> > The getting and putting of the tasks will prevent them from exiting or
> > being deallocated prematurely.  But this is also a critical section that
> > will need to be protected by some mutex so it doesn't race with other
> > set_cpus_allowed().
>
> Is that necessary?  If some other process calls set_cpus_allowed()
> concurrently with a cpuset cpus update, it's not clear that there's
> any defined serialization semantics that have to be achieved, as long
> as the end result is that the task's cpus_allowed are within the
> cpuset's cpus_allowed.
>

It can race with sched_setaffinity().  It has to give up tasklist_lock
as well to call set_cpus_allowed() and can race

	cpus_allowed = cpuset_cpus_allowed(p);
	cpus_and(new_mask, new_mask, cpus_allowed);
	retval = set_cpus_allowed(p, new_mask);

and allow a task to have a cpu outside of the cpuset's new cpus_allowed
if you've taken it away between cpuset_cpus_allowed() and
set_cpus_allowed().

		David

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-07  6:15                       ` David Rientjes
@ 2007-10-10 20:46                         ` Paul Menage
  2007-10-10 20:59                           ` David Rientjes
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Menage @ 2007-10-10 20:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Jackson, Andrew Morton, nickpiggin, a.p.zijlstra, serue, clg,
      linux-kernel, ebiederm, svaidy, xemul, containers, balbir

On 10/6/07, David Rientjes <rientjes@google.com> wrote:
>
> It can race with sched_setaffinity().  It has to give up tasklist_lock as
> well to call set_cpus_allowed() and can race
>
> 	cpus_allowed = cpuset_cpus_allowed(p);
> 	cpus_and(new_mask, new_mask, cpus_allowed);
> 	retval = set_cpus_allowed(p, new_mask);
>
> and allow a task to have a cpu outside of the cpuset's new cpus_allowed if
> you've taken it away between cpuset_cpus_allowed() and set_cpus_allowed().

cpuset_cpus_allowed() takes callback_mutex, which is held by
update_cpumask() when it updates cs->cpus_allowed.  So if we continue
to hold callback_mutex across the task update loop this wouldn't be a
race.  Having said that, holding callback mutex for that long might not
be a good idea.

A cleaner solution might be to drop callback_mutex after updating
cs->cpus_allowed in update_cpumask() and then make sched_setaffinity()
do a post-check:

	cpus_allowed = cpuset_cpus_allowed(p);
 again:
	cpus_and(new_mask, new_mask, cpus_allowed);
	retval = set_cpus_allowed(p, new_mask);
	if (!retval) {
		/* Check for races with cpuset updates */
		cpus_allowed = cpuset_cpus_allowed(p);
		if (!cpus_subset(new_mask, cpus_allowed)) {
			/*
			 * We raced with a change to cpuset update,
			 * and our cpumask is now outside the
			 * permitted cpumask for the cpuset.  Since a
			 * change to the cpuset's cpus resets the
			 * cpumask for each task, do the same thing
			 * here.
			 */
			new_mask = cpus_allowed;
			goto again;
		}
	}

Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-10 20:46                         ` Paul Menage
@ 2007-10-10 20:59                           ` David Rientjes
  2007-10-11 23:15                             ` Paul Jackson
  0 siblings, 1 reply; 26+ messages in thread
From: David Rientjes @ 2007-10-10 20:59 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, Andrew Morton, nickpiggin, a.p.zijlstra, serue, clg,
      linux-kernel, ebiederm, svaidy, xemul, containers, balbir

On Wed, 10 Oct 2007, Paul Menage wrote:

> On 10/6/07, David Rientjes <rientjes@google.com> wrote:
> >
> > It can race with sched_setaffinity().  It has to give up tasklist_lock as
> > well to call set_cpus_allowed() and can race
> >
> > 	cpus_allowed = cpuset_cpus_allowed(p);
> > 	cpus_and(new_mask, new_mask, cpus_allowed);
> > 	retval = set_cpus_allowed(p, new_mask);
> >
> > and allow a task to have a cpu outside of the cpuset's new cpus_allowed if
> > you've taken it away between cpuset_cpus_allowed() and set_cpus_allowed().
>
> cpuset_cpus_allowed() takes callback_mutex, which is held by
> update_cpumask() when it updates cs->cpus_allowed.  So if we continue
> to hold callback_mutex across the task update loop this wouldn't be a
> race.  Having said that, holding callback mutex for that long might not
> be a good idea.
>

Moving the actual use of set_cpus_allowed() from the cgroup code to
sched.c would probably be the best in terms of a clean interface.  Then
you could protect the whole thing by sched_hotcpu_mutex, which is
expressly designed for migrations.

Something like this:

	struct cgroup_iter it;
	struct task_struct *p, **tasks;
	int i = 0;

	cgroup_iter_start(cs->css.cgroup, &it);
	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
		get_task_struct(p);
		tasks[i++] = p;
	}
	cgroup_iter_end(cs->css.cgroup, &it);

	while (--i >= 0) {
		sched_migrate_task(tasks[i], cs->cpus_allowed);
		put_task_struct(tasks[i]);
	}

kernel/sched.c:

	void sched_migrate_task(struct task_struct *task,
				cpumask_t cpus_allowed)
	{
		mutex_lock(&sched_hotcpu_mutex);
		set_cpus_allowed(task, cpus_allowed);
		mutex_unlock(&sched_hotcpu_mutex);
	}

It'd be faster to just pass **tasks to a sched.c function with the
number of tasks to migrate to reduce the contention on
sched_hotcpu_mutex, but then the put_task_struct() is left dangling
over in cgroup code.  sched_hotcpu_mutex should be rarely contended,
anyway.

		David

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-10 20:59                           ` David Rientjes
@ 2007-10-11 23:15                             ` Paul Jackson
  2007-10-12 15:13                               ` David Rientjes
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Jackson @ 2007-10-11 23:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: menage, akpm, nickpiggin, a.p.zijlstra, serue, clg, linux-kernel,
      ebiederm, svaidy, xemul, containers, balbir

David wrote:
> you could protect the whole thing by sched_hotcpu_mutex, which is
> expressly designed for migrations.
>
> Something like this:
>
> 	struct cgroup_iter it;
> 	struct task_struct *p, **tasks;
> 	int i = 0;
>
> 	cgroup_iter_start(cs->css.cgroup, &it);
> 	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
> 		get_task_struct(p);
> 		tasks[i++] = p;
> 	}
> 	cgroup_iter_end(cs->css.cgroup, &it);
>
> 	while (--i >= 0) {
> 		sched_migrate_task(tasks[i], cs->cpus_allowed);
> 		put_task_struct(tasks[i]);
> 	}
>
> kernel/sched.c:
>
> 	void sched_migrate_task(struct task_struct *task,
> 				cpumask_t cpus_allowed)
> 	{
> 		mutex_lock(&sched_hotcpu_mutex);
> 		set_cpus_allowed(task, cpus_allowed);
> 		mutex_unlock(&sched_hotcpu_mutex);
> 	}

Hmmm ... I hadn't noticed that sched_hotcpu_mutex before.

I wonder what it is guarding?  As best as I can guess, it seems, at
least in part, to be keeping the following two items consistent:
 1) cpu_online_map
 2) the per-task cpus_allowed masks

That is, it seems to ensure that a task is allowed to run on some
online CPU.

If that's approximately true, then shouldn't I take sched_hotcpu_mutex
around the entire chunk of code that handles updating a cpusets 'cpus',
from the time it verifies that the requested CPUs are online, until the
time that every affected task has its cpus_allowed updated?

Furthermore, I should probably guard changes to and verifications
against the top_cpuset's cpus_allowed with this mutex as well, as it is
supposed to be a copy of cpu_online_map.

And since all descendent cpusets have to have 'cpus' masks that are
subsets of their parents, this means guarding other chunks of cpuset
code that depend on the consistency of various per-cpuset cpus_allowed
masks and cpu_online_map.

My current intuition is that this expanded use of sched_hotcpu_mutex in
the cpuset code involving various cpus_allowed masks would be a good
thing.

In sum, perhaps sched_hotcpu_mutex is guarding the dispersed kernel
state that depends on what CPUs are online.  This includes the per-task
and per-cpuset cpus_allowed masks, all of which are supposed to be some
non-empty subset of the online CPUs.

Taking and dropping the sched_hotcpu_mutex for each task, just around
the call to set_cpus_allowed(), as you suggested above, doesn't seem to
accomplish much that I can see, and certainly doesn't seem to guard the
consistency of cpu_online_map with the tasks cpus_allowed masks.

... lurkers beware ... good chance I haven't a friggin clue ;).

In other words: Thirty minutes ago I couldn't even spell
sched_hotcpu_mutex, and now I'm pontificating on it ;);).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-11 23:15                             ` Paul Jackson
@ 2007-10-12 15:13                               ` David Rientjes
  0 siblings, 0 replies; 26+ messages in thread
From: David Rientjes @ 2007-10-12 15:13 UTC (permalink / raw)
  To: Paul Jackson
  Cc: menage, akpm, nickpiggin, a.p.zijlstra, serue, clg, linux-kernel,
      ebiederm, svaidy, xemul, containers, balbir

On Thu, 11 Oct 2007, Paul Jackson wrote:

> Hmmm ... I hadn't noticed that sched_hotcpu_mutex before.
>
> I wonder what it is guarding?  As best as I can guess, it seems, at
> least in part, to be keeping the following two items consistent:
>  1) cpu_online_map

Yes, it protects against cpu hot-plug or hot-unplug; cpu_online_map is
guaranteed to be unchanged while the mutex is being held.

>  2) the per-task cpus_allowed masks
>

It doesn't need to protect the per-task cpus_allowed per se, that's
already protected.  If a task's cpu affinity changes during a call to
set_cpus_allowed(), the migration thread will notice the change when it
tries to deactivate the task and activate it on the destination cpu.
It then becomes a no-op.

That's a consequence of the fact that we can't migrate current and need
a kthread, particularly the source cpu's runqueue migration thread, to
do it when it's scheduled.  A migration request such as that includes a
completion variable so that the set_cpus_allowed() waits until it has
either been migrated or changed cpu affinity again.

> That is, it seems to ensure that a task is allowed to run on some
> online CPU.
>

Right, the destination cpu will not be hot-unplugged out from
underneath the task during migration.

> If that's approximately true, then shouldn't I take sched_hotcpu_mutex
> around the entire chunk of code that handles updating a cpusets 'cpus',
> from the time it verifies that the requested CPUs are online, until the
> time that every affected task has its cpus_allowed updated?
>

Not necessarily, you can iterate through a list of tasks and change
their cpu affinity (represented by task->cpus_allowed) by migrating
them away while task->cpuset->cpus_allowed remains unchanged.  The
hotcpu notifier cpuset_handle_cpuhp() will update that when necessary
for cpu hot-plug or hot-unplug events.

So it's entirely possible that a cpu will be downed during your
iteration of tasks, but that's fine.  Just as long as it isn't downed
during the migration.  The cpuset's cpus_allowed will be updated by the
hotcpu notifier and sched_hotcpu_mutex will protect from unplugged cpus
around the set_cpus_allowed() call, which checks for intersection
between your new cpumask and cpu_online_map.

> Furthermore, I should probably guard changes to and verifications
> against the top_cpuset's cpus_allowed with this mutex as well, as it is
> supposed to be a copy of cpu_online_map.
>

The hotcpu notifier protects you there as well.
common_cpu_mem_hotplug_unplug() explicitly sets them.

> And since all descendent cpusets have to have 'cpus' masks that are
> subsets of their parents, this means guarding other chunks of cpuset
> code that depend on the consistency of various per-cpuset cpus_allowed
> masks and cpu_online_map.
>

Same as above, except now you're using
guarantee_online_cpus_mems_in_subtree().

> My current intuition is that this expanded use of sched_hotcpu_mutex in
> the cpuset code involving various cpus_allowed masks would be a good
> thing.
>
> In sum, perhaps sched_hotcpu_mutex is guarding the dispersed kernel
> state that depends on what CPUs are online.  This includes the per-task
> and per-cpuset cpus_allowed masks, all of which are supposed to be some
> non-empty subset of the online CPUs.
>

It guards cpu_online_map from being changed while it's held.

> Taking and dropping the sched_hotcpu_mutex for each task, just around
> the call to set_cpus_allowed(), as you suggested above, doesn't seem to
> accomplish much that I can see, and certainly doesn't seem to guard the
> consistency of cpu_online_map with the tasks cpus_allowed masks.
>

It's needed to serialize with other migrations such as
sched_setaffinity() and you can use it since all migrations will
inherently need this type of protection.  It makes the new cpumask
consistent with cpu_online_map only so far as that it's a subset;
otherwise, set_cpus_allowed() will fail.  The particular destination
cpu is chosen as any online cpu, which we know won't be downed because
we're holding sched_hotcpu_mutex.

		David

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-06  8:24                 ` Paul Jackson
  2007-10-06 17:54                   ` David Rientjes
@ 2007-10-06 20:53                   ` Paul Menage
  1 sibling, 0 replies; 26+ messages in thread
From: Paul Menage @ 2007-10-06 20:53 UTC (permalink / raw)
  To: Paul Jackson
  Cc: akpm, nickpiggin, a.p.zijlstra, serue, clg, linux-kernel,
      ebiederm, rientjes, svaidy, xemul, containers, balbir

On 10/6/07, Paul Jackson <pj@sgi.com> wrote:
>
> This isn't working for me.
>
> The key kernel routine for updating a tasks cpus_allowed
> cannot be called while holding a spinlock.
>
> But the above loop holds a spinlock, css_set_lock, between
> the cgroup_iter_start and the cgroup_iter_end.
>
> I end up generating complaints of:
>     BUG: scheduling while atomic
> when I invoke the set_cpus_allowed() above.
>
> Should css_set_lock be a mutex?  Locking changes like that
> can be risky.

css_set_lock is an rwlock currently; I'd rather not turn it into an rw
mutex since there are places that it gets taken where we can't afford
to sleep.

>
> Or perhaps there should be another callback, called only by
> attach() requests back to the same group.  Likely cpusets would
> be the only subsystem interested in plugging that callback.
>
> That, or my original patch, which calls the attach routine
> even if re-attaching to the current cgroup ...

I'd prefer David's solution of grabbing references to tasks during the
iteration and then doing set_cpus_allowed outside the tasklist_lock.

Paul

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
  2007-10-03  8:42 [PATCH] task containersv11 add tasks file interface fix for cpusets Paul Jackson
  2007-10-03 15:51 ` Paul Menage
@ 2007-10-03 20:56 ` Paul Jackson
  1 sibling, 0 replies; 26+ messages in thread
From: Paul Jackson @ 2007-10-03 20:56 UTC (permalink / raw)
  To: Paul Jackson
  Cc: akpm, nickpiggin, a.p.zijlstra, serue, clg, menage, linux-kernel,
      ebiederm, rientjes, svaidy, xemul, containers, balbir

Andrew - please kill this patch.

Looks like Paul Menage has a better solution that I will be trying out.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2007-10-12 15:14 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2007-10-03  8:42 [PATCH] task containersv11 add tasks file interface fix for cpusets Paul Jackson
2007-10-03 15:51 ` Paul Menage
2007-10-03 17:58   ` Paul Jackson
2007-10-03 18:10     ` Paul Menage
2007-10-03 18:25       ` Paul Menage
2007-10-03 20:16         ` Paul Jackson
2007-10-03 20:31           ` Paul Menage
2007-10-03 20:52             ` Paul Jackson
2007-10-03 20:58               ` Paul Menage
2007-10-06  8:24                 ` Paul Jackson
2007-10-06 17:54                   ` David Rientjes
2007-10-06 19:59                     ` Paul Jackson
2007-10-06 21:09                       ` Paul Menage
2007-10-06 21:41                         ` Paul Jackson
2007-10-11 22:03                           ` Paul Jackson
2007-10-11 23:20                             ` Eric W. Biederman
2007-10-12  1:23                               ` Paul Jackson
2007-10-07  6:13                       ` David Rientjes
2007-10-06 21:11                     ` Paul Menage
2007-10-07  6:15                       ` David Rientjes
2007-10-10 20:46                         ` Paul Menage
2007-10-10 20:59                           ` David Rientjes
2007-10-11 23:15                             ` Paul Jackson
2007-10-12 15:13                               ` David Rientjes
2007-10-06 20:53                   ` Paul Menage
2007-10-03 20:56 ` Paul Jackson