From: Paul Jackson <pj@sgi.com>
To: David Rientjes <rientjes@google.com>
Cc: menage@google.com, akpm@linux-foundation.org,
nickpiggin@yahoo.com.au, a.p.zijlstra@chello.nl,
serue@us.ibm.com, clg@fr.ibm.com, linux-kernel@vger.kernel.org,
ebiederm@xmission.com, svaidy@linux.vnet.ibm.com,
xemul@openvz.org, containers@lists.osdl.org,
balbir@linux.vnet.ibm.com
Subject: Re: [PATCH] task containersv11 add tasks file interface fix for cpusets
Date: Sat, 6 Oct 2007 12:59:04 -0700 [thread overview]
Message-ID: <20071006125904.a26ed99f.pj@sgi.com> (raw)
In-Reply-To: <alpine.DEB.0.9999.0710061047400.24311@chino.kir.corp.google.com>
David wrote:
> It would probably be better to just save references to the tasks.
>
> struct cgroup_iter it;
> struct task_struct *p, **tasks;
> int i = 0;
>
> cgroup_iter_start(cs->css.cgroup, &it);
> while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
> get_task_struct(p);
> tasks[i++] = p;
> }
> cgroup_iter_end(cs->css.cgroup, &it);
Hmmm ... guess I'd have to loop over the cgroup twice, once to count
them (the 'count' field is not claimed to be accurate) and then again,
after I've kmalloc'd the tasks[] array, filling in the tasks[] array.
On a big cgroup on a big system, this could easily be thousands of
iteration loops.
And I've have to drop the css_set_lock spinlock between the two loops,
since I can't hold a spinlock while calling kmalloc.
So then I'd have to be prepared for the possibility that the second
loop found more cgroups on the list than what I counted in the
first loop.
This is doable ... indeed I've done such before, in the code that
is now known as kernel/cgroup.c:cgroup_tasks_open(). Look for how
pidarray[] is setup.
And note that that code doesn't deal with the case that more cgroups
showed up after they were counted. When supporting the reading of the
'tasks' file by user code, this is ok - it's inherently racey anyway -
so not worth trying too hard just to close the window part way.
If I need to close the window all the way, completely solving the race
condition, then I have the code in kernel/cpuset.c:update_nodemask(),
which builds an mmarray[] using two loops and some retries if newly
forked tasks are showing up too rapidly at the same time. The first of
the two loops is hidden in the cgroup_task_count() call.
That's a bunch of code, mate. If some other solution was adequate
(no worse than the current situation, which forces user space to
rewrite every pid in the tasks file back to itself if they want
a 'cpus' change to actually be applied) but took much less code,
then I'd have to give it serious consideration, as I did before.
I don't mind a bunch of code, but kernel text has to earn its keep.
I'm not yet convinced that the above page or two of somewhat fussy code
(see again the code in kernel/cpuset.c:update_nodemask() ...) has
sufficient real user value per byte of kernel text space to justify its
existence.
... by the way ... tell me again why css_set_lock is a spinlock?
I didn't think it was such a good idea to hold a spinlock while
iterating over a major list, doing lord knows what (the loops
over cgroup_iter_next() do user provided code, as in this case.)
Shouldn't that be a mutex?
Or, if there is a good reason that must remain a spinlock, then the
smallest amount of new code, and the easiest code to write, would
perhaps be adding another cgroup callback, called only by cgroup attach
() requests back to the same group. Then code that wants to do
something odd, such as cpusets, for what seems like a no-op, can do so.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
next prev parent reply other threads:[~2007-10-06 19:59 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-10-03 8:42 [PATCH] task containersv11 add tasks file interface fix for cpusets Paul Jackson
2007-10-03 15:51 ` Paul Menage
2007-10-03 17:58 ` Paul Jackson
2007-10-03 18:10 ` Paul Menage
2007-10-03 18:25 ` Paul Menage
2007-10-03 20:16 ` Paul Jackson
2007-10-03 20:31 ` Paul Menage
2007-10-03 20:52 ` Paul Jackson
2007-10-03 20:58 ` Paul Menage
2007-10-06 8:24 ` Paul Jackson
2007-10-06 17:54 ` David Rientjes
2007-10-06 19:59 ` Paul Jackson [this message]
2007-10-06 21:09 ` Paul Menage
2007-10-06 21:41 ` Paul Jackson
2007-10-11 22:03 ` Paul Jackson
2007-10-11 23:20 ` Eric W. Biederman
2007-10-12 1:23 ` Paul Jackson
2007-10-07 6:13 ` David Rientjes
2007-10-06 21:11 ` Paul Menage
2007-10-07 6:15 ` David Rientjes
2007-10-10 20:46 ` Paul Menage
2007-10-10 20:59 ` David Rientjes
2007-10-11 23:15 ` Paul Jackson
2007-10-12 15:13 ` David Rientjes
2007-10-06 20:53 ` Paul Menage
2007-10-03 20:56 ` Paul Jackson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20071006125904.a26ed99f.pj@sgi.com \
--to=pj@sgi.com \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=balbir@linux.vnet.ibm.com \
--cc=clg@fr.ibm.com \
--cc=containers@lists.osdl.org \
--cc=ebiederm@xmission.com \
--cc=linux-kernel@vger.kernel.org \
--cc=menage@google.com \
--cc=nickpiggin@yahoo.com.au \
--cc=rientjes@google.com \
--cc=serue@us.ibm.com \
--cc=svaidy@linux.vnet.ibm.com \
--cc=xemul@openvz.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox