From: Nick Piggin <nickpiggin@yahoo.com.au>
To: hawkes@sgi.com
Cc: Andrew Morton <akpm@osdl.org>, Ingo Molnar <mingo@elte.hu>,
Jack Steiner <steiner@sgi.com>, Paul Jackson <pj@sgi.com>,
linux-kernel@vger.kernel.org, John Hawkes <jrhawkes@yahoo.com>
Subject: Re: [PATCH] build sched domains tracking cpusets
Date: Fri, 07 Jul 2006 18:38:33 +1000
Message-ID: <44AE1D89.20608@yahoo.com.au>
In-Reply-To: <20060706234356.23106.60834.sendpatchset@tomahawk.engr.sgi.com>
Hi John,
Cool patch. I have a few comments.
hawkes@sgi.com wrote:
> This patch introduces the notion of dynamic sched domains (and sched
> groups) that track the creation and deletion of cpusets. Up to eight of
> these dynamic sched domains can exist per CPU, which seems to handle the
> observed use of cpusets by one popular job manager. Essentially, every
> new cpuset causes the creation of a new sched domain for the CPUs of
> that cpuset, and a deletion of the cpuset causes the destruction of that
> dynamic sched domain. New sched domains are inserted into each CPU's
> list as appropriate. The limit of 8/CPU (vs. unlimited) not only caps
> the upper bound of kmalloc memory being assigned to sched domain and
> sched group structs (thereby avoiding a potential denial-of-service
> attack by a runaway user creating cpusets), but it also caps the number
> of sched domains being searched by various algorithms in the scheduler.
>
> A "feature" of this implementation is that these dynamic sched domains
> build up until that limit of 8/CPU is reached, at which point no
> additional sched domains are created; or until a cpu-exclusive cpuset
> gets declared, at which point all the dynamically created sched domains
> in the affected CPUs get destroyed. A more sophisticated algorithm in
> kernel/cpuset.c could redeclare the dynamic sched domains after a
> cpu-exclusive cpuset gets declared, but that is outside the scope of
> this simple patch's simple change to kernel/cpuset.c.
>
> Signed-off-by: John Hawkes <hawkes@sgi.com>
>
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h 2006-06-17 18:49:35.000000000 -0700
> +++ linux/include/linux/sched.h 2006-06-20 13:50:19.092284572 -0700
> @@ -558,6 +558,7 @@ enum idle_type
> #define SD_WAKE_AFFINE 32 /* Wake task to waking CPU */
> #define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */
> #define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */
> +#define SD_TRACKS_CPUSET 4096 /* Span matches non-exclusive cpuset */
I'd just use the next bit, and we should convert these to hex as
well, I guess... I don't know what I was thinking ;)
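Something like this, say (just a sketch of what I mean, taking the next
free bit after SD_SHARE_CPUPOWER):

#define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
#define SD_TRACKS_CPUSET	0x0100	/* Span matches non-exclusive cpuset */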
>
> struct sched_group {
> struct sched_group *next; /* Must be a circular list */
> @@ -630,6 +631,9 @@ struct sched_domain {
> extern void partition_sched_domains(cpumask_t *partition1,
> cpumask_t *partition2);
>
> +extern void add_sched_domain(const cpumask_t *cpu_map);
> +extern void destroy_sched_domain(const cpumask_t *cpu_map);
Not a big deal, but I'd probably be happier with a single hook from
the cpusets code, which does the right thing depending on whether it
is exclusive or not.
Hard to know which way around the layering should go, but I think that
it is simply a hint to the balancer code to do something nice, and as
such we should leave it completely to the scheduler.
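Roughly this sort of thing, perhaps (untested sketch; the hook name is
made up):

/*
 * Single entry point from cpusets; the scheduler decides what, if
 * anything, to build for this cpuset's CPUs.
 */
void sched_domain_cpuset_changed(const cpumask_t *cpu_map, int cpu_exclusive)
{
	if (!cpu_exclusive) {
		add_sched_domain(cpu_map);
		return;
	}
	/* exclusive cpusets keep going through the existing
	 * partition_sched_domains() path */
}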
> +#ifdef CONFIG_CPUSETS
CPUSETS only? Or do we want to try to help sched_setaffinity users as
well?
> +
> +struct sched_domain_bundle {
> + struct sched_domain sd_cpuset;
> + struct sched_group **sg_cpuset_nodes;
> + int use_count;
> +};
Can you get away without using the bundle? Or does it make things easier?
You could add a new use_count to struct sched_domain if you'd like.... does
it make the setup/teardown too difficult?
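i.e. something like this (sketch only; whether it works out depends on
the sched_group setup/teardown you mention):

	struct sched_domain {
		...
		unsigned int use_count;	/* cpusets sharing this span */
	};

	static DEFINE_PER_CPU(struct sched_domain[SCHED_DOMAIN_CPUSET_MAX],
			      sd_cpusets);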
> +#define SCHED_DOMAIN_CPUSET_MAX 8 /* power of 2 */
> +static DEFINE_PER_CPU(long, sd_cpusets_used) = { 1UL << SCHED_DOMAIN_CPUSET_MAX };
> +static DEFINE_PER_CPU(struct sched_domain_bundle[SCHED_DOMAIN_CPUSET_MAX],
> + sd_cpusets_bundle);
Do we need a limit of 8? If you're worried about memory, I guess there are
lots of ways to DoS the system... if you're worried about scheduler balancing
overhead, we could do a separate traversal of these guys at a reduced
interval (I guess that still leaves some places needing help, though).
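e.g. in rebalance_tick(), something vaguely like this (very rough; the
multiplier is a made-up knob):

	interval = sd->balance_interval;
	if (sd->flags & SD_TRACKS_CPUSET)
		interval *= CPUSET_BALANCE_MULT;	/* hypothetical */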
> +static int find_existing_sched_domain(int cpu, const cpumask_t *cpu_map)
Some way to distinguish these guys as operating on your "bundle" stack
would probably make it a bit clearer?
> +{
> + int sd_idx;
> +
> + for (sd_idx = 0; sd_idx < SCHED_DOMAIN_CPUSET_MAX; sd_idx++) {
> + struct sched_domain_bundle *sd_cpuset_bundle;
> + struct sched_domain *sd;
> +
> + if (!test_bit(sd_idx, &per_cpu(sd_cpusets_used, cpu)))
> + continue;
> + sd_cpuset_bundle = &per_cpu(sd_cpusets_bundle[sd_idx], cpu);
> + sd = &sd_cpuset_bundle->sd_cpuset;
> + if (cpus_equal(*cpu_map, sd->span))
> + return sd_idx;
> + }
> +
> + return SCHED_DOMAIN_CPUSET_MAX;
> +}
> +
> +void add_sched_domain(const cpumask_t *cpu_map)
> +{
[...]
> + /* tweak sched_domain params based upon domain size */
> + *sd = SD_NODE_INIT;
> + sd->flags |= SD_TRACKS_CPUSET;
> + sd->max_interval = 8*(min(new_sd_span, 32));
> + sd->span = *cpu_map;
> + cpu_set(cpu, new_sd_cpu_map);
Can we just have a new SD_xxx_INIT for these?
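i.e. something along these lines (sketch; the remaining fields would be
filled in like SD_NODE_INIT, and the values here are just placeholders):

#define SD_CPUSET_INIT (struct sched_domain) {		\
	.span			= CPU_MASK_NONE,	\
	.max_interval		= 256,			\
	.flags			= SD_LOAD_BALANCE	\
				| SD_BALANCE_EXEC	\
				| SD_TRACKS_CPUSET,	\
	/* ... */					\
}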
> Index: linux/kernel/cpuset.c
> ===================================================================
> --- linux.orig/kernel/cpuset.c 2006-07-05 15:51:38.873939805 -0700
> +++ linux/kernel/cpuset.c 2006-07-05 16:01:14.892039725 -0700
> @@ -828,8 +828,12 @@ static int update_cpumask(struct cpuset
> mutex_lock(&callback_mutex);
> cs->cpus_allowed = trialcs.cpus_allowed;
> mutex_unlock(&callback_mutex);
> - if (is_cpu_exclusive(cs) && !cpus_unchanged)
> - update_cpu_domains(cs);
> + if (!cpus_unchanged) {
> + if (is_cpu_exclusive(cs))
> + update_cpu_domains(cs);
> + else
> + add_sched_domain(&cs->cpus_allowed);
> + }
> return 0;
> }
>
> @@ -1934,6 +1938,8 @@ static int cpuset_rmdir(struct inode *un
> set_bit(CS_REMOVED, &cs->flags);
> if (is_cpu_exclusive(cs))
> update_cpu_domains(cs);
> + else
> + destroy_sched_domain(&cs->cpus_allowed);
> list_del(&cs->sibling); /* delete my sibling from parent->children */
> spin_lock(&cs->dentry->d_lock);
> d = dget(cs->dentry);
>
So you're just doing the inside. What about the complement? That
way you'd get a nice symmetric allocation on all cpus for each
cpuset... but that probably would require making the limit greater
than 8.
And it would probably be required for good sched_setaffinity
balancing too.
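Something along these lines, maybe (rough sketch; the wrapper name is
made up):

void add_sched_domain_and_complement(const cpumask_t *cpu_map)
{
	cpumask_t complement;

	/* everything online but outside the cpuset */
	cpus_andnot(complement, cpu_online_map, *cpu_map);

	add_sched_domain(cpu_map);
	if (!cpus_empty(complement))
		add_sched_domain(&complement);
}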
--
SUSE Labs, Novell Inc.