From: Nick Piggin <nickpiggin@yahoo.com.au>
To: hawkes@sgi.com
Cc: Andrew Morton <akpm@osdl.org>, Ingo Molnar <mingo@elte.hu>,
Jack Steiner <steiner@sgi.com>, Paul Jackson <pj@sgi.com>,
linux-kernel@vger.kernel.org, John Hawkes <jrhawkes@yahoo.com>
Subject: Re: [PATCH] build sched domains tracking cpusets
Date: Fri, 07 Jul 2006 18:38:33 +1000 [thread overview]
Message-ID: <44AE1D89.20608@yahoo.com.au> (raw)
In-Reply-To: <20060706234356.23106.60834.sendpatchset@tomahawk.engr.sgi.com>
Hi John,
cool patch. I have a few comments.
hawkes@sgi.com wrote:
> This patch introduces the notion of dynamic sched domains (and sched
> groups) that track the creation and deletion of cpusets. Up to eight of
> these dynamic sched domains can exist per CPU, which seems to handle the
> observed use of cpusets by one popular job manager. Essentially, every
> new cpuset causes the creation of a new sched domain for the CPUs of
> that cpuset, and a deletion of the cpuset causes the destruction of that
> dynamic sched domain. New sched domains are inserted into each CPU's
> list as appropriate. The limit of 8/CPU (vs. unlimited) not only caps
> the upperbound of kmalloc memory being assigned to sched domain and
> sched group structs (thereby avoiding a potential denial-of-service
> attack by a runaway user creating cpusets), but it also caps the number
> of sched domains being searched by various algorithms in the scheduler.
>
> A "feature" of this implementation is that these dynamic sched domains
> build up until that limit of 8/CPU is reached, at which point no
> additional sched domains are created; or until a cpu-exclusive cpuset
> gets declared, at which point all the dynamically created sched domains
> in the affected CPUs get destroyed. A more sophisticated algorithm in
> kernel/cpuset.c could redeclare the dynamic sched domains after a
> cpu-exclusive cpuset gets declared, but that is outside the scope of
> this simple patch's simple change to kernel/cpuset.c.
>
> Signed-off-by: John Hawkes <hawkes@sgi.com>
>
> Index: linux/include/linux/sched.h
> ===================================================================
> --- linux.orig/include/linux/sched.h 2006-06-17 18:49:35.000000000 -0700
> +++ linux/include/linux/sched.h 2006-06-20 13:50:19.092284572 -0700
> @@ -558,6 +558,7 @@ enum idle_type
> #define SD_WAKE_AFFINE 32 /* Wake task to waking CPU */
> #define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */
> #define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */
> +#define SD_TRACKS_CPUSET 4096 /* Spam matches non-exclusive cpuset */
I'd just use the next bit, and we should convert these to hex as
well I guess... I don't know what I was thinking ;)
>
> struct sched_group {
> struct sched_group *next; /* Must be a circular list */
> @@ -630,6 +631,9 @@ struct sched_domain {
> extern void partition_sched_domains(cpumask_t *partition1,
> cpumask_t *partition2);
>
> +extern void add_sched_domain(const cpumask_t *cpu_map);
> +extern void destroy_sched_domain(const cpumask_t *cpu_map);
Not a big deal, but I'd probably be happier with a single hook from
the cpusets code, which does the right thing depending on whether it
is exclusive or not.
Hard to know which way around the layering should go, but I think that
it is simply a hint to the balancer code to do something nice, and as
such we should leave it completely to the scheduler.
> +#ifdef CONFIG_CPUSETS
CPUSETS only? Or do we want to try to help sched_setaffinity users as
well?
> +
> +struct sched_domain_bundle {
> + struct sched_domain sd_cpuset;
> + struct sched_group **sg_cpuset_nodes;
> + int use_count;
> +};
Can you get away without using the bundle? Or does it make things easier?
You could add a new use_count to struct sched_domain if you'd like.... does
it make the setup/teardown too difficult?
> +#define SCHED_DOMAIN_CPUSET_MAX 8 /* power of 2 */
> +static DEFINE_PER_CPU(long, sd_cpusets_used) = { 1UL <<SCHED_DOMAIN_CPUSET_MAX};
> +static DEFINE_PER_CPU(struct sched_domain_bundle[SCHED_DOMAIN_CPUSET_MAX],
> + sd_cpusets_bundle);
Do we need a limit of 8? If you're worried about memory, I guess there are
lots of ways to DOS the system... if you're worried about scheduler balancing
overhead, we could do a seperate traversal of these guys at a reduced
interval (I guess that still leaves some places needing help, though)
> +static int find_existing_sched_domain(int cpu, const cpumask_t *cpu_map)
Probably some way to distinguish these guys as operating on your "bundle" stack
would make it a bit clearer?
> +{
> + int sd_idx;
> +
> + for (sd_idx = 0; sd_idx < SCHED_DOMAIN_CPUSET_MAX; sd_idx++) {
> + struct sched_domain_bundle *sd_cpuset_bundle;
> + struct sched_domain *sd;
> +
> + if (!test_bit(sd_idx, &per_cpu(sd_cpusets_used, cpu)))
> + continue;
> + sd_cpuset_bundle = &per_cpu(sd_cpusets_bundle[sd_idx], cpu);
> + sd = &sd_cpuset_bundle->sd_cpuset;
> + if (cpus_equal(*cpu_map, sd->span))
> + return sd_idx;
> + }
> +
> + return SCHED_DOMAIN_CPUSET_MAX;
> +}
> +
> +void add_sched_domain(const cpumask_t *cpu_map)
> +{
[...]
> + /* tweak sched_domain params based upon domain size */
> + *sd = SD_NODE_INIT;
> + sd->flags |= SD_TRACKS_CPUSET;
> + sd->max_interval = 8*(min(new_sd_span, 32));
> + sd->span = *cpu_map;
> + cpu_set(cpu, new_sd_cpu_map);
Can we just have a new SD_xxx_INIT for these?
> Index: linux/kernel/cpuset.c
> ===================================================================
> --- linux.orig/kernel/cpuset.c 2006-07-05 15:51:38.873939805 -0700
> +++ linux/kernel/cpuset.c 2006-07-05 16:01:14.892039725 -0700
> @@ -828,8 +828,12 @@ static int update_cpumask(struct cpuset
> mutex_lock(&callback_mutex);
> cs->cpus_allowed = trialcs.cpus_allowed;
> mutex_unlock(&callback_mutex);
> - if (is_cpu_exclusive(cs) && !cpus_unchanged)
> - update_cpu_domains(cs);
> + if (!cpus_unchanged) {
> + if (is_cpu_exclusive(cs))
> + update_cpu_domains(cs);
> + else
> + add_sched_domain(&cs->cpus_allowed);
> + }
> return 0;
> }
>
> @@ -1934,6 +1938,8 @@ static int cpuset_rmdir(struct inode *un
> set_bit(CS_REMOVED, &cs->flags);
> if (is_cpu_exclusive(cs))
> update_cpu_domains(cs);
> + else
> + destroy_sched_domain(&cs->cpus_allowed);
> list_del(&cs->sibling); /* delete my sibling from parent->children */
> spin_lock(&cs->dentry->d_lock);
> d = dget(cs->dentry);
>
So you're just doing the inside. What about the complement? That
way you'd get a nice symmetric allocation on all cpus for each
cpuset... but that probably would require making the limit greater
than 8.
And it would be probably required for good sched_setaffinity
balancing too.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
next prev parent reply other threads:[~2006-07-07 17:16 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-07-06 23:43 [PATCH] build sched domains tracking cpusets hawkes
2006-07-07 8:38 ` Nick Piggin [this message]
2006-07-07 19:05 ` John Hawkes
-- strict thread matches above, loose matches on Subject: below --
2006-07-07 19:31 hawkes
2006-07-10 21:41 ` Siddha, Suresh B
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=44AE1D89.20608@yahoo.com.au \
--to=nickpiggin@yahoo.com.au \
--cc=akpm@osdl.org \
--cc=hawkes@sgi.com \
--cc=jrhawkes@yahoo.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@elte.hu \
--cc=pj@sgi.com \
--cc=steiner@sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.