* [RFT PATCH] Dynamic sched domains (v0.6)
@ 2005-05-17 4:10 Dinakar Guniguntala
2005-05-17 4:12 ` [PATCH 2/3] " Dinakar Guniguntala
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Dinakar Guniguntala @ 2005-05-17 4:10 UTC (permalink / raw)
To: Paul Jackson, Simon Derr, Nick Piggin, lkml, lse-tech,
Matthew Dobson, Dipankar Sarma, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 1524 bytes --]
Ok, here is hopefully the last iteration of the dynamic sched domain
patches before I can send it off to Andrew.
I would have posted it earlier, except I got sidetracked with all
the hotplug+preempt issues. cpusets+hotplug+preempt requires more
testing and I'll post those patches sometime later in the week.
patch1 - sched.c+sched.h changes
patch2 - cpuset.c+Documentation changes
patch3 - ia64 changes
All patches are against 2.6.12-rc4-mm1 and have been tested
(except for the ia64 changes)
linux-2.6.12-rc4-mm1-1/include/linux/sched.h | 2
linux-2.6.12-rc4-mm1-1/kernel/sched.c | 131 +++++++++++++++--------
linux-2.6.12-rc4-mm1-2/Documentation/cpusets.txt | 16 ++
linux-2.6.12-rc4-mm1-2/kernel/cpuset.c | 89 ++++++++++++---
linux-2.6.12-rc4-mm1-3/arch/ia64/kernel/domain.c | 77 ++++++++-----
5 files changed, 225 insertions(+), 90 deletions(-)
o Patch1 adds the new API partition_sched_domains.
I have incorporated all of the feedback from before.
o I didn't think it necessary to add another semaphore to protect the sched domain
changes as suggested by Nick. As was clear last week, hotplug+cpusets
causes enough issues with nested sems, so I didn't want to add yet another to
the mix
o I have removed the __devinit qualifier from some functions which will
now get called anytime changes are made to exclusive cpusets instead of
only during boot
o arch_init_sched_domains/arch_destroy_sched_domains now take a
const cpumask_t * argument.
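For readers skimming the diffs below, the semantics of the new
partition_sched_domains() API can be sketched in userspace. This is a
hypothetical mock, not kernel code: a uint64_t stands in for cpumask_t,
and mock_partition()/struct result are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of partition_sched_domains() from patch1: both partitions
 * are first restricted to online cpus, their union (change_map) is the
 * set of cpus that get detached to the NULL domain, and each non-empty
 * partition is then rebuilt separately. */
typedef uint64_t mask_t;

struct result {
	mask_t change;   /* cpus detached before rebuilding */
	mask_t part1;    /* rebuilt as one sched domain span (if non-empty) */
	mask_t part2;    /* rebuilt as the other span (if non-empty) */
};

static struct result mock_partition(mask_t p1, mask_t p2, mask_t online)
{
	struct result r;

	p1 &= online;            /* cpus_and(*partition1, ..., cpu_online_map) */
	p2 &= online;            /* cpus_and(*partition2, ..., cpu_online_map) */
	r.change = p1 | p2;      /* cpus_or(change_map, p1, p2) */
	r.part1 = p1;
	r.part2 = p2;
	return r;
}
```

With online cpus 0-3 (mask 0x0f), requesting partitions 0x33 and 0x0c
trims the first to 0x03 and leaves 0x0c, detaching all of 0x0f.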
-Dinakar
[-- Attachment #2: dyn-sd-rc4mm1-v0.6-1.patch --]
[-- Type: text/plain, Size: 10276 bytes --]
diff -Naurp linux-2.6.12-rc4-mm1-0/include/linux/sched.h linux-2.6.12-rc4-mm1-1/include/linux/sched.h
--- linux-2.6.12-rc4-mm1-0/include/linux/sched.h 2005-05-16 14:55:43.000000000 +0530
+++ linux-2.6.12-rc4-mm1-1/include/linux/sched.h 2005-05-16 15:13:22.000000000 +0530
@@ -561,6 +561,8 @@ struct sched_domain {
#endif
};
+extern void partition_sched_domains(cpumask_t *partition1,
+ cpumask_t *partition2);
#ifdef ARCH_HAS_SCHED_DOMAIN
/* Useful helpers that arch setup code may use. Defined in kernel/sched.c */
extern cpumask_t cpu_isolated_map;
diff -Naurp linux-2.6.12-rc4-mm1-0/kernel/sched.c linux-2.6.12-rc4-mm1-1/kernel/sched.c
--- linux-2.6.12-rc4-mm1-0/kernel/sched.c 2005-05-16 14:58:01.000000000 +0530
+++ linux-2.6.12-rc4-mm1-1/kernel/sched.c 2005-05-16 19:21:27.000000000 +0530
@@ -264,7 +264,7 @@ static DEFINE_PER_CPU(struct runqueue, r
/*
* The domain tree (rq->sd) is protected by RCU's quiescent state transition.
- * See update_sched_domains: synchronize_kernel for details.
+ * See detach_destroy_domains: synchronize_sched for details.
*
* The domain tree of any CPU may only be accessed from within
* preempt-disabled sections.
@@ -4615,7 +4615,7 @@ int __init migration_init(void)
#endif
#ifdef CONFIG_SMP
-#define SCHED_DOMAIN_DEBUG
+#undef SCHED_DOMAIN_DEBUG
#ifdef SCHED_DOMAIN_DEBUG
static void sched_domain_debug(struct sched_domain *sd, int cpu)
{
@@ -4831,7 +4831,7 @@ static void init_sched_domain_sysctl(voi
}
#endif
-static int __devinit sd_degenerate(struct sched_domain *sd)
+static int sd_degenerate(struct sched_domain *sd)
{
if (cpus_weight(sd->span) == 1)
return 1;
@@ -4854,7 +4854,7 @@ static int __devinit sd_degenerate(struc
return 1;
}
-static int __devinit sd_parent_degenerate(struct sched_domain *sd,
+static int sd_parent_degenerate(struct sched_domain *sd,
struct sched_domain *parent)
{
unsigned long cflags = sd->flags, pflags = parent->flags;
@@ -4886,7 +4886,7 @@ static int __devinit sd_parent_degenerat
* Attach the domain 'sd' to 'cpu' as its base domain. Callers must
* hold the hotplug lock.
*/
-void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu)
+void cpu_attach_domain(struct sched_domain *sd, int cpu)
{
runqueue_t *rq = cpu_rq(cpu);
struct sched_domain *tmp;
@@ -4937,7 +4937,7 @@ __setup ("isolcpus=", isolated_cpu_setup
* covered by the given span, and will set each group's ->cpumask correctly,
* and ->cpu_power to 0.
*/
-void __devinit init_sched_build_groups(struct sched_group groups[],
+void init_sched_build_groups(struct sched_group groups[],
cpumask_t span, int (*group_fn)(int cpu))
{
struct sched_group *first = NULL, *last = NULL;
@@ -4973,13 +4973,14 @@ void __devinit init_sched_build_groups(s
#ifdef ARCH_HAS_SCHED_DOMAIN
-extern void __devinit arch_init_sched_domains(void);
-extern void __devinit arch_destroy_sched_domains(void);
+extern void build_sched_domains(const cpumask_t *cpu_map);
+extern void arch_init_sched_domains(const cpumask_t *cpu_map);
+extern void arch_destroy_sched_domains(const cpumask_t *cpu_map);
#else
#ifdef CONFIG_SCHED_SMT
static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
static struct sched_group sched_group_cpus[NR_CPUS];
-static int __devinit cpu_to_cpu_group(int cpu)
+static int cpu_to_cpu_group(int cpu)
{
return cpu;
}
@@ -4987,7 +4988,7 @@ static int __devinit cpu_to_cpu_group(in
static DEFINE_PER_CPU(struct sched_domain, phys_domains);
static struct sched_group sched_group_phys[NR_CPUS];
-static int __devinit cpu_to_phys_group(int cpu)
+static int cpu_to_phys_group(int cpu)
{
#ifdef CONFIG_SCHED_SMT
return first_cpu(cpu_sibling_map[cpu]);
@@ -5000,7 +5001,7 @@ static int __devinit cpu_to_phys_group(i
static DEFINE_PER_CPU(struct sched_domain, node_domains);
static struct sched_group sched_group_nodes[MAX_NUMNODES];
-static int __devinit cpu_to_node_group(int cpu)
+static int cpu_to_node_group(int cpu)
{
return cpu_to_node(cpu);
}
@@ -5031,39 +5032,28 @@ static void check_sibling_maps(void)
#endif
/*
- * Set up scheduler domains and groups. Callers must hold the hotplug lock.
+ * Build sched domains for a given set of cpus and attach the sched domains
+ * to the individual cpus
*/
-static void __devinit arch_init_sched_domains(void)
+static void build_sched_domains(const cpumask_t *cpu_map)
{
int i;
- cpumask_t cpu_default_map;
-#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
- check_sibling_maps();
-#endif
/*
- * Setup mask for cpus without special case scheduling requirements.
- * For now this just excludes isolated cpus, but could be used to
- * exclude other special cases in the future.
+ * Set up domains for cpus specified by the cpu_map.
*/
- cpus_complement(cpu_default_map, cpu_isolated_map);
- cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);
-
- /*
- * Set up domains. Isolated domains just stay on the NULL domain.
- */
- for_each_cpu_mask(i, cpu_default_map) {
+ for_each_cpu_mask(i, *cpu_map) {
int group;
struct sched_domain *sd = NULL, *p;
cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
- cpus_and(nodemask, nodemask, cpu_default_map);
+ cpus_and(nodemask, nodemask, *cpu_map);
#ifdef CONFIG_NUMA
sd = &per_cpu(node_domains, i);
group = cpu_to_node_group(i);
*sd = SD_NODE_INIT;
- sd->span = cpu_default_map;
+ sd->span = *cpu_map;
sd->groups = &sched_group_nodes[group];
#endif
@@ -5081,7 +5071,7 @@ static void __devinit arch_init_sched_do
group = cpu_to_cpu_group(i);
*sd = SD_SIBLING_INIT;
sd->span = cpu_sibling_map[i];
- cpus_and(sd->span, sd->span, cpu_default_map);
+ cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
sd->groups = &sched_group_cpus[group];
#endif
@@ -5091,7 +5081,7 @@ static void __devinit arch_init_sched_do
/* Set up CPU (sibling) groups */
for_each_online_cpu(i) {
cpumask_t this_sibling_map = cpu_sibling_map[i];
- cpus_and(this_sibling_map, this_sibling_map, cpu_default_map);
+ cpus_and(this_sibling_map, this_sibling_map, *cpu_map);
if (i != first_cpu(this_sibling_map))
continue;
@@ -5104,7 +5094,7 @@ static void __devinit arch_init_sched_do
for (i = 0; i < MAX_NUMNODES; i++) {
cpumask_t nodemask = node_to_cpumask(i);
- cpus_and(nodemask, nodemask, cpu_default_map);
+ cpus_and(nodemask, nodemask, *cpu_map);
if (cpus_empty(nodemask))
continue;
@@ -5114,12 +5104,12 @@ static void __devinit arch_init_sched_do
#ifdef CONFIG_NUMA
/* Set up node groups */
- init_sched_build_groups(sched_group_nodes, cpu_default_map,
+ init_sched_build_groups(sched_group_nodes, *cpu_map,
&cpu_to_node_group);
#endif
/* Calculate CPU power for physical packages and nodes */
- for_each_cpu_mask(i, cpu_default_map) {
+ for_each_cpu_mask(i, *cpu_map) {
int power;
struct sched_domain *sd;
#ifdef CONFIG_SCHED_SMT
@@ -5143,7 +5133,7 @@ static void __devinit arch_init_sched_do
}
/* Attach the domains */
- for_each_online_cpu(i) {
+ for_each_cpu_mask(i, *cpu_map) {
struct sched_domain *sd;
#ifdef CONFIG_SCHED_SMT
sd = &per_cpu(cpu_domains, i);
@@ -5153,16 +5143,72 @@ static void __devinit arch_init_sched_do
cpu_attach_domain(sd, i);
}
}
+/*
+ * Set up scheduler domains and groups. Callers must hold the hotplug lock.
+ */
+static void arch_init_sched_domains(cpumask_t *cpu_map)
+{
+ cpumask_t cpu_default_map;
-#ifdef CONFIG_HOTPLUG_CPU
-static void __devinit arch_destroy_sched_domains(void)
+#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
+ check_sibling_maps();
+#endif
+ /*
+ * Setup mask for cpus without special case scheduling requirements.
+ * For now this just excludes isolated cpus, but could be used to
+ * exclude other special cases in the future.
+ */
+ cpus_complement(cpu_default_map, cpu_isolated_map);
+ cpus_and(cpu_default_map, cpu_default_map, *cpu_map);
+
+ build_sched_domains(&cpu_default_map);
+}
+
+static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
{
/* Do nothing: everything is statically allocated. */
}
-#endif
#endif /* ARCH_HAS_SCHED_DOMAIN */
+/*
+ * Detach sched domains from a group of cpus specified in cpu_map
+ * These cpus will now be attached to the NULL domain
+ */
+static inline void detach_destroy_domains(const cpumask_t *cpu_map)
+{
+ int i;
+
+ for_each_cpu_mask(i, *cpu_map)
+ cpu_attach_domain(NULL, i);
+ synchronize_sched();
+ arch_destroy_sched_domains(cpu_map);
+}
+
+/*
+ * Partition sched domains as specified by the cpumasks below.
+ * This attaches all cpus from the cpumasks to the NULL domain,
+ * waits for a RCU quiescent period, recalculates sched
+ * domain information and then attaches them back to the
+ * correct sched domains
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
+{
+ cpumask_t change_map;
+
+ cpus_and(*partition1, *partition1, cpu_online_map);
+ cpus_and(*partition2, *partition2, cpu_online_map);
+ cpus_or(change_map, *partition1, *partition2);
+
+ /* Detach sched domains from all of the affected cpus */
+ detach_destroy_domains(&change_map);
+ if (!cpus_empty(*partition1))
+ build_sched_domains(partition1);
+ if (!cpus_empty(*partition2))
+ build_sched_domains(partition2);
+}
+
#ifdef CONFIG_HOTPLUG_CPU
/*
* Force a reinitialization of the sched domains hierarchy. The domains
@@ -5178,10 +5224,7 @@ static int update_sched_domains(struct n
switch (action) {
case CPU_UP_PREPARE:
case CPU_DOWN_PREPARE:
- for_each_online_cpu(i)
- cpu_attach_domain(NULL, i);
- synchronize_kernel();
- arch_destroy_sched_domains();
+ detach_destroy_domains(&cpu_online_map);
return NOTIFY_OK;
case CPU_UP_CANCELED:
@@ -5197,7 +5240,7 @@ static int update_sched_domains(struct n
}
/* The hotplug lock is already held by cpu_up/cpu_down */
- arch_init_sched_domains();
+ arch_init_sched_domains(&cpu_online_map);
return NOTIFY_OK;
}
@@ -5206,7 +5249,7 @@ static int update_sched_domains(struct n
void __init sched_init_smp(void)
{
lock_cpu_hotplug();
- arch_init_sched_domains();
+ arch_init_sched_domains(&cpu_online_map);
unlock_cpu_hotplug();
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);
* [PATCH 2/3] Dynamic sched domains (v0.6)
2005-05-17 4:10 [RFT PATCH] Dynamic sched domains (v0.6) Dinakar Guniguntala
@ 2005-05-17 4:12 ` Dinakar Guniguntala
2005-05-17 6:25 ` Nick Piggin
2005-05-17 4:14 ` [PATCH 3/3] " Dinakar Guniguntala
2005-05-18 5:53 ` [RFT PATCH] " Paul Jackson
2 siblings, 1 reply; 10+ messages in thread
From: Dinakar Guniguntala @ 2005-05-17 4:12 UTC (permalink / raw)
To: Paul Jackson, Simon Derr, Nick Piggin, lkml, lse-tech,
Matthew Dobson, Dipankar Sarma, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 151 bytes --]
o Patch2 has updated cpusets documentation and the core update_cpu_domains
function
o I have also moved the dentry d_lock as discussed previously
[-- Attachment #2: dyn-sd-rc4mm1-v0.6-2.patch --]
[-- Type: text/plain, Size: 6565 bytes --]
diff -Naurp linux-2.6.12-rc4-mm1-1/Documentation/cpusets.txt linux-2.6.12-rc4-mm1-2/Documentation/cpusets.txt
--- linux-2.6.12-rc4-mm1-1/Documentation/cpusets.txt 2005-05-16 15:14:05.000000000 +0530
+++ linux-2.6.12-rc4-mm1-2/Documentation/cpusets.txt 2005-05-16 22:56:43.000000000 +0530
@@ -51,6 +51,14 @@ mems_allowed vector.
If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
ancestor or descendent, may share any of the same CPUs or Memory Nodes.
+A cpuset that is cpu exclusive has a sched domain associated with it.
+The sched domain consists of all cpus in the current cpuset that are not
+part of any exclusive child cpusets.
+This ensures that the scheduler load balancing code only balances
+against the cpus that are in the sched domain as defined above and not
+all of the cpus in the system. This removes any overhead due to
+load balancing code trying to pull tasks outside of the cpu exclusive
+cpuset only to be prevented by the tasks' cpus_allowed mask.
User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
@@ -84,6 +92,9 @@ This can be especially valuable on:
and a database), or
* NUMA systems running large HPC applications with demanding
performance characteristics.
+ * Also cpu-exclusive cpusets are useful for servers running orthogonal
+ workloads such as RT applications requiring low latency and HPC
+ applications that are throughput sensitive
These subsets, or "soft partitions" must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
@@ -125,6 +136,8 @@ Cpusets extends these two mechanisms as
- A cpuset may be marked exclusive, which ensures that no other
cpuset (except direct ancestors and descendents) may contain
any overlapping CPUs or Memory Nodes.
+ Also a cpu-exclusive cpuset would be associated with a sched
+ domain.
- You can list all the tasks (by pid) attached to any cpuset.
The implementation of cpusets requires a few, simple hooks
@@ -136,6 +149,9 @@ into the rest of the kernel, none in per
allowed in that tasks cpuset.
- in sched.c migrate_all_tasks(), to keep migrating tasks within
the CPUs allowed by their cpuset, if possible.
+ - in sched.c, a new API partition_sched_domains for handling
+ sched domain changes associated with cpu-exclusive cpusets
+ and related changes in both sched.c and arch/ia64/kernel/domain.c
- in the mbind and set_mempolicy system calls, to mask the requested
Memory Nodes by what's allowed in that tasks cpuset.
- in page_alloc, to restrict memory to allowed nodes.
diff -Naurp linux-2.6.12-rc4-mm1-1/kernel/cpuset.c linux-2.6.12-rc4-mm1-2/kernel/cpuset.c
--- linux-2.6.12-rc4-mm1-1/kernel/cpuset.c 2005-05-16 15:08:08.000000000 +0530
+++ linux-2.6.12-rc4-mm1-2/kernel/cpuset.c 2005-05-16 15:19:54.000000000 +0530
@@ -596,12 +596,62 @@ static int validate_change(const struct
return 0;
}
+/*
+ * For a given cpuset cur, partition the system as follows
+ * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
+ * exclusive child cpusets
+ * b. All cpus in the current cpuset's cpus_allowed that are not part of any
+ * exclusive child cpusets
+ * Build these two partitions by calling partition_sched_domains
+ */
+static void update_cpu_domains(struct cpuset *cur)
+{
+ struct cpuset *c, *par = cur->parent;
+ cpumask_t pspan, cspan;
+
+ if (par == NULL || cpus_empty(cur->cpus_allowed))
+ return;
+
+ /*
+ * Get all cpus from parent's cpus_allowed not part of exclusive
+ * children
+ */
+ pspan = par->cpus_allowed;
+ list_for_each_entry(c, &par->children, sibling) {
+ if (is_cpu_exclusive(c))
+ cpus_andnot(pspan, pspan, c->cpus_allowed);
+ }
+ if (is_removed(cur) || !is_cpu_exclusive(cur)) {
+ cpus_or(pspan, pspan, cur->cpus_allowed);
+ if (cpus_equal(pspan, cur->cpus_allowed))
+ return;
+ cspan = CPU_MASK_NONE;
+ }
+ else {
+ if (cpus_empty(pspan))
+ return;
+ cspan = cur->cpus_allowed;
+ /*
+ * Get all cpus from current cpuset's cpus_allowed not part
+ * of exclusive children
+ */
+ list_for_each_entry(c, &cur->children, sibling) {
+ if (is_cpu_exclusive(c))
+ cpus_andnot(cspan, cspan, c->cpus_allowed);
+ }
+ }
+
+ lock_cpu_hotplug();
+ partition_sched_domains(&pspan, &cspan);
+ unlock_cpu_hotplug();
+}
+
static int update_cpumask(struct cpuset *cs, char *buf)
{
- struct cpuset trialcs;
+ struct cpuset trialcs, oldcs;
int retval;
- trialcs = *cs;
+ trialcs = oldcs = *cs;
retval = cpulist_parse(buf, trialcs.cpus_allowed);
if (retval < 0)
return retval;
@@ -609,9 +659,13 @@ static int update_cpumask(struct cpuset
if (cpus_empty(trialcs.cpus_allowed))
return -ENOSPC;
retval = validate_change(cs, &trialcs);
- if (retval == 0)
- cs->cpus_allowed = trialcs.cpus_allowed;
- return retval;
+ if (retval < 0)
+ return retval;
+ cs->cpus_allowed = trialcs.cpus_allowed;
+ if (is_cpu_exclusive(cs) &&
+ (!cpus_equal(cs->cpus_allowed, oldcs.cpus_allowed)))
+ update_cpu_domains(cs);
+ return 0;
}
static int update_nodemask(struct cpuset *cs, char *buf)
@@ -646,25 +700,28 @@ static int update_nodemask(struct cpuset
static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf)
{
int turning_on;
- struct cpuset trialcs;
+ struct cpuset trialcs, oldcs;
int err;
turning_on = (simple_strtoul(buf, NULL, 10) != 0);
- trialcs = *cs;
+ trialcs = oldcs = *cs;
if (turning_on)
set_bit(bit, &trialcs.flags);
else
clear_bit(bit, &trialcs.flags);
err = validate_change(cs, &trialcs);
- if (err == 0) {
- if (turning_on)
- set_bit(bit, &cs->flags);
- else
- clear_bit(bit, &cs->flags);
- }
- return err;
+ if (err < 0)
+ return err;
+ if (turning_on)
+ set_bit(bit, &cs->flags);
+ else
+ clear_bit(bit, &cs->flags);
+
+ if (is_cpu_exclusive(cs) != is_cpu_exclusive(&oldcs))
+ update_cpu_domains(cs);
+ return 0;
}
static int attach_task(struct cpuset *cs, char *buf)
@@ -1310,12 +1367,14 @@ static int cpuset_rmdir(struct inode *un
up(&cpuset_sem);
return -EBUSY;
}
- spin_lock(&cs->dentry->d_lock);
parent = cs->parent;
set_bit(CS_REMOVED, &cs->flags);
+ if (is_cpu_exclusive(cs))
+ update_cpu_domains(cs);
list_del(&cs->sibling); /* delete my sibling from parent->children */
if (list_empty(&parent->children))
check_for_release(parent);
+ spin_lock(&cs->dentry->d_lock);
d = dget(cs->dentry);
cs->dentry = NULL;
spin_unlock(&d->d_lock);
* Re: [PATCH 3/3] Dynamic sched domains (v0.6)
2005-05-17 4:10 [RFT PATCH] Dynamic sched domains (v0.6) Dinakar Guniguntala
2005-05-17 4:12 ` [PATCH 2/3] " Dinakar Guniguntala
@ 2005-05-17 4:14 ` Dinakar Guniguntala
2005-05-18 5:53 ` [RFT PATCH] " Paul Jackson
2 siblings, 0 replies; 10+ messages in thread
From: Dinakar Guniguntala @ 2005-05-17 4:14 UTC (permalink / raw)
To: Paul Jackson, Simon Derr, Nick Piggin, lkml, lse-tech,
Matthew Dobson, Dipankar Sarma, Andrew Morton
[-- Attachment #1: Type: text/plain, Size: 108 bytes --]
o Patch3 has the ia64 changes similar to kernel/sched.c
o This patch compiles ok, but has not been tested
[-- Attachment #2: dyn-sd-rc4mm1-v0.6-3.patch --]
[-- Type: text/plain, Size: 6871 bytes --]
diff -Naurp linux-2.6.12-rc4-mm1-2/arch/ia64/kernel/domain.c linux-2.6.12-rc4-mm1-3/arch/ia64/kernel/domain.c
--- linux-2.6.12-rc4-mm1-2/arch/ia64/kernel/domain.c 2005-05-16 15:06:51.000000000 +0530
+++ linux-2.6.12-rc4-mm1-3/arch/ia64/kernel/domain.c 2005-05-16 17:21:56.000000000 +0530
@@ -27,7 +27,7 @@
*
* Should use nodemask_t.
*/
-static int __devinit find_next_best_node(int node, unsigned long *used_nodes)
+static int find_next_best_node(int node, unsigned long *used_nodes)
{
int i, n, val, min_val, best_node = 0;
@@ -66,7 +66,7 @@ static int __devinit find_next_best_node
* should be one that prevents unnecessary balancing, but also spreads tasks
* out optimally.
*/
-static cpumask_t __devinit sched_domain_node_span(int node)
+static cpumask_t sched_domain_node_span(int node)
{
int i;
cpumask_t span, nodemask;
@@ -96,7 +96,7 @@ static cpumask_t __devinit sched_domain_
#ifdef CONFIG_SCHED_SMT
static DEFINE_PER_CPU(struct sched_domain, cpu_domains);
static struct sched_group sched_group_cpus[NR_CPUS];
-static int __devinit cpu_to_cpu_group(int cpu)
+static int cpu_to_cpu_group(int cpu)
{
return cpu;
}
@@ -104,7 +104,7 @@ static int __devinit cpu_to_cpu_group(in
static DEFINE_PER_CPU(struct sched_domain, phys_domains);
static struct sched_group sched_group_phys[NR_CPUS];
-static int __devinit cpu_to_phys_group(int cpu)
+static int cpu_to_phys_group(int cpu)
{
#ifdef CONFIG_SCHED_SMT
return first_cpu(cpu_sibling_map[cpu]);
@@ -125,44 +125,36 @@ static struct sched_group *sched_group_n
static DEFINE_PER_CPU(struct sched_domain, allnodes_domains);
static struct sched_group sched_group_allnodes[MAX_NUMNODES];
-static int __devinit cpu_to_allnodes_group(int cpu)
+static int cpu_to_allnodes_group(int cpu)
{
return cpu_to_node(cpu);
}
#endif
/*
- * Set up scheduler domains and groups. Callers must hold the hotplug lock.
+ * Build sched domains for a given set of cpus and attach the sched domains
+ * to the individual cpus
*/
-void __devinit arch_init_sched_domains(void)
+void build_sched_domains(const cpumask_t *cpu_map)
{
int i;
- cpumask_t cpu_default_map;
-
- /*
- * Setup mask for cpus without special case scheduling requirements.
- * For now this just excludes isolated cpus, but could be used to
- * exclude other special cases in the future.
- */
- cpus_complement(cpu_default_map, cpu_isolated_map);
- cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);
/*
- * Set up domains. Isolated domains just stay on the dummy domain.
+ * Set up domains for cpus specified by the cpu_map.
*/
- for_each_cpu_mask(i, cpu_default_map) {
+ for_each_cpu_mask(i, *cpu_map) {
int group;
struct sched_domain *sd = NULL, *p;
cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
- cpus_and(nodemask, nodemask, cpu_default_map);
+ cpus_and(nodemask, nodemask, *cpu_map);
#ifdef CONFIG_NUMA
if (num_online_cpus()
> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
sd = &per_cpu(allnodes_domains, i);
*sd = SD_ALLNODES_INIT;
- sd->span = cpu_default_map;
+ sd->span = *cpu_map;
group = cpu_to_allnodes_group(i);
sd->groups = &sched_group_allnodes[group];
p = sd;
@@ -173,7 +165,7 @@ void __devinit arch_init_sched_domains(v
*sd = SD_NODE_INIT;
sd->span = sched_domain_node_span(cpu_to_node(i));
sd->parent = p;
- cpus_and(sd->span, sd->span, cpu_default_map);
+ cpus_and(sd->span, sd->span, *cpu_map);
#endif
p = sd;
@@ -190,7 +182,7 @@ void __devinit arch_init_sched_domains(v
group = cpu_to_cpu_group(i);
*sd = SD_SIBLING_INIT;
sd->span = cpu_sibling_map[i];
- cpus_and(sd->span, sd->span, cpu_default_map);
+ cpus_and(sd->span, sd->span, *cpu_map);
sd->parent = p;
sd->groups = &sched_group_cpus[group];
#endif
@@ -198,9 +190,9 @@ void __devinit arch_init_sched_domains(v
#ifdef CONFIG_SCHED_SMT
/* Set up CPU (sibling) groups */
- for_each_cpu_mask(i, cpu_default_map) {
+ for_each_cpu_mask(i, *cpu_map) {
cpumask_t this_sibling_map = cpu_sibling_map[i];
- cpus_and(this_sibling_map, this_sibling_map, cpu_default_map);
+ cpus_and(this_sibling_map, this_sibling_map, *cpu_map);
if (i != first_cpu(this_sibling_map))
continue;
@@ -213,7 +205,7 @@ void __devinit arch_init_sched_domains(v
for (i = 0; i < MAX_NUMNODES; i++) {
cpumask_t nodemask = node_to_cpumask(i);
- cpus_and(nodemask, nodemask, cpu_default_map);
+ cpus_and(nodemask, nodemask, *cpu_map);
if (cpus_empty(nodemask))
continue;
@@ -222,7 +214,7 @@ void __devinit arch_init_sched_domains(v
}
#ifdef CONFIG_NUMA
- init_sched_build_groups(sched_group_allnodes, cpu_default_map,
+ init_sched_build_groups(sched_group_allnodes, *cpu_map,
&cpu_to_allnodes_group);
for (i = 0; i < MAX_NUMNODES; i++) {
@@ -233,12 +225,12 @@ void __devinit arch_init_sched_domains(v
cpumask_t covered = CPU_MASK_NONE;
int j;
- cpus_and(nodemask, nodemask, cpu_default_map);
+ cpus_and(nodemask, nodemask, *cpu_map);
if (cpus_empty(nodemask))
continue;
domainspan = sched_domain_node_span(i);
- cpus_and(domainspan, domainspan, cpu_default_map);
+ cpus_and(domainspan, domainspan, *cpu_map);
sg = kmalloc(sizeof(struct sched_group), GFP_KERNEL);
sched_group_nodes[i] = sg;
@@ -266,7 +258,7 @@ void __devinit arch_init_sched_domains(v
int n = (i + j) % MAX_NUMNODES;
cpus_complement(notcovered, covered);
- cpus_and(tmp, notcovered, cpu_default_map);
+ cpus_and(tmp, notcovered, *cpu_map);
cpus_and(tmp, tmp, domainspan);
if (cpus_empty(tmp))
break;
@@ -293,7 +285,7 @@ void __devinit arch_init_sched_domains(v
#endif
/* Calculate CPU power for physical packages and nodes */
- for_each_cpu_mask(i, cpu_default_map) {
+ for_each_cpu_mask(i, *cpu_map) {
int power;
struct sched_domain *sd;
#ifdef CONFIG_SCHED_SMT
@@ -359,13 +351,36 @@ next_sg:
cpu_attach_domain(sd, i);
}
}
+/*
+ * Set up scheduler domains and groups. Callers must hold the hotplug lock.
+ */
+void arch_init_sched_domains(const cpumask_t *cpu_map)
+{
+ cpumask_t cpu_default_map;
+
+ /*
+ * Setup mask for cpus without special case scheduling requirements.
+ * For now this just excludes isolated cpus, but could be used to
+ * exclude other special cases in the future.
+ */
+ cpus_complement(cpu_default_map, cpu_isolated_map);
+ cpus_and(cpu_default_map, cpu_default_map, *cpu_map);
+
+ build_sched_domains(&cpu_default_map);
+}
-void __devinit arch_destroy_sched_domains(void)
+void arch_destroy_sched_domains(const cpumask_t *cpu_map)
{
#ifdef CONFIG_NUMA
int i;
for (i = 0; i < MAX_NUMNODES; i++) {
+ cpumask_t nodemask = node_to_cpumask(i);
struct sched_group *oldsg, *sg = sched_group_nodes[i];
+
+ cpus_and(nodemask, nodemask, *cpu_map);
+ if (cpus_empty(nodemask))
+ continue;
+
if (sg == NULL)
continue;
sg = sg->next;
* Re: [PATCH 2/3] Dynamic sched domains (v0.6)
2005-05-17 4:12 ` [PATCH 2/3] " Dinakar Guniguntala
@ 2005-05-17 6:25 ` Nick Piggin
2005-05-17 9:35 ` Dinakar Guniguntala
0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2005-05-17 6:25 UTC (permalink / raw)
To: dino
Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
Dipankar Sarma, Andrew Morton
Dinakar Guniguntala wrote:
> o Patch2 has updated cpusets documentation and the core update_cpu_domains
> function
> o I have also moved the dentry d_lock as discussed previously
>
Hi Dinakar,
patch1 looks good. Just one tiny little minor thing:
> +
> + lock_cpu_hotplug();
> + partition_sched_domains(&pspan, &cspan);
> + unlock_cpu_hotplug();
> +}
> +
I don't think the cpu hotplug lock is supposed to provide
synchronisation between readers (for example, it may be turned
into an rwsem), but only between the thread and the cpu hotplug
callbacks.
In that case, can you move this locking into kernel/sched.c, and
add the comment in partition_sched_domains that the callers must
take care of synchronisation (which without reading the code, I
assume you're doing with the cpuset sem?).
If you agree with that change, you can add an
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
to patch 1 and send it to Andrew whenever you're ready (better
CC Ingo as well). If not, please discuss! :)
Thanks,
Nick
* Re: [PATCH 2/3] Dynamic sched domains (v0.6)
2005-05-17 6:25 ` Nick Piggin
@ 2005-05-17 9:35 ` Dinakar Guniguntala
0 siblings, 0 replies; 10+ messages in thread
From: Dinakar Guniguntala @ 2005-05-17 9:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
Dipankar Sarma, Andrew Morton
On Tue, May 17, 2005 at 04:25:37PM +1000, Nick Piggin wrote:
> >+
> >+ lock_cpu_hotplug();
> >+ partition_sched_domains(&pspan, &cspan);
> >+ unlock_cpu_hotplug();
> >+}
> >+
>
> I don't think the cpu hotplug lock is supposed to provide
> synchronisation between readers (for example, it may be turned
> into an rwsem), but only between the thread and the cpu hotplug
> callbacks.
That should be ok
>
> In that case, can you move this locking into kernel/sched.c, and
> add the comment in partition_sched_domains that the callers must
> take care of synchronisation (which without reading the code, I
> assume you're doing with the cpuset sem?).
I didn't want to do this, as my next patch, which introduces
hotplug support for dynamic sched domains, also calls
partition_sched_domains. That code is called with the hotplug lock
already held. (I am still testing that code; it should be out by
this weekend)
However, I will add a comment about the synchronization, and yes,
currently it is taken care of by the cpuset sem
-Dinakar
* Re: [RFT PATCH] Dynamic sched domains (v0.6)
2005-05-17 4:10 [RFT PATCH] Dynamic sched domains (v0.6) Dinakar Guniguntala
2005-05-17 4:12 ` [PATCH 2/3] " Dinakar Guniguntala
2005-05-17 4:14 ` [PATCH 3/3] " Dinakar Guniguntala
@ 2005-05-18 5:53 ` Paul Jackson
2005-05-18 18:06 ` [Lse-tech] " Dinakar Guniguntala
2 siblings, 1 reply; 10+ messages in thread
From: Paul Jackson @ 2005-05-18 5:53 UTC (permalink / raw)
To: dino
Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
dipankar, akpm
Looking good. Some minor comments on these three patches ...
* The name 'nodemask' for the cpumask_t of CPUs that are siblings to CPU i
is a bit confusing (yes, that name was already there). How about
something like 'siblings'?
* I suspect that the following two lines:
cpus_complement(cpu_default_map, cpu_isolated_map);
cpus_and(cpu_default_map, cpu_default_map, *cpu_map);
can be replaced with the one line:
cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
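The equivalence Paul is pointing out here is just De Morgan-style mask
algebra. A quick userspace check, with uint64_t standing in for
cpumask_t (two_step/one_step are illustrative names, not kernel
helpers):

```c
#include <assert.h>
#include <stdint.h>

/* The patch's two-step form: complement the isolated map, then
 * intersect with the incoming cpu_map. */
static uint64_t two_step(uint64_t cpu_map, uint64_t isolated)
{
	uint64_t def = ~isolated;   /* cpus_complement(cpu_default_map, cpu_isolated_map) */
	return def & cpu_map;       /* cpus_and(cpu_default_map, cpu_default_map, *cpu_map) */
}

/* Paul's one-liner: everything in cpu_map that is not isolated. */
static uint64_t one_step(uint64_t cpu_map, uint64_t isolated)
{
	return cpu_map & ~isolated; /* cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map) */
}
```

Both forms always agree, so the single cpus_andnot() loses nothing.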
* You have 'cpu-exclusive' in some places in the Documentation.
I would mildly prefer to always spell this 'cpu_exclusive' (with
underscore, not hyphen).
* I like how this design came out, as described in:
A cpuset that is cpu exclusive has a sched domain associated with it.
The sched domain consists of all cpus in the current cpuset that are not
part of any exclusive child cpusets.
Good work.
* Question - any idea how much of a performance hiccup a system will feel
whenever someone changes the cpu_exclusive cpusets? Could this lead
to a denial-of-service attack, if say some untrusted user were allowed
modify privileges on some small cpuset that was cpu_exclusive, and they
abused that privilege by turning on and off the cpu_exclusive property
on their little cpuset (or creating/destroying an exclusive child):
cd /dev/cpuset/$(cat /proc/self/cpuset)
while true
do
for i in 0 1
do
echo $i > cpu_exclusive
done
done
If so, perhaps we should recommend that shared systems with untrusted
users avoid allowing a cpu_exclusive cpuset to be modifiable, or to have
a cpu_exclusive flag modifiable, by those untrusted users.
* The cpuset 'oldcs' in update_flag() seems to only be used for its
cpu_exclusive flag. We could save some stack space on my favorite
big honkin NUMA iron by just having a local variable for this
'old_cpu_exclusive' value, instead of the entire cpuset.
* Similarly, though with a bit less savings, one could replace 'oldcs'
in update_cpumask() with just the old_cpus_allowed mask.
Or, skip even that, and compute a boolean flag:
cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
before copying over the trialcs, so we only need one word of stack
for the boolean, not possibly many words for a cpumask.
* Non-traditional code style:
}
else {
should be instead:
} else {
* Is it the case that update_cpu_domains() is called with cpuset_sem held?
Would it be a good idea to note in the comment for that routine:
* Call with cpuset_sem held. May nest a call to the
* lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
I didn't call out the cpuset_sem lock precondition on many routines,
but since this one can nest the cpu_hotplug lock, it might be worth
calling it out, for the benefit of engineers who are passing through,
needing to know how the hotplug lock nests with other semaphores.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Lse-tech] Re: [RFT PATCH] Dynamic sched domains (v0.6)
2005-05-18 5:53 ` [RFT PATCH] " Paul Jackson
@ 2005-05-18 18:06 ` Dinakar Guniguntala
2005-05-18 21:02 ` Paul Jackson
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Dinakar Guniguntala @ 2005-05-18 18:06 UTC (permalink / raw)
To: Paul Jackson
Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
dipankar, akpm
On Tue, May 17, 2005 at 10:53:54PM -0700, Paul Jackson wrote:
> Looking good. Some minor comments on these three patches ...
>
> * The name 'nodemask' for the cpumask_t of CPUs that are siblings to CPU i
> is a bit confusing (yes, that name was already there). How about
> something like 'siblings' ?
Not sure which code you are referring to here? I don't see any nodemask
referring to SMT siblings.
> can be replaced with the one line:
>
> cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
yeah, ok
> I would mildly prefer to always spell this 'cpu_exclusive' (with
> underscore, not hyphen).
fine
> Good work.
Thanks !
>
> * Question - any idea how much of a performance hiccup a system will feel
> whenever someone changes the cpu_exclusive cpusets? Could this lead
> to a denial-of-service attack, if say some untrusted user were allowed
> modify privileges on some small cpuset that was cpu_exclusive, and they
> abused that privilege by turning on and off the cpu_exclusive property
> on their little cpuset (or creating/destroying an exclusive child):
>
I tried your script and it shows absolutely no impact in top.
The CPU on which it is running is mostly 100% idle. However, I'll run
more tests to confirm that it has no impact.
>
> * The cpuset 'oldcs' in update_flag() seems to only be used for its
> cpu_exclusive flag. We could save some stack space on my favorite
> big honkin NUMA iron by just having a local variable for this
> 'old_cpu_exclusive' value, instead of the entire cpuset.
>
> * Similarly, though with a bit less savings, one could replace 'oldcs'
> in update_cpumask() with just the old_cpus_allowed mask.
> Or, skip even that, and compute a boolean flag:
> cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
> before copying over the trialcs, so we only need one word of stack
> for the boolean, not possibly many words for a cpumask.
ok for both
>
> * Non-traditional code style:
> }
> else {
> should be instead:
> } else {
I don't know how that snuck back in, I'll change that.
>
> * Is it the case that update_cpu_domains() is called with cpuset_sem held?
> Would it be a good idea to note in the comment for that routine:
> * Call with cpuset_sem held. May nest a call to the
> * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
> I didn't call out the cpuset_sem lock precondition on many routines,
> but since this one can nest the cpu_hotplug lock, it might be worth
> calling it out, for the benefit of engineers who are passing through,
> needing to know how the hotplug lock nests with other semaphores.
ok
I do feel that with the above updates the patches can go into -mm.
Appreciate all the review comments from everyone, Thanks
-Dinakar
* Re: [Lse-tech] Re: [RFT PATCH] Dynamic sched domains (v0.6)
2005-05-18 18:06 ` [Lse-tech] " Dinakar Guniguntala
@ 2005-05-18 21:02 ` Paul Jackson
2005-05-18 21:04 ` Paul Jackson
2005-05-18 21:05 ` Paul Jackson
2 siblings, 0 replies; 10+ messages in thread
From: Paul Jackson @ 2005-05-18 21:02 UTC (permalink / raw)
To: dino
Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
dipankar, akpm
Dinakar wrote:
> > * The name 'nodemask' for the cpumask_t of CPUs that are siblings to CPU i
> > is a bit confusing (yes, that name was already there). How about
> > something like 'siblings' ?
>
> Not sure which code you are referring to here? I don't see any nodemask
> referring to SMT siblings.
This comment was referring to lines such as the following, which appear
a few places in your patch (though not lines you wrote, just nearby
lines, in all but one case):
cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
I was thinking to change such a line to:
cpumask_t sibling = node_to_cpumask(cpu_to_node(i));
However, it is no biggie, and since it is not in your actual new
code, probably should not be part of your patch anyway.
There is one place, arch_destroy_sched_domains(), where you added such a
line, but there you should probably use the same 'nodemask' name as the
other couple of places, unless and until these places change together.
So bottom line - nevermind this comment.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
* Re: [Lse-tech] Re: [RFT PATCH] Dynamic sched domains (v0.6)
2005-05-18 18:06 ` [Lse-tech] " Dinakar Guniguntala
2005-05-18 21:02 ` Paul Jackson
@ 2005-05-18 21:04 ` Paul Jackson
2005-05-18 21:05 ` Paul Jackson
2 siblings, 0 replies; 10+ messages in thread
From: Paul Jackson @ 2005-05-18 21:04 UTC (permalink / raw)
To: dino
Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
dipankar, akpm
Dinakar wrote:
> I tried your script and it shows absolutely no impact in top.
> The CPU on which it is running is mostly 100% idle. However, I'll run
> more tests to confirm that it has no impact.
I have no particular intuition, one way or the other, on how much a
dynamic reallocation of sched domains will impact the system. So
once you are comfortable that this is not normally a problem (which
you might already be), don't worry about it further on my account.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
* Re: [Lse-tech] Re: [RFT PATCH] Dynamic sched domains (v0.6)
2005-05-18 18:06 ` [Lse-tech] " Dinakar Guniguntala
2005-05-18 21:02 ` Paul Jackson
2005-05-18 21:04 ` Paul Jackson
@ 2005-05-18 21:05 ` Paul Jackson
2 siblings, 0 replies; 10+ messages in thread
From: Paul Jackson @ 2005-05-18 21:05 UTC (permalink / raw)
To: dino
Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
dipankar, akpm
Dinakar wrote:
> I do feel with the above updates the patches can go into -mm.
Acked-by: Paul Jackson <pj@sgi.com>
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401
Thread overview: 10+ messages
2005-05-17 4:10 [RFT PATCH] Dynamic sched domains (v0.6) Dinakar Guniguntala
2005-05-17 4:12 ` [PATCH 2/3] " Dinakar Guniguntala
2005-05-17 6:25 ` Nick Piggin
2005-05-17 9:35 ` Dinakar Guniguntala
2005-05-17 4:14 ` [PATCH 3/3] " Dinakar Guniguntala
2005-05-18 5:53 ` [RFT PATCH] " Paul Jackson
2005-05-18 18:06 ` [Lse-tech] " Dinakar Guniguntala
2005-05-18 21:02 ` Paul Jackson
2005-05-18 21:04 ` Paul Jackson
2005-05-18 21:05 ` Paul Jackson