public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH] Dynamic sched domains (v0.5)
@ 2005-05-01 19:09 Dinakar Guniguntala
  2005-05-02  9:10 ` Nick Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-01 19:09 UTC (permalink / raw)
  To: Paul Jackson, Simon Derr, Nick Piggin, lkml, lse-tech,
	Matthew Dobson, Dipankar Sarma, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2405 bytes --]


Ok, Here is the minimal patchset that I had promised after the
last discussion.

What it does have
o  The current patch enhances the meaning of exclusive cpusets by
   attaching them (exclusive cpusets) to sched domains
o  It does _not_ require any additional cpumask_t variable. It
   just parses the cpus_allowed of the parent/sibling/children
   cpusets for manipulating sched domains
o  All existing operations on non-/exclusive cpusets are preserved as-is.
o  The sched code has been modified to bring it up to 2.6.12-rc2-mm3

Usage
o  On setting the cpu_exclusive flag of a cpuset X, it creates two
   sched domains
   a. One, all cpus from X's parent cpuset that don't belong to any
      exclusive sibling cpuset of X
   b. Two, all cpus in X's cpus_allowed
o  On adding/deleting cpus to/from an exclusive cpuset X that has exclusive
   children, it creates two sched domains
   a. One, all cpus from X's parent cpuset that don't belong to any
      exclusive sibling cpuset of X
   b. Two, all cpus in X's cpus_allowed, after taking away any cpus that
      belong to exclusive child cpusets of X
o  On unsetting the cpu_exclusive flag of cpuset X or rmdir X, it creates a
   single sched domain, containing all cpus from X's parent cpuset that
   don't belong to any exclusive sibling of X, plus the cpus of X
o  It does _not_ modify the cpus_allowed variable of the parent as the
   previous version did. It relies on user space to move tasks to the
   proper cpusets for complete isolation/exclusion
o  See the function update_cpu_domains for more details
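The two domain spans described above are plain cpumask arithmetic. As a rough
user-space model (a hypothetical sketch, not the kernel code: cpumask_t is
modeled as an unsigned long bitmask and the exclusive children are passed as
flat arrays), the computation in update_cpu_domains amounts to:

```c
#include <assert.h>

typedef unsigned long mask_t;	/* toy stand-in for cpumask_t */

/*
 * Domain 1: cpus of X's parent minus the cpus of every cpu_exclusive
 * child of that parent (X itself included, mirroring the walk over
 * par->children in update_cpu_domains).
 */
mask_t span1(mask_t parent_cpus, const mask_t *excl, int n)
{
	mask_t m = parent_cpus;
	int i;

	for (i = 0; i < n; i++)
		m &= ~excl[i];	/* cpus_andnot() */
	return m;
}

/*
 * Domain 2: X's cpus_allowed minus the cpus of X's own cpu_exclusive
 * children.
 */
mask_t span2(mask_t x_cpus, const mask_t *excl, int n)
{
	mask_t m = x_cpus;
	int i;

	for (i = 0; i < n; i++)
		m &= ~excl[i];
	return m;
}
```

For example, with a parent owning cpus 0-7 (0xff), an exclusive cpuset
X = {2,3} (0x0c), an exclusive sibling {4,5} (0x30), and an exclusive child
of X = {3} (0x08), span1 comes out as {0,1,6,7} (0xc3) and span2 as {2}
(0x04); note the two spans are disjoint.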

What it does not have
o  It is still short on documentation
o  Does not have hotplug support as yet
o  Supports only x86 as of now
o  No thoughts on "memory domains" (though I am not sure who
   would use such a thing or what exactly the requirements are)


Here is the increase in binary sizes for different values of CONFIG_NR_CPUS
(32 and 255)

kernel/cpuset.o.orig-32         16728
kernel/cpuset.o-v0.5-32         17176 (~2.7% increase)

kernel/cpuset.o.orig-255        17916
kernel/cpuset.o-v0.5-255        19120 (~6.7% increase)

The code changes are also minimal

 include/linux/init.h  |    2
 include/linux/sched.h |    1
 kernel/cpuset.c       |   73 +++++++++++++++++++++++++++------
 kernel/sched.c        |  109 +++++++++++++++++++++++++++++++++-----------------
 4 files changed, 134 insertions(+), 51 deletions(-)


        -Dinakar



[-- Attachment #2: dyn-sd-v0.5-1.patch --]
[-- Type: text/plain, Size: 6388 bytes --]

diff -Naurp linux-2.6.12-rc2.orig/include/linux/init.h linux-2.6.12-rc2/include/linux/init.h
--- linux-2.6.12-rc2.orig/include/linux/init.h	2005-04-04 22:07:52.000000000 +0530
+++ linux-2.6.12-rc2/include/linux/init.h	2005-05-01 22:07:56.000000000 +0530
@@ -217,7 +217,7 @@ void __init parse_early_param(void);
 #define __initdata_or_module __initdata
 #endif /*CONFIG_MODULES*/
 
-#ifdef CONFIG_HOTPLUG
+#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
 #define __devinit
 #define __devinitdata
 #define __devexit
diff -Naurp linux-2.6.12-rc2.orig/include/linux/sched.h linux-2.6.12-rc2/include/linux/sched.h
--- linux-2.6.12-rc2.orig/include/linux/sched.h	2005-04-28 18:24:11.000000000 +0530
+++ linux-2.6.12-rc2/include/linux/sched.h	2005-05-01 22:07:35.000000000 +0530
@@ -155,6 +155,7 @@ typedef struct task_struct task_t;
 extern void sched_init(void);
 extern void sched_init_smp(void);
 extern void init_idle(task_t *idle, int cpu);
+extern void rebuild_sched_domains(cpumask_t span1, cpumask_t span2);
 
 extern cpumask_t nohz_cpu_mask;
 
diff -Naurp linux-2.6.12-rc2.orig/kernel/sched.c linux-2.6.12-rc2/kernel/sched.c
--- linux-2.6.12-rc2.orig/kernel/sched.c	2005-04-28 18:24:11.000000000 +0530
+++ linux-2.6.12-rc2/kernel/sched.c	2005-05-01 22:06:55.000000000 +0530
@@ -4526,7 +4526,7 @@ int __init migration_init(void)
 #endif
 
 #ifdef CONFIG_SMP
-#define SCHED_DOMAIN_DEBUG
+#undef SCHED_DOMAIN_DEBUG
 #ifdef SCHED_DOMAIN_DEBUG
 static void sched_domain_debug(struct sched_domain *sd, int cpu)
 {
@@ -4934,40 +4934,50 @@ static void check_sibling_maps(void)
 }
 #endif
 
-/*
- * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
- */
-static void __devinit arch_init_sched_domains(void)
+static inline void detach_domains(cpumask_t cpu_map)
 {
 	int i;
-	cpumask_t cpu_default_map;
 
-#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
-	check_sibling_maps();
+	for_each_cpu_mask(i, cpu_map)
+		cpu_attach_domain(NULL, i);
+	synchronize_kernel();
+}
+
+static inline void attach_domains(cpumask_t cpu_map)
+{
+	int i;
+
+	/* Attach the domains */
+	for_each_cpu_mask(i, cpu_map) {
+		struct sched_domain *sd;
+#ifdef CONFIG_SCHED_SMT
+		sd = &per_cpu(cpu_domains, i);
+#else
+		sd = &per_cpu(phys_domains, i);
 #endif
-	/*
-	 * Setup mask for cpus without special case scheduling requirements.
-	 * For now this just excludes isolated cpus, but could be used to
-	 * exclude other special cases in the future.
-	 */
-	cpus_complement(cpu_default_map, cpu_isolated_map);
-	cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);
+		cpu_attach_domain(sd, i);
+	}
+}
+
+static void build_sched_domains(cpumask_t cpu_map)
+{
+	int i;
 
 	/*
-	 * Set up domains. Isolated domains just stay on the NULL domain.
+	 * Set up domains as specified by the cpu_map.
 	 */
-	for_each_cpu_mask(i, cpu_default_map) {
+	for_each_cpu_mask(i, cpu_map) {
 		int group;
 		struct sched_domain *sd = NULL, *p;
 		cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
 
-		cpus_and(nodemask, nodemask, cpu_default_map);
+		cpus_and(nodemask, nodemask, cpu_map);
 
 #ifdef CONFIG_NUMA
 		sd = &per_cpu(node_domains, i);
 		group = cpu_to_node_group(i);
 		*sd = SD_NODE_INIT;
-		sd->span = cpu_default_map;
+		sd->span = cpu_map;
 		sd->groups = &sched_group_nodes[group];
 #endif
 
@@ -4985,7 +4995,7 @@ static void __devinit arch_init_sched_do
 		group = cpu_to_cpu_group(i);
 		*sd = SD_SIBLING_INIT;
 		sd->span = cpu_sibling_map[i];
-		cpus_and(sd->span, sd->span, cpu_default_map);
+		cpus_and(sd->span, sd->span, cpu_map);
 		sd->parent = p;
 		sd->groups = &sched_group_cpus[group];
 #endif
@@ -4995,7 +5005,7 @@ static void __devinit arch_init_sched_do
 	/* Set up CPU (sibling) groups */
 	for_each_online_cpu(i) {
 		cpumask_t this_sibling_map = cpu_sibling_map[i];
-		cpus_and(this_sibling_map, this_sibling_map, cpu_default_map);
+		cpus_and(this_sibling_map, this_sibling_map, cpu_map);
 		if (i != first_cpu(this_sibling_map))
 			continue;
 
@@ -5008,7 +5018,7 @@ static void __devinit arch_init_sched_do
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		cpumask_t nodemask = node_to_cpumask(i);
 
-		cpus_and(nodemask, nodemask, cpu_default_map);
+		cpus_and(nodemask, nodemask, cpu_map);
 		if (cpus_empty(nodemask))
 			continue;
 
@@ -5018,12 +5028,12 @@ static void __devinit arch_init_sched_do
 
 #ifdef CONFIG_NUMA
 	/* Set up node groups */
-	init_sched_build_groups(sched_group_nodes, cpu_default_map,
+	init_sched_build_groups(sched_group_nodes, cpu_map,
 					&cpu_to_node_group);
 #endif
 
 	/* Calculate CPU power for physical packages and nodes */
-	for_each_cpu_mask(i, cpu_default_map) {
+	for_each_cpu_mask(i, cpu_map) {
 		int power;
 		struct sched_domain *sd;
 #ifdef CONFIG_SCHED_SMT
@@ -5045,17 +5055,46 @@ static void __devinit arch_init_sched_do
 		}
 #endif
 	}
+}
 
-	/* Attach the domains */
-	for_each_online_cpu(i) {
-		struct sched_domain *sd;
-#ifdef CONFIG_SCHED_SMT
-		sd = &per_cpu(cpu_domains, i);
-#else
-		sd = &per_cpu(phys_domains, i);
+void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
+{
+	cpumask_t change_map;
+
+	cpus_or(change_map, span1, span2);
+
+	preempt_disable();
+	detach_domains(change_map);
+
+	if (!cpus_empty(span1))
+		build_sched_domains(span1);
+	if (!cpus_empty(span2))
+		build_sched_domains(span2);
+
+	attach_domains(change_map);
+	preempt_enable();
+}
+
+/*
+ * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
+ */
+static void __devinit arch_init_sched_domains(void)
+{
+	cpumask_t cpu_default_map;
+
+#if defined(CONFIG_SCHED_SMT) && defined(CONFIG_NUMA)
+	check_sibling_maps();
 #endif
-		cpu_attach_domain(sd, i);
-	}
+	/*
+	 * Setup mask for cpus without special case scheduling requirements.
+	 * For now this just excludes isolated cpus, but could be used to
+	 * exclude other special cases in the future.
+	 */
+	cpus_complement(cpu_default_map, cpu_isolated_map);
+	cpus_and(cpu_default_map, cpu_default_map, cpu_online_map);
+
+	build_sched_domains(cpu_default_map);
+	attach_domains(cpu_default_map);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -5082,9 +5121,7 @@ static int update_sched_domains(struct n
 	switch (action) {
 	case CPU_UP_PREPARE:
 	case CPU_DOWN_PREPARE:
-		for_each_online_cpu(i)
-			cpu_attach_domain(NULL, i);
-		synchronize_kernel();
+		detach_domains(cpu_online_map);
 		arch_destroy_sched_domains();
 		return NOTIFY_OK;
 

[-- Attachment #3: dyn-sd-v0.5-2.patch --]
[-- Type: text/plain, Size: 3279 bytes --]

diff -Naurp linux-2.6.12-rc2.orig/kernel/cpuset.c linux-2.6.12-rc2/kernel/cpuset.c
--- linux-2.6.12-rc2.orig/kernel/cpuset.c	2005-04-28 18:24:11.000000000 +0530
+++ linux-2.6.12-rc2/kernel/cpuset.c	2005-05-01 22:15:06.000000000 +0530
@@ -602,12 +602,48 @@ static int validate_change(const struct 
 	return 0;
 }
 
+static void update_cpu_domains(struct cpuset *cur)
+{
+	struct cpuset *c, *par = cur->parent;
+	cpumask_t span1, span2;
+
+	if (par == NULL || cpus_empty(cur->cpus_allowed))
+		return;
+
+	/* Get all non-exclusive cpus from parent domain */
+	span1 = par->cpus_allowed;
+	list_for_each_entry(c, &par->children, sibling) {
+		if (is_cpu_exclusive(c))
+			cpus_andnot(span1, span1, c->cpus_allowed);
+	}
+	if (is_removed(cur) || !is_cpu_exclusive(cur)) {
+		cpus_or(span1, span1, cur->cpus_allowed);
+		if (cpus_equal(span1, cur->cpus_allowed))
+			return;
+		span2 = CPU_MASK_NONE;
+	}
+	else {
+		if (cpus_empty(span1))
+			return;
+		span2 = cur->cpus_allowed;
+		/* If current cpuset has exclusive children, exclude from domain */
+		list_for_each_entry(c, &cur->children, sibling) {
+			if (is_cpu_exclusive(c))
+				cpus_andnot(span2, span2, c->cpus_allowed);
+		}
+	}
+
+	lock_cpu_hotplug();
+	rebuild_sched_domains(span1, span2);
+	unlock_cpu_hotplug();
+}
+
 static int update_cpumask(struct cpuset *cs, char *buf)
 {
-	struct cpuset trialcs;
+	struct cpuset trialcs, oldcs;
 	int retval;
 
-	trialcs = *cs;
+	trialcs = oldcs = *cs;
 	retval = cpulist_parse(buf, trialcs.cpus_allowed);
 	if (retval < 0)
 		return retval;
@@ -615,9 +651,13 @@ static int update_cpumask(struct cpuset 
 	if (cpus_empty(trialcs.cpus_allowed))
 		return -ENOSPC;
 	retval = validate_change(cs, &trialcs);
-	if (retval == 0)
-		cs->cpus_allowed = trialcs.cpus_allowed;
-	return retval;
+	if (retval < 0)
+		return retval;
+	cs->cpus_allowed = trialcs.cpus_allowed;
+	if (is_cpu_exclusive(cs) && 
+	    (!cpus_equal(cs->cpus_allowed, oldcs.cpus_allowed)))
+		update_cpu_domains(cs);
+	return 0;
 }
 
 static int update_nodemask(struct cpuset *cs, char *buf)
@@ -652,25 +692,28 @@ static int update_nodemask(struct cpuset
 static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf)
 {
 	int turning_on;
-	struct cpuset trialcs;
+	struct cpuset trialcs, oldcs;
 	int err;
 
 	turning_on = (simple_strtoul(buf, NULL, 10) != 0);
 
-	trialcs = *cs;
+	trialcs = oldcs = *cs;
 	if (turning_on)
 		set_bit(bit, &trialcs.flags);
 	else
 		clear_bit(bit, &trialcs.flags);
 
 	err = validate_change(cs, &trialcs);
-	if (err == 0) {
-		if (turning_on)
-			set_bit(bit, &cs->flags);
-		else
-			clear_bit(bit, &cs->flags);
-	}
-	return err;
+	if (err < 0)
+		return err;
+	if (turning_on)
+		set_bit(bit, &cs->flags);
+	else
+		clear_bit(bit, &cs->flags);
+
+	if (is_cpu_exclusive(cs) != is_cpu_exclusive(&oldcs))
+                update_cpu_domains(cs);
+	return 0;
 }
 
 static int attach_task(struct cpuset *cs, char *buf)
@@ -1319,6 +1362,8 @@ static int cpuset_rmdir(struct inode *un
 	spin_lock(&cs->dentry->d_lock);
 	parent = cs->parent;
 	set_bit(CS_REMOVED, &cs->flags);
+	if (is_cpu_exclusive(cs))
+		update_cpu_domains(cs);
 	list_del(&cs->sibling);	/* delete my sibling from parent->children */
 	if (list_empty(&parent->children))
 		check_for_release(parent);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-01 19:09 [RFC PATCH] Dynamic sched domains (v0.5) Dinakar Guniguntala
@ 2005-05-02  9:10 ` Nick Piggin
  2005-05-02 17:17   ` Dinakar Guniguntala
  2005-05-02  9:44 ` Nick Piggin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2005-05-02  9:10 UTC (permalink / raw)
  To: dino
  Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
	Dipankar Sarma, Andrew Morton

Dinakar Guniguntala wrote:
> Ok, Here is the minimal patchset that I had promised after the
> last discussion.
> 

The sched-domains part of it (kernel/sched.c) is looking much
better now. Haven't had a really good look, but it is definitely
on the right track now. Well done.

As I said before, I am not expert on the cpusets side of things,
but as far as sched-domains partitioning goes, we really just
want the absolute minimum support in kernel/sched.c which can be
managed by a higher layer.

I'll review it in detail when it gets into -mm, but I don't expect
to find any major problems.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-01 19:09 [RFC PATCH] Dynamic sched domains (v0.5) Dinakar Guniguntala
  2005-05-02  9:10 ` Nick Piggin
@ 2005-05-02  9:44 ` Nick Piggin
  2005-05-02 17:16   ` Dinakar Guniguntala
  2005-05-02 18:01 ` Paul Jackson
  2005-05-03 22:03 ` Matthew Dobson
  3 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2005-05-02  9:44 UTC (permalink / raw)
  To: dino
  Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
	Dipankar Sarma, Andrew Morton

Dinakar Guniguntala wrote:

> +void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
> +{
> +	cpumask_t change_map;
> +
> +	cpus_or(change_map, span1, span2);
> +
> +	preempt_disable();

Oh, you can't do this here, detach_domains does a synchronize_kernel.
So take it out, it doesn't do anything anyway, does it?

I suggest you also use some sort of locking to prevent concurrent rebuilds
and rebuilds racing with cpu hotplug. You could probably have a static
semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.

Or alternatively take the semaphore in the cpu hotplug notifier as well...
Maybe both - neither is performance-critical code.
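A sketch of that suggestion (hypothetical and untested; DECLARE_MUTEX is the
2.6-era semaphore initializer, and this assumes the caller would no longer
take lock_cpu_hotplug() around the call itself, so the hotplug lock is not
taken twice):

```c
/* Serialize domain rebuilds against each other and against cpu hotplug. */
static DECLARE_MUTEX(rebuild_sem);

void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
{
	cpumask_t change_map;

	down(&rebuild_sem);
	lock_cpu_hotplug();

	cpus_or(change_map, span1, span2);
	detach_domains(change_map);
	if (!cpus_empty(span1))
		build_sched_domains(span1);
	if (!cpus_empty(span2))
		build_sched_domains(span2);
	attach_domains(change_map);

	unlock_cpu_hotplug();
	up(&rebuild_sem);
}
```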

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-02  9:44 ` Nick Piggin
@ 2005-05-02 17:16   ` Dinakar Guniguntala
  2005-05-02 23:23     ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-02 17:16 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
	Dipankar Sarma, Andrew Morton

On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
> Dinakar Guniguntala wrote:
> 
> >+void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
> >+{
> >+	cpumask_t change_map;
> >+
> >+	cpus_or(change_map, span1, span2);
> >+
> >+	preempt_disable();
> 
> Oh, you can't do this here, detach_domains does a synchronize_kernel.
> So take it out, it doesn't do anything anyway, does it?

I put that in to prevent hangs with CONFIG_PREEMPT turned on, but
clearly didn't test it with preempt turned on. Looks like all I need to 
do here is a local_irq_disable

> 
> I suggest you also use some sort of locking to prevent concurrent rebuilds
> and rebuilds racing with cpu hotplug. You could probably have a static
> semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.

I already do a lock_cpu_hotplug() in cpuset.c before calling 
rebuild_sched_domains and also am holding cpuset_sem, so that should take
care of both hotplug and concurrent rebuilds

	-Dinakar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-02  9:10 ` Nick Piggin
@ 2005-05-02 17:17   ` Dinakar Guniguntala
  0 siblings, 0 replies; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-02 17:17 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
	Dipankar Sarma, Andrew Morton

On Mon, May 02, 2005 at 07:10:35PM +1000, Nick Piggin wrote:
> Dinakar Guniguntala wrote:
> >Ok, Here is the minimal patchset that I had promised after the
> >last discussion.
> >
> 
> The sched-domains part of it (kernel/sched.c) is looking much
> better now. Haven't had a really good look, but it is definitely
> on the right track now. Well done.
> 
Thank you for reviewing !

	-Dinakar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-01 19:09 [RFC PATCH] Dynamic sched domains (v0.5) Dinakar Guniguntala
  2005-05-02  9:10 ` Nick Piggin
  2005-05-02  9:44 ` Nick Piggin
@ 2005-05-02 18:01 ` Paul Jackson
  2005-05-03 14:44   ` Dinakar Guniguntala
  2005-05-03 22:03 ` Matthew Dobson
  3 siblings, 1 reply; 17+ messages in thread
From: Paul Jackson @ 2005-05-02 18:01 UTC (permalink / raw)
  To: dino
  Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
	dipankar, akpm

Thanks for the minimalist patch.  I'll review it in more detail when I
get a chance.  It looks promising, more promising than before.


My current concerns include:
 o Having it work on ia64 would facilitate my testing.
 o _Still_ no clear statement of the requirements - see below.
 o Does this patch ensure that isolated sched domains form
      a partition (disjoint cover) of a systems CPUs?  Should it?
 o Does this change any documented semantics of cpusets?  I don't
      see offhand that it does.  Perhaps that's good.  Perhaps
      I missed something.


I am still in the frustrating position of playing guessing games with
what your essential requirements are, the one or two features that you
require.  The closest we got was a six step list which was really more
like pseudo code for a prior implementation.

I see nothing on first glance in your latest patch that responds to my
previous attempt to ask this.  I don't see how I can improve on the
wording of that question, so I repeat myself ...

On April 25, pj wrote:
======================

A few days ago, you provided a six step list, under the introduction:
> Ok, Let me begin at the beginning and attempt to define what I am 
> doing here

I suspect those six steps were not really your essential requirements,
but one possible procedure that accomplishes them.

So far I am guessing that your requirement(s) are one or both of the
following two items:

 (1) avoid having the scheduler wasting too much time trying to
     load balance tasks that only turn out to be not allowed on
     the cpus the scheduler is considering, and/or
 (2) provide improved administrative control of a system by being
     able to construct a cpuset that is guaranteed to have no
     overlap of allowed cpus with its parent or any other cpuset
     not descendent from it.

If (1) is one of your essential requirements, then I have described a
couple of implementations that mark some existing cpusets to form a
partition (in the mathematical sense of a disjoint covering of subsets)
of the system to define isolated scheduler domains.  I did this without
adding any additional bitmasks to each cpuset.

If (2) is one of your essential requirements, then I believe this can be
done with the current cpuset kernel code, entirely with additional user
level code.

> I am working on a minimalistic design right now and will get back in
> a day or two

Good.

Hopefully, you will also be able to get through my thick head what your
essential requirement(s) is or are.


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-02 17:16   ` Dinakar Guniguntala
@ 2005-05-02 23:23     ` Nick Piggin
  2005-05-03 14:58       ` Dinakar Guniguntala
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2005-05-02 23:23 UTC (permalink / raw)
  To: dino
  Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
	Dipankar Sarma, Andrew Morton

Dinakar Guniguntala wrote:
> On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
> 
>>Dinakar Guniguntala wrote:
>>
>>
>>>+void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
>>>+{
>>>+	cpumask_t change_map;
>>>+
>>>+	cpus_or(change_map, span1, span2);
>>>+
>>>+	preempt_disable();
>>
>>Oh, you can't do this here, detach_domains does a synchronize_kernel.
>>So take it out, it doesn't do anything anyway, does it?
> 
> 
> I put that in to prevent hangs with CONFIG_PREEMPT turned on, but
> clearly didn't test it with preempt turned on. Looks like all I need to 
> do here is a local_irq_disable
> 

What are you protecting against, though? synchronize_kernel can
sleep, so local_irq_disable is probably the wrong thing to do as well.

AFAIKS, you don't need anything here - so long as you have mutual
exclusion from other sched-domain building then this can take as long
as it wants / be preempted as many times as we like.

> 
>>I suggest you also use some sort of locking to prevent concurrent rebuilds
>>and rebuilds racing with cpu hotplug. You could probably have a static
>>semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.
> 
> 
> I already do a lock_cpu_hotplug() in cpuset.c before calling 
> rebuild_sched_domains and also am holding cpuset_sem, so that should take
> care of both hotplug and concurrent rebuilds
> 

OK.

But if we want this to be a respectable interface (possibly for more than
just cpusets) then it should probably do some locking itself. It isn't
performance critical, so I think taking a semaphore wouldn't hurt.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-02 18:01 ` Paul Jackson
@ 2005-05-03 14:44   ` Dinakar Guniguntala
  2005-05-03 15:21     ` Paul Jackson
  2005-05-03 15:24     ` Paul Jackson
  0 siblings, 2 replies; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-03 14:44 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
	dipankar, akpm

On Mon, May 02, 2005 at 11:01:35AM -0700, Paul Jackson wrote:
> My current concerns include:
>  o Having it work on ia64 would facilitate my testing.

   I am working on making changes for ia64; they should be ready pretty soon

>  o Does this patch ensure that isolated sched domains form
>       a partition (disjoint cover) of a systems CPUs?  Should it?

   With this patch the cpus of an exclusive cpuset form a sched domain.
   Since only exclusive cpusets can form a sched domain, this ensures
   that the cpus form a disjoint cover

>  o Does this change any documented semantics of cpusets?  I don't
>       see offhand that it does.  Perhaps that's good.  Perhaps
>       I missed something.

   No, all semantics continue to be the same as before


I have trimmed the requirements to only the absolute minimum in
kernel space

1.  Partitioning of the system (both cpus and memory) to ensure
    dedicated resources are available for application use.
    This has to be done through user space with the help of the
    existing cpuset infrastructure; no additional changes are
    required in the kernel

2.  Remove unnecessary scheduler load balancing overhead in the
    partitions mentioned above
    a. Ensure that load balance code is aware of partitioning of cpus
       and load balance happens within these partitions and not
       across the entire system
    b. Provide for complete removal of load-balancing on a given
       partition of cpus

    This is necessary for a variety of workloads, including real-time,
    HPC, and any mix of these workloads.
    In the current patch, only 2(a) has been addressed.
    I intend to add support for 2(b) once the current patch is acceptable
    to everyone; I think this should not be a major change

3.  Supporting CPU hotplug is another requirement, though not a direct one


Hope this helps

	-Dinakar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-02 23:23     ` Nick Piggin
@ 2005-05-03 14:58       ` Dinakar Guniguntala
  2005-05-03 15:31         ` Paul Jackson
  0 siblings, 1 reply; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-03 14:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Paul Jackson, Simon Derr, lkml, lse-tech, Matthew Dobson,
	Dipankar Sarma, Andrew Morton

> >On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
> 
> What are you protecting against, though? synchronize_kernel can
> sleep, so local_irq_disable is probably the wrong thing to do as well.

Paul, any reason why the code marked "####" (in cpuset_rmdir) is under
the dentry lock?

	spin_lock(&cs->dentry->d_lock);
        parent = cs->parent;			####
        set_bit(CS_REMOVED, &cs->flags);	####
        if (is_cpu_exclusive(cs))
                update_cpu_domains(cs);
        list_del(&cs->sibling); /* delete my sibling from parent->children */
        if (list_empty(&parent->children))
                check_for_release(parent);
        d = dget(cs->dentry);			<----
        cs->dentry = NULL;			<----
        spin_unlock(&d->d_lock);


As far as I can see only the ones marked "<----" should be under the
dentry lock, considering the fact that it already holds the cpuset_sem
all the while.

I saw that calling update_cpu_domains with the dentry lock held
causes it to oops with preempt turned on (scheduling while atomic).
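The proposed reordering would look roughly like this (a hypothetical sketch
of the cpuset_rmdir body, assuming cpuset_sem is held throughout as in the
existing code):

```c
	parent = cs->parent;
	set_bit(CS_REMOVED, &cs->flags);
	if (is_cpu_exclusive(cs))
		update_cpu_domains(cs);	/* may sleep; no spinlock held now */
	list_del(&cs->sibling);	/* delete my sibling from parent->children */
	if (list_empty(&parent->children))
		check_for_release(parent);
	spin_lock(&cs->dentry->d_lock);	/* d_lock only around the dentry bits */
	d = dget(cs->dentry);
	cs->dentry = NULL;
	spin_unlock(&d->d_lock);
```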

	-Dinakar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-03 14:44   ` Dinakar Guniguntala
@ 2005-05-03 15:21     ` Paul Jackson
  2005-05-03 15:24     ` Paul Jackson
  1 sibling, 0 replies; 17+ messages in thread
From: Paul Jackson @ 2005-05-03 15:21 UTC (permalink / raw)
  To: dino
  Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
	dipankar, akpm

Dinakar wrote:
>   Since only exclusive cpusets can form a sched domain, this ensures
>   that the cpus form a disjoint cover

I don't understand why this is so.

Certainly, the exclusive cpusets do not form a disjoint cover.  Not
disjoint, because the top cpuset is exclusive, and overlaps with all
other cpusets.  Nor do the other exclusive cpusets, excluding the top,
necessarily form a cover - there might not even be any other such
cpusets.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-03 14:44   ` Dinakar Guniguntala
  2005-05-03 15:21     ` Paul Jackson
@ 2005-05-03 15:24     ` Paul Jackson
  1 sibling, 0 replies; 17+ messages in thread
From: Paul Jackson @ 2005-05-03 15:24 UTC (permalink / raw)
  To: dino
  Cc: Simon.Derr, nickpiggin, linux-kernel, lse-tech, colpatch,
	dipankar, akpm

Dinakar wrote:
>   No, all semantics continue to be the same as before

Good.

> I have trimmed the requirements to only the absolute minimum in
> kernel space

Good.

> Partitioning of the system ... done through user space

Ok.

> Remove unnecessary scheduler load balancing ...

Ok.

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-03 14:58       ` Dinakar Guniguntala
@ 2005-05-03 15:31         ` Paul Jackson
  0 siblings, 0 replies; 17+ messages in thread
From: Paul Jackson @ 2005-05-03 15:31 UTC (permalink / raw)
  To: dino
  Cc: nickpiggin, Simon.Derr, linux-kernel, lse-tech, colpatch,
	dipankar, akpm

Dinakar wrote:
> As far as I can see only the ones marked "<----" should be under the
> dentry lock, considering the fact that it already holds the cpuset_sem
> all the while.

It looks that way to me, too.

I doubt we had any particular reason for taking the dentry lock as
early as we do in that code.

It's ok by me if you move the dentry lock lower, as you suggest.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-01 19:09 [RFC PATCH] Dynamic sched domains (v0.5) Dinakar Guniguntala
                   ` (2 preceding siblings ...)
  2005-05-02 18:01 ` Paul Jackson
@ 2005-05-03 22:03 ` Matthew Dobson
  2005-05-04  0:08   ` Nick Piggin
  2005-05-05 13:26   ` Dinakar Guniguntala
  3 siblings, 2 replies; 17+ messages in thread
From: Matthew Dobson @ 2005-05-03 22:03 UTC (permalink / raw)
  To: dino
  Cc: Paul Jackson, Simon Derr, Nick Piggin, lkml, lse-tech,
	Dipankar Sarma, Andrew Morton

Dinakar Guniguntala wrote:
> Ok, Here is the minimal patchset that I had promised after the
> last discussion.
> 
> What it does have
> o  The current patch enhances the meaning of exclusive cpusets by
>    attaching them (exclusive cpusets) to sched domains
> o  It does _not_ require any additional cpumask_t variable. It
>    just parses the cpus_allowed of the parent/sibling/children
>    cpusets for manipulating sched domains
> o  All existing operations on non-/exclusive cpusets are preserved as-is.
> o  The sched code has been modified to bring it up to 2.6.12-rc2-mm3
> 
> Usage
> o  On setting the cpu_exclusive flag of a cpuset X, it creates two
>    sched domains
>    a. One, all cpus from X's parent cpuset that don't belong to any
>       exclusive sibling cpuset of X
>    b. Two, all cpus in X's cpus_allowed
> o  On adding/deleting cpus to/from an exclusive cpuset X that has exclusive
>    children, it creates two sched domains
>    a. One, all cpus from X's parent cpuset that don't belong to any
>       exclusive sibling cpuset of X
>    b. Two, all cpus in X's cpus_allowed, after taking away any cpus that
>       belong to exclusive child cpusets of X
> o  On unsetting the cpu_exclusive flag of cpuset X or rmdir X, it creates a
>    single sched domain, containing all cpus from X's parent cpuset that
>    don't belong to any exclusive sibling of X, plus the cpus of X
> o  It does _not_ modify the cpus_allowed variable of the parent as the
>    previous version did. It relies on user space to move tasks to the
>    proper cpusets for complete isolation/exclusion
> o  See the function update_cpu_domains for more details
> 
> What it does not have
> o  It is still short on documentation
> o  Does not have hotplug support as yet
> o  Supports only x86 as of now
> o  No thoughts on "memory domains" (though I am not sure who
>    would use such a thing or what exactly the requirements are)

An interesting feature.  I tried a while ago to get cpusets and
sched_domains to play nice (nicer?) and didn't have much luck.  It seems
you're taking a better approach, with smaller patches.  Good luck!


> diff -Naurp linux-2.6.12-rc2.orig/include/linux/init.h linux-2.6.12-rc2/include/linux/init.h
> --- linux-2.6.12-rc2.orig/include/linux/init.h	2005-04-04 22:07:52.000000000 +0530
> +++ linux-2.6.12-rc2/include/linux/init.h	2005-05-01 22:07:56.000000000 +0530
> @@ -217,7 +217,7 @@ void __init parse_early_param(void);
>  #define __initdata_or_module __initdata
>  #endif /*CONFIG_MODULES*/
>  
> -#ifdef CONFIG_HOTPLUG
> +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
>  #define __devinit
>  #define __devinitdata
>  #define __devexit

This looks just plain wrong.  Why do you need this?  It doesn't seem that
arch_init_sched_domains() and/or update_sched_domains() are called from
anywhere that is cpuset related, so why the #ifdef CONFIG_CPUSETS?


> diff -Naurp linux-2.6.12-rc2.orig/kernel/sched.c linux-2.6.12-rc2/kernel/sched.c
> --- linux-2.6.12-rc2.orig/kernel/sched.c	2005-04-28 18:24:11.000000000 +0530
> +++ linux-2.6.12-rc2/kernel/sched.c	2005-05-01 22:06:55.000000000 +0530
> @@ -4526,7 +4526,7 @@ int __init migration_init(void)
>  #endif
>  
>  #ifdef CONFIG_SMP
> -#define SCHED_DOMAIN_DEBUG
> +#undef SCHED_DOMAIN_DEBUG
>  #ifdef SCHED_DOMAIN_DEBUG
>  static void sched_domain_debug(struct sched_domain *sd, int cpu)
>  {

Is this just to quiet boot output for your testing?  Is there a better reason
you're turning this off?  It seems unrelated to the rest of your patch.


> ------------------------------------------------------------------------
> 
> diff -Naurp linux-2.6.12-rc2.orig/kernel/cpuset.c linux-2.6.12-rc2/kernel/cpuset.c
> --- linux-2.6.12-rc2.orig/kernel/cpuset.c	2005-04-28 18:24:11.000000000 +0530
> +++ linux-2.6.12-rc2/kernel/cpuset.c	2005-05-01 22:15:06.000000000 +0530
> @@ -602,12 +602,48 @@ static int validate_change(const struct 
>  	return 0;
>  }
>  
> +static void update_cpu_domains(struct cpuset *cur)
> +{
> +	struct cpuset *c, *par = cur->parent;
> +	cpumask_t span1, span2;
> +
> +	if (par == NULL || cpus_empty(cur->cpus_allowed))
> +		return;
> +
> +	/* Get all non-exclusive cpus from parent domain */
> +	span1 = par->cpus_allowed;
> +	list_for_each_entry(c, &par->children, sibling) {
> +		if (is_cpu_exclusive(c))
> +			cpus_andnot(span1, span1, c->cpus_allowed);
> +	}
> +	if (is_removed(cur) || !is_cpu_exclusive(cur)) {
> +		cpus_or(span1, span1, cur->cpus_allowed);
> +		if (cpus_equal(span1, cur->cpus_allowed))
> +			return;
> +		span2 = CPU_MASK_NONE;
> +	}
> +	else {
> +		if (cpus_empty(span1))
> +			return;
> +		span2 = cur->cpus_allowed;
> +		/* If current cpuset has exclusive children, exclude from domain */
> +		list_for_each_entry(c, &cur->children, sibling) {
> +			if (is_cpu_exclusive(c))
> +				cpus_andnot(span2, span2, c->cpus_allowed);
> +		}
> +	}
> +
> +	lock_cpu_hotplug();
> +	rebuild_sched_domains(span1, span2);
> +	unlock_cpu_hotplug();
> +}
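[As a side note for readers of the archive: the cpumask arithmetic in the
quoted update_cpu_domains() can be modelled in plain userspace C.  This is
an illustrative sketch only -- toy_cpuset and toy_update_cpu_domains are
made-up names, an unsigned long bitmask stands in for cpumask_t, flat
arrays stand in for the kernel's children lists, and the early-return
checks and cpu-hotplug locking of the real function are omitted.]

```c
#include <assert.h>

/* Toy stand-in for struct cpuset: bit n of cpus_allowed set means
 * cpu n is in the set. */
struct toy_cpuset {
	unsigned long cpus_allowed;
	int cpu_exclusive;
};

/* Compute span1 and span2 the way the patch does before calling
 * rebuild_sched_domains(span1, span2):
 *   span1 = parent's cpus minus the cpus of every exclusive child of the
 *           parent (note the loop over par->children includes cur itself);
 *   span2 = cur's cpus minus the cpus of cur's exclusive children, or the
 *           empty mask when cur is removed / no longer exclusive, in which
 *           case cur's cpus are folded back into span1 (single domain). */
void toy_update_cpu_domains(unsigned long parent_cpus,
			    const struct toy_cpuset *parent_children, int nr_pc,
			    unsigned long cur_cpus,
			    const struct toy_cpuset *cur_children, int nr_cc,
			    int cur_removed_or_not_exclusive,
			    unsigned long *span1, unsigned long *span2)
{
	int i;

	*span1 = parent_cpus;
	for (i = 0; i < nr_pc; i++)
		if (parent_children[i].cpu_exclusive)
			*span1 &= ~parent_children[i].cpus_allowed;

	if (cur_removed_or_not_exclusive) {
		*span1 |= cur_cpus;	/* fold cur back: one merged domain */
		*span2 = 0;
	} else {
		*span2 = cur_cpus;
		for (i = 0; i < nr_cc; i++)
			if (cur_children[i].cpu_exclusive)
				*span2 &= ~cur_children[i].cpus_allowed;
	}
}
```

For example, with a parent owning cpus 0-3 (mask 0xF) and a single
exclusive child owning cpus 2-3 (mask 0xC), the split yields span1 = 0x3
and span2 = 0xC -- the two disjoint domains described in the patch summary.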

Nitpicky, but span1 and span2 could do with better names.

Otherwise, the patch looks good to me.

-Matt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-03 22:03 ` Matthew Dobson
@ 2005-05-04  0:08   ` Nick Piggin
  2005-05-04  0:28     ` Matthew Dobson
  2005-05-05 13:26   ` Dinakar Guniguntala
  1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2005-05-04  0:08 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: dino, Paul Jackson, Simon Derr, lkml, lse-tech, Dipankar Sarma,
	Andrew Morton

Matthew Dobson wrote:
> Dinakar Guniguntala wrote:
> 
>>+	lock_cpu_hotplug();
>>+	rebuild_sched_domains(span1, span2);
>>+	unlock_cpu_hotplug();
>>+}
> 
> 
> Nitpicky, but span1 and span2 could do with better names.
> 

As could rebuild_sched_domains while we're at it.

partition_disjoint_sched_domains(partition1, partition2);
?

Dunno. That isn't really great, but maybe better? Pretty
long, but it'll only ever be called in one or two places.

-- 
SUSE Labs, Novell Inc.



* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-04  0:08   ` Nick Piggin
@ 2005-05-04  0:28     ` Matthew Dobson
  2005-05-05 13:28       ` Dinakar Guniguntala
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Dobson @ 2005-05-04  0:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: dino, Paul Jackson, Simon Derr, lkml, lse-tech, Dipankar Sarma,
	Andrew Morton

Nick Piggin wrote:
> Matthew Dobson wrote:
> 
>> Dinakar Guniguntala wrote:
>>
>>> +    lock_cpu_hotplug();
>>> +    rebuild_sched_domains(span1, span2);
>>> +    unlock_cpu_hotplug();
>>> +}
>>
>>
>>
>> Nitpicky, but span1 and span2 could do with better names.
>>
> 
> As could rebuild_sched_domains while we're at it.
> 
> partition_disjoint_sched_domains(partition1, partition2);
> ?
> 
> Dunno. That isn't really great, but maybe better? Pretty
> long, but it'll only ever be called in one or two places.

build_disjoint_sched_domains(partition1, partition2)?  Or just
partition_sched_domains(partition1, partition2)?  Partition and disjoint
seem mildly redundant to me, for varying definitions of partition... ;)

-Matt


* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-03 22:03 ` Matthew Dobson
  2005-05-04  0:08   ` Nick Piggin
@ 2005-05-05 13:26   ` Dinakar Guniguntala
  1 sibling, 0 replies; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-05 13:26 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: Paul Jackson, Simon Derr, Nick Piggin, lkml, lse-tech,
	Dipankar Sarma, Andrew Morton

On Tue, May 03, 2005 at 03:03:23PM -0700, Matthew Dobson wrote:
> An interesting feature.  I tried a while ago to get cpusets and
> sched_domains to play nice (nicer?) and didn't have much luck.  It seems
> you're taking a better approach, with smaller patches.  Good luck!

Thanks!  I would very much like to know your findings as far as
memory/node domains are concerned, or are you going to be working on
that?  I don't have any thoughts on it right now.

> > -#ifdef CONFIG_HOTPLUG
> > +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
> >  #define __devinit
> >  #define __devinitdata
> >  #define __devexit
> 
> This looks just plain wrong.  Why do you need this?  It doesn't seem that
> arch_init_sched_domains() and/or update_sched_domains() are called from
> anywhere that is cpuset related, so why the #ifdef CONFIG_CPUSETS?

cpu_attach_domain is defined as __devinit; maybe I need to remove that
instead of adding the #ifdef.

> >  #ifdef CONFIG_SMP
> > -#define SCHED_DOMAIN_DEBUG
> > +#undef SCHED_DOMAIN_DEBUG
> >  #ifdef SCHED_DOMAIN_DEBUG
> >  static void sched_domain_debug(struct sched_domain *sd, int cpu)
> >  {
> 
> Is this just to quiet boot for your testing?  Is there are better reason
> you're turning this off?  It seems unrelated to the rest of your patch.
> 

This gets called from cpu_attach_domain, so with my changes it runs every
time partitioning is done, not only during boot.

Thanks for your review!

	-Dinakar


* Re: [RFC PATCH] Dynamic sched domains (v0.5)
  2005-05-04  0:28     ` Matthew Dobson
@ 2005-05-05 13:28       ` Dinakar Guniguntala
  0 siblings, 0 replies; 17+ messages in thread
From: Dinakar Guniguntala @ 2005-05-05 13:28 UTC (permalink / raw)
  To: Matthew Dobson
  Cc: Nick Piggin, Paul Jackson, Simon Derr, lkml, lse-tech,
	Dipankar Sarma, Andrew Morton

On Tue, May 03, 2005 at 05:28:20PM -0700, Matthew Dobson wrote:
> 
> build_disjoint_sched_domains(partition1, partition2)?  Or just
> partition_sched_domains(partition1, partition2)?  Partition and disjoint
> seem mildly redundant to me, for varying definitions of partition... ;)

partition_sched_domains sounds fine to me, thanks.

	-Dinakar



Thread overview: 17+ messages
2005-05-01 19:09 [RFC PATCH] Dynamic sched domains (v0.5) Dinakar Guniguntala
2005-05-02  9:10 ` Nick Piggin
2005-05-02 17:17   ` Dinakar Guniguntala
2005-05-02  9:44 ` Nick Piggin
2005-05-02 17:16   ` Dinakar Guniguntala
2005-05-02 23:23     ` Nick Piggin
2005-05-03 14:58       ` Dinakar Guniguntala
2005-05-03 15:31         ` Paul Jackson
2005-05-02 18:01 ` Paul Jackson
2005-05-03 14:44   ` Dinakar Guniguntala
2005-05-03 15:21     ` Paul Jackson
2005-05-03 15:24     ` Paul Jackson
2005-05-03 22:03 ` Matthew Dobson
2005-05-04  0:08   ` Nick Piggin
2005-05-04  0:28     ` Matthew Dobson
2005-05-05 13:28       ` Dinakar Guniguntala
2005-05-05 13:26   ` Dinakar Guniguntala
