public inbox for linux-kernel@vger.kernel.org
* [RFC] cpuset:  Remove sched domain hooks from cpusets
@ 2006-10-30 21:26 Dinakar Guniguntala
  2006-10-30 21:29 ` [RFC] cpuset: Explicit dynamic sched domain cpuset flag Dinakar Guniguntala
  0 siblings, 1 reply; 7+ messages in thread
From: Dinakar Guniguntala @ 2006-10-30 21:26 UTC (permalink / raw)
  To: Paul Jackson
  Cc: nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

[-- Attachment #1: Type: text/plain, Size: 938 bytes --]

Hi,

Remove the cpuset hooks that defined sched domains depending on the
setting of the 'cpu_exclusive' flag.

This patch is similar to the one Paul Jackson sent earlier, except that,
since I am also attaching the alternative implementation in my next mail,
I didn't see the need to remove the API from sched.c

The existing cpuset code that partitions sched domains on the back
of an exclusive cpuset has one major problem. Administrators
will find that tasks assigned to a top-level cpuset that contains
exclusive child cpusets can no longer be rebalanced across
the entire cpus_allowed mask. It was felt that instead of overloading
the cpu_exclusive flag to also create sched domains, it would be
better to have a separate flag that denotes a sched domain. That
way admins have the flexibility to create exclusive cpusets
that do not necessarily define sched domains.

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>


[-- Attachment #2: pj-rem.patch --]
[-- Type: text/plain, Size: 4076 bytes --]

Index: latest/kernel/cpuset.c
===================================================================
--- latest.orig/kernel/cpuset.c	2006-10-29 16:46:07.000000000 +0530
+++ latest/kernel/cpuset.c	2006-10-29 16:46:30.000000000 +0530
@@ -754,68 +754,13 @@
 }
 
 /*
- * For a given cpuset cur, partition the system as follows
- * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
- *    exclusive child cpusets
- * b. All cpus in the current cpuset's cpus_allowed that are not part of any
- *    exclusive child cpusets
- * Build these two partitions by calling partition_sched_domains
- *
- * Call with manage_mutex held.  May nest a call to the
- * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
- * Must not be called holding callback_mutex, because we must
- * not call lock_cpu_hotplug() while holding callback_mutex.
- */
-
-static void update_cpu_domains(struct cpuset *cur)
-{
-	struct cpuset *c, *par = cur->parent;
-	cpumask_t pspan, cspan;
-
-	if (par == NULL || cpus_empty(cur->cpus_allowed))
-		return;
-
-	/*
-	 * Get all cpus from parent's cpus_allowed not part of exclusive
-	 * children
-	 */
-	pspan = par->cpus_allowed;
-	list_for_each_entry(c, &par->children, sibling) {
-		if (is_cpu_exclusive(c))
-			cpus_andnot(pspan, pspan, c->cpus_allowed);
-	}
-	if (!is_cpu_exclusive(cur)) {
-		cpus_or(pspan, pspan, cur->cpus_allowed);
-		if (cpus_equal(pspan, cur->cpus_allowed))
-			return;
-		cspan = CPU_MASK_NONE;
-	} else {
-		if (cpus_empty(pspan))
-			return;
-		cspan = cur->cpus_allowed;
-		/*
-		 * Get all cpus from current cpuset's cpus_allowed not part
-		 * of exclusive children
-		 */
-		list_for_each_entry(c, &cur->children, sibling) {
-			if (is_cpu_exclusive(c))
-				cpus_andnot(cspan, cspan, c->cpus_allowed);
-		}
-	}
-
-	lock_cpu_hotplug();
-	partition_sched_domains(&pspan, &cspan);
-	unlock_cpu_hotplug();
-}
-
-/*
  * Call with manage_mutex held.  May take callback_mutex during call.
  */
 
 static int update_cpumask(struct cpuset *cs, char *buf)
 {
 	struct cpuset trialcs;
-	int retval, cpus_unchanged;
+	int retval;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
 	if (cs == &top_cpuset)
@@ -831,12 +776,9 @@
 	retval = validate_change(cs, &trialcs);
 	if (retval < 0)
 		return retval;
-	cpus_unchanged = cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
 	mutex_lock(&callback_mutex);
 	cs->cpus_allowed = trialcs.cpus_allowed;
 	mutex_unlock(&callback_mutex);
-	if (is_cpu_exclusive(cs) && !cpus_unchanged)
-		update_cpu_domains(cs);
 	return 0;
 }
 
@@ -1046,7 +988,7 @@
 {
 	int turning_on;
 	struct cpuset trialcs;
-	int err, cpu_exclusive_changed;
+	int err;
 
 	turning_on = (simple_strtoul(buf, NULL, 10) != 0);
 
@@ -1059,14 +1001,10 @@
 	err = validate_change(cs, &trialcs);
 	if (err < 0)
 		return err;
-	cpu_exclusive_changed =
-		(is_cpu_exclusive(cs) != is_cpu_exclusive(&trialcs));
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs.flags;
 	mutex_unlock(&callback_mutex);
 
-	if (cpu_exclusive_changed)
-                update_cpu_domains(cs);
 	return 0;
 }
 
@@ -1930,17 +1868,6 @@
 	return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
 }
 
-/*
- * Locking note on the strange update_flag() call below:
- *
- * If the cpuset being removed is marked cpu_exclusive, then simulate
- * turning cpu_exclusive off, which will call update_cpu_domains().
- * The lock_cpu_hotplug() call in update_cpu_domains() must not be
- * made while holding callback_mutex.  Elsewhere the kernel nests
- * callback_mutex inside lock_cpu_hotplug() calls.  So the reverse
- * nesting would risk an ABBA deadlock.
- */
-
 static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cpuset *cs = dentry->d_fsdata;
@@ -1960,13 +1887,6 @@
 		mutex_unlock(&manage_mutex);
 		return -EBUSY;
 	}
-	if (is_cpu_exclusive(cs)) {
-		int retval = update_flag(CS_CPU_EXCLUSIVE, cs, "0");
-		if (retval < 0) {
-			mutex_unlock(&manage_mutex);
-			return retval;
-		}
-	}
 	parent = cs->parent;
 	mutex_lock(&callback_mutex);
 	set_bit(CS_REMOVED, &cs->flags);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC] cpuset:  Explicit dynamic sched domain cpuset flag
  2006-10-30 21:26 [RFC] cpuset: Remove sched domain hooks from cpusets Dinakar Guniguntala
@ 2006-10-30 21:29 ` Dinakar Guniguntala
  2006-10-31 14:43   ` Paul Jackson
  0 siblings, 1 reply; 7+ messages in thread
From: Dinakar Guniguntala @ 2006-10-30 21:29 UTC (permalink / raw)
  To: Paul Jackson
  Cc: nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

[-- Attachment #1: Type: text/plain, Size: 1370 bytes --]

Hi,

The existing cpuset code that partitions sched domains on the back
of an exclusive cpuset has one major problem. Administrators
will find that tasks assigned to a top-level cpuset that contains
exclusive child cpusets can no longer be rebalanced across
the entire cpus_allowed mask. It was felt that instead of overloading
the cpu_exclusive flag to also create sched domains, it would be
better to have a separate flag that denotes a sched domain. That
way admins have the flexibility to create exclusive cpusets
that do not necessarily define sched domains.

This patch adds a new flag, sched_domain, which can only be set on a
cpu_exclusive cpuset. If set, the sched domain consists of all
CPUs in the current cpuset that are not part of any exclusive child
cpusets that also define sched domains.

However, there are two additional extensions that may need to be
looked into:
1. There is still no way to query the current sched domain
   configuration of the system on demand from userspace.
   (Apart from turning SCHED_DOMAIN_DEBUG on and checking dmesg)
2. A way to specify a NULL sched domain, i.e. to mark an exclusive
   cpuset that defines a sched domain as one with no load balancing.
   This can be accomplished by adding yet another flag that says
   don't load balance (say no_load_balance)

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>


[-- Attachment #2: add_sd.patch --]
[-- Type: text/plain, Size: 9260 bytes --]

Index: linux-2.6.19-rc2/kernel/cpuset.c
===================================================================
--- linux-2.6.19-rc2.orig/kernel/cpuset.c	2006-10-31 00:56:04.000000000 +0530
+++ linux-2.6.19-rc2/kernel/cpuset.c	2006-10-31 01:12:52.000000000 +0530
@@ -104,6 +104,7 @@
 /* bits in struct cpuset flags field */
 typedef enum {
 	CS_CPU_EXCLUSIVE,
+	CS_SCHED_DOMAIN,
 	CS_MEM_EXCLUSIVE,
 	CS_MEMORY_MIGRATE,
 	CS_REMOVED,
@@ -118,6 +119,11 @@
 	return test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
 }
 
+static inline int is_sched_domain(const struct cpuset *cs)
+{
+	return test_bit(CS_SCHED_DOMAIN, &cs->flags);
+}
+
 static inline int is_mem_exclusive(const struct cpuset *cs)
 {
 	return test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
@@ -170,7 +176,7 @@
 static int cpuset_mems_generation;
 
 static struct cpuset top_cpuset = {
-	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
+	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_SCHED_DOMAIN) | (1 << CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
 	.count = ATOMIC_INIT(0),
@@ -695,6 +701,7 @@
 	return	cpus_subset(p->cpus_allowed, q->cpus_allowed) &&
 		nodes_subset(p->mems_allowed, q->mems_allowed) &&
 		is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+		is_sched_domain(p) <= is_sched_domain(q) &&
 		is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
@@ -738,6 +745,10 @@
 	if (!is_cpuset_subset(trial, par))
 		return -EACCES;
 
+	/* We cannot define a sched domain if we are not also exclusive */
+	if (is_sched_domain(trial) && !is_cpu_exclusive(trial))
+		return -EINVAL;
+
 	/* If either I or some sibling (!= me) is exclusive, we can't overlap */
 	list_for_each_entry(c, &par->children, sibling) {
 		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
@@ -754,13 +765,67 @@
 }
 
 /*
+ * For a given cpuset cur, partition the system as follows
+ * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
+ *    child cpusets defining sched domain
+ * b. All cpus in the current cpuset's cpus_allowed that are not part of any
+ *    child cpusets defining sched domain
+ * Build these two partitions by calling partition_sched_domains
+ *
+ * Call with manage_mutex held.  May nest a call to the
+ * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
+ * Must not be called holding callback_mutex, because we must
+ * not call lock_cpu_hotplug() while holding callback_mutex.
+ */
+
+static void update_sched_domains(struct cpuset *cur)
+{
+	struct cpuset *c, *par = cur->parent;
+	cpumask_t pspan, cspan;
+
+	if (par == NULL || cpus_empty(cur->cpus_allowed))
+		return;
+
+	/*
+	 * Get all cpus from parent's cpus_allowed not part of exclusive
+	 * children
+	 */
+	pspan = par->cpus_allowed;
+	list_for_each_entry(c, &par->children, sibling) {
+		if (is_sched_domain(c))
+			cpus_andnot(pspan, pspan, c->cpus_allowed);
+	}
+	if (!is_sched_domain(cur)) {
+		if (cpus_equal(pspan, cur->cpus_allowed))
+			return;
+		cspan = CPU_MASK_NONE;
+	} else {
+		if (cpus_empty(pspan))
+			return;
+		cspan = cur->cpus_allowed;
+		/*
+		 * Get all cpus from current cpuset's cpus_allowed not part
+		 * of exclusive children
+		 */
+		list_for_each_entry(c, &cur->children, sibling) {
+			if (is_sched_domain(c))
+				cpus_andnot(cspan, cspan, c->cpus_allowed);
+		}
+	}
+
+	lock_cpu_hotplug();
+	partition_sched_domains(&pspan, &cspan);
+	unlock_cpu_hotplug();
+}
+
+/*
  * Call with manage_mutex held.  May take callback_mutex during call.
  */
 
 static int update_cpumask(struct cpuset *cs, char *buf)
 {
 	struct cpuset trialcs;
-	int retval;
+	int retval, cpus_unchanged;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
 	if (cs == &top_cpuset)
@@ -776,9 +841,13 @@
 	retval = validate_change(cs, &trialcs);
 	if (retval < 0)
 		return retval;
+	cpus_unchanged = cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
 	mutex_lock(&callback_mutex);
 	cs->cpus_allowed = trialcs.cpus_allowed;
 	mutex_unlock(&callback_mutex);
+	if (is_sched_domain(cs) && !cpus_unchanged)
+		update_sched_domains(cs);
+
 	return 0;
 }
 
@@ -988,7 +1057,7 @@
 {
 	int turning_on;
 	struct cpuset trialcs;
-	int err;
+	int err, sched_domain_changed;
 
 	turning_on = (simple_strtoul(buf, NULL, 10) != 0);
 
@@ -1005,6 +1074,11 @@
 	cs->flags = trialcs.flags;
 	mutex_unlock(&callback_mutex);
 
+	sched_domain_changed =
+		(is_sched_domain(cs) != is_sched_domain(&trialcs));
+	if (sched_domain_changed)
+		update_sched_domains(cs);
+
 	return 0;
 }
 
@@ -1209,6 +1283,7 @@
 	FILE_CPULIST,
 	FILE_MEMLIST,
 	FILE_CPU_EXCLUSIVE,
+	FILE_SCHED_DOMAIN,
 	FILE_MEM_EXCLUSIVE,
 	FILE_NOTIFY_ON_RELEASE,
 	FILE_MEMORY_PRESSURE_ENABLED,
@@ -1259,6 +1334,9 @@
 	case FILE_CPU_EXCLUSIVE:
 		retval = update_flag(CS_CPU_EXCLUSIVE, cs, buffer);
 		break;
+	case FILE_SCHED_DOMAIN:
+		retval = update_flag(CS_SCHED_DOMAIN, cs, buffer);
+		break;
 	case FILE_MEM_EXCLUSIVE:
 		retval = update_flag(CS_MEM_EXCLUSIVE, cs, buffer);
 		break;
@@ -1376,6 +1454,9 @@
 	case FILE_CPU_EXCLUSIVE:
 		*s++ = is_cpu_exclusive(cs) ? '1' : '0';
 		break;
+	case FILE_SCHED_DOMAIN:
+		*s++ = is_sched_domain(cs) ? '1' : '0';
+		break;
 	case FILE_MEM_EXCLUSIVE:
 		*s++ = is_mem_exclusive(cs) ? '1' : '0';
 		break;
@@ -1735,6 +1816,11 @@
 	.private = FILE_CPU_EXCLUSIVE,
 };
 
+static struct cftype cft_sched_domain = {
+	.name = "sched_domain",
+	.private = FILE_SCHED_DOMAIN,
+};
+
 static struct cftype cft_mem_exclusive = {
 	.name = "mem_exclusive",
 	.private = FILE_MEM_EXCLUSIVE,
@@ -1780,6 +1866,8 @@
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_cpu_exclusive)) < 0)
 		return err;
+	if ((err = cpuset_add_file(cs_dentry, &cft_sched_domain)) < 0)
+		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_mem_exclusive)) < 0)
 		return err;
 	if ((err = cpuset_add_file(cs_dentry, &cft_notify_on_release)) < 0)
@@ -1868,6 +1956,17 @@
 	return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
 }
 
+/*
+ * Locking note on the strange update_flag() call below:
+ *
+ * If the cpuset being removed is marked cpu_exclusive, then simulate
+ * turning cpu_exclusive off, which will call update_cpu_domains().
+ * The lock_cpu_hotplug() call in update_cpu_domains() must not be
+ * made while holding callback_mutex.  Elsewhere the kernel nests
+ * callback_mutex inside lock_cpu_hotplug() calls.  So the reverse
+ * nesting would risk an ABBA deadlock.
+ */
+
 static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
 	struct cpuset *cs = dentry->d_fsdata;
@@ -1887,6 +1986,13 @@
 		mutex_unlock(&manage_mutex);
 		return -EBUSY;
 	}
+	if (is_sched_domain(cs)) {
+		int retval = update_flag(CS_SCHED_DOMAIN, cs, "0");
+		if (retval < 0) {
+			mutex_unlock(&manage_mutex);
+			return retval;
+		}
+	}
 	parent = cs->parent;
 	mutex_lock(&callback_mutex);
 	set_bit(CS_REMOVED, &cs->flags);
Index: linux-2.6.19-rc2/Documentation/cpusets.txt
===================================================================
--- linux-2.6.19-rc2.orig/Documentation/cpusets.txt	2006-10-31 00:55:02.000000000 +0530
+++ linux-2.6.19-rc2/Documentation/cpusets.txt	2006-10-31 02:23:23.000000000 +0530
@@ -86,7 +86,7 @@
       and a database), or
     * NUMA systems running large HPC applications with demanding
       performance characteristics.
-    * Also cpu_exclusive cpusets are useful for servers running orthogonal
+    * Also sched_domain cpusets are useful for servers running orthogonal
       workloads such as RT applications requiring low latency and HPC
       applications that are throughput sensitive
 
@@ -131,8 +131,8 @@
  - A cpuset may be marked exclusive, which ensures that no other
    cpuset (except direct ancestors and descendents) may contain
    any overlapping CPUs or Memory Nodes.
-   Also a cpu_exclusive cpuset would be associated with a sched
-   domain.
+ - Also a cpu_exclusive cpuset would be associated with a sched
+   domain, if the sched_domain flag is turned on.
  - You can list all the tasks (by pid) attached to any cpuset.
 
 The implementation of cpusets requires a few, simple hooks
@@ -177,6 +177,7 @@
  - mems: list of Memory Nodes in that cpuset
  - memory_migrate flag: if set, move pages to cpusets nodes
  - cpu_exclusive flag: is cpu placement exclusive?
+ - sched_domain flag: if turned on, cpusets 'cpus' define a sched domain
  - mem_exclusive flag: is memory placement exclusive?
  - tasks: list of tasks (by pid) attached to that cpuset
  - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
@@ -231,9 +232,10 @@
 a direct ancestor or descendent, may share any of the same CPUs or
 Memory Nodes.
 
-A cpuset that is cpu_exclusive has a scheduler (sched) domain
-associated with it.  The sched domain consists of all CPUs in the
-current cpuset that are not part of any exclusive child cpusets.
+A cpuset that is cpu_exclusive can be used to define a scheduler
+(sched) domain if the sched_domain flag is turned on. The sched domain
+consists of all CPUs in the current cpuset that are not part of any
+exclusive child cpusets that also define sched domains.
 This ensures that the scheduler load balancing code only balances
 against the CPUs that are in the sched domain as defined above and
 not all of the CPUs in the system. This removes any overhead due to


* Re: [RFC] cpuset:  Explicit dynamic sched domain cpuset flag
  2006-10-30 21:29 ` [RFC] cpuset: Explicit dynamic sched domain cpuset flag Dinakar Guniguntala
@ 2006-10-31 14:43   ` Paul Jackson
  2006-10-31 23:58     ` Paul Jackson
  2006-11-08 10:38     ` Paul Jackson
  0 siblings, 2 replies; 7+ messages in thread
From: Paul Jackson @ 2006-10-31 14:43 UTC (permalink / raw)
  To: dino
  Cc: nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

Thanks for doing this, Dinakar.

A couple of million messages earlier in this discussion, at its
beginning, I submitted a patch on October 18:

  Cpuset: explicit dynamic sched domain control flags
  http://lkml.org/lkml/2006/10/18/48

It too had a 'sched_domain' flag.  Some of my comments will
examine the differences between that patch and yours.

Lets call my old patch P1, and this current patch of yours P2.

Now comments on your patch, P2:

1) P1 also had a sched_domain_enabled flag, per-cpuset, to toggle
   between the old cpu_exclusive and the new sched_domain flag
   for defining sched domains.  Since then I threw out the old
   behaviour, as not worth preserving, and not creating a sufficient
   incompatibility that we needed to preserve the old way.

   So I agree with P2, not providing this sched_domain_enabled flag.

2) P1 had a rule that sibling cpusets marked 'sched_domain' could
   not overlap.  P2 has a rule that cpusets marked 'sched_domain'
   must also be marked 'cpu_exclusive.'

   I suspect I prefer P1 here.  That rule seems easier for users to
   code to.  It doesn't entangle these two flags.  But I am not sure
   I understand this detail correctly.
   
   I think it should be ok to have two overlapping sibling cpusets,
   one of which is marked as a sched domain and one of which is not.
   
   Ah - no - I'm wrong.  The "real" meaning of the 'sched_domain'
   flag is that cpusets so marked might *not* be sched domains!  This
   'sched_domain' name is bad; it is the reverse of what it should be.
   
   The 'sched_domain' flag name should be some variation of, say:
   
   	sched_ok_not_to_load_balance	# default to off (0)

   It is exactly the cpusets so marked which the user is allowing
   the kernel to carve up into sched domain partitions, disabling
   load balancing between these isolated partitions.

   And then this rule that cpusets marked with this flag must also
   be marked cpu_exclusive ensures that we don't have sibling cpusets
   that do require load balancing partially overlapping with cpusets
   that do not require it.  Such overlap would result in tasks stuck
   in no-man's land, expecting load balancing but stuck on a cpu
   that some other cpuset decreed need not be entirely load balanced.

3) In any case, this flag name must change.  It names the wrong thing,
   it's backwards, and it's vague.
   
   (For the record - looks like I introduced the name 'sched_domain'
   in P1.  -pj)
   
    * This flag's name must focus on whether load balancing is required.
      That's the publicly exposed effect of this flag.
      
      The construction of dynamic sched domains and partitioning them
      into isolated domains is an internal implementation detail,
      and as I guess Nick has been figuring for a while, should remain
      the private business of the kernel.
      
      The deal the kernel is offering the user is simply:

       	    Let us selectively disable load balancing, and
	    in turn we will give you better performance.
      
    * This flag's name must correctly suggest the sense of setting it.
      To suggest in the name that we are setting up a sched domain when
      in fact we are using this flag to allow ripping sched domains
      asunder is backwards.

4) The following wording in P2 (and in earlier versions) is not clear to me:

  * For a given cpuset cur, partition the system as follows
  * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
  *    child cpusets defining sched domain
  * b. All cpus in the current cpuset's cpus_allowed that are not part of any
  *    child cpusets defining sched domain
  * Build these two partitions by calling partition_sched_domains

   The first reference to 'child cpusets' confuses me - child of the
   parent (aka the current cpuset's siblings) or child of the current
   cpuset?  And more essentially, the invariants are not clearly
   spelled out.  Granted, I can be dense at times.  But I have
   literally spent hours puzzling over the meaning of these lines,
   and the associated code, and I'm still figuring it out.

   How about this wording, using the flag renaming from (2) above:


  * By default, the kernel scheduler load balances across all CPUs
  * in the system.  This can get expensive on high CPU count systems.
  *
  * The per-cpuset flag 'sched_ok_not_to_load_balance' can be set by
  * the user to reduce this load balancing performance penalty.
  * If set in a cpuset, this tells the kernel it is ok not to load
  * balance tasks in that cpuset.  If such a cpuset overlaps other
  * cpusets that still require load balancing, the kernel may decide
  * to load balance these CPUs anyway - that's the kernel's choice.
  *
  * The kernel implements this by partitioning the CPUs in the system
  * into as many separate, isolated scheduler domains as it can, and
  * still avoid dividing any cpuset that -does- require load balancing
  * across two or more such partitions.
  *
  * A 'partition' P of a set S is a set of subsets of S, such that:
  *  1) the union of the members of P equals S,
  *  2) no two members of P overlap each other, and
  *  3) no member of P is empty.
  *
  * Marking a cpuset 'sched_ok_not_to_load_balance' grants the kernel
  * permission to partition the CPUs in that cpuset across multiple
  * isolated sched domains.  No load balancing is done between such
  * isolated domains.
  *
  * If 'C' is a cpuset marked 'sched_ok_not_to_load_balance',
  * then its set of CPUs are either a member of this partition,
  * or equal to the union of multiple members of this partition,
  * including the members corresponding to the cpuset descendents of
  * 'C' which are marked sched_ok_not_to_load_balance.
  *
  * If 'C' is a cpuset marked 'sched_ok_not_to_load_balance', and if
  * some of its child cpusets are also marked, and if there are any
  * left over CPUs in 'C' that are not in any of those child cpusets
  * so marked, then these left over cpus form a separate member of
  * the sched domain partition.
  *
  * Because we gather up such left over cpus to form another partition
  * member, therefore whenever we change this flag on a cpuset, we
  * have to recompute the partition for the parent of the affected
  * cpuset as well as for the affected cpuset itself.
  *
  * Similarly, if a cpuset that is so marked has child cpusets also
  * marked, and if it has CPUs added or removed, then we have to
  * recompute the partition for its child cpusets, as the left over
  * CPUs will have changed.

5) The above remarks just try to clarify what is with comments and
   name changes.
   
   There is one remaining capability that is missing.
   
   Batch schedulers, for example, can end up with multiple overlapping
   sibling cpusets, for a mix of active and inactive jobs.  At any point
   in time the active cpusets don't overlap, and to improve performance,
   the batch scheduler would like to partition the sched domains along
   the lines of its active jobs.
   
   For this to work, the sched_ok_not_to_load_balance flag has to become
   advisory rather than mandatory.
   
   That is, instead of imposing the rules up front:
     * must be cpu_exclusive, and
     * must have parent set too,
   rather let any cpuset be marked sched_ok_not_to_load_balance,
   and then discover the finest grained sched domain partitioning
   consistent with those markings.  This discovery could be done in
   an N**2 algorithm, where N is the number of cpusets, by scanning
   over all cpusets, clustering any overlapping cpusets that require
   load balancing (don't have sched_ok_not_to_load_balance set.)
   
   Currently, anytime this flag, or the CPUs in a cpuset so marked,
   change, we immediately drive a new partitioning, along those lines.
   
   Instead, anytime the set of cpusets requiring load balancing
   (the cpusets with sched_ok_not_to_load_balance not set) changes,
   adding or removing such a cpuset or CPUs to or from such a cpuset,
   we should rediscover the partitioning, clustering into the same
   partition member any overlapping cpusets requiring balancing.

6) Now for some nits:

    @@ -1005,6 +1074,11 @@
 	    cs->flags = trialcs.flags;
 	    mutex_unlock(&callback_mutex);

    +	sched_domain_changed =
    +		(is_sched_domain(cs) != is_sched_domain(&trialcs));

   This looks borked.  You are looking for differences in the
   flags of cs and trialcs, just after the assignment making
   them the same.  I predict that "sched_domain_changed" will
   always be False.  You probably have to set sched_domain_changed
   earlier in the code.
   
7) The patch would be slightly easier to read if done with a "diff -p"
   option, displaying the procedure name for each diff chunk.

8) Looks like a cut+paste comment from earlier code did not change
   some mentions of cpu_exclusive to sched_domain:
   
    + * If the cpuset being removed is marked cpu_exclusive, then simulate
    + * turning cpu_exclusive off, which will call update_cpu_domains().

   Also:
   
    + - Also a cpu_exclusive cpuset would be associated with a sched
    +   domain, if the sched_domain flag is turned on.

   And:
   
    +A cpuset that is cpu_exclusive can be used to define a scheduler
    +(sched) domain if the sched_domain flag is turned on. The sched domain
    +consists of all CPUs in the current cpuset that are not part of any
    +exclusive child cpusets that also define sched domains.

9) Finally, we really do need a way, on a production system, for user
   space to ask the kernel where load balancing is limited.
   
   For example one possible interface would have user space pass a
   cpumask to the kernel, and get back a Boolean value, indicating
   whether or not there are any limitations on load balancing between
   any two CPUs specified in that cpumask.  Internally, the kernel
   would answer this by seeing if the provided mask was a subset of
   one of the partition members, or not.
   
   A performance aware batch scheduler, for instance, could use this
   API to verify that load balancing really was limited on its top
   level and other large, inactive, cpusets.
   
-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] cpuset:  Explicit dynamic sched domain cpuset flag
  2006-10-31 14:43   ` Paul Jackson
@ 2006-10-31 23:58     ` Paul Jackson
  2006-11-08 10:38     ` Paul Jackson
  1 sibling, 0 replies; 7+ messages in thread
From: Paul Jackson @ 2006-10-31 23:58 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dino, nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

pj wrote:
> 9) Finally, we really do need a way, on a production system, for user
>    space to ask the kernel where load balancing is limited.
>    
>    For example one possible interface would have user space pass a
>    cpumask to the kernel, and get back a Boolean value, indicating
>    whether or not there are any limitations on load balancing between
>    any two CPUs specified in that cpumask.

Ah - a simpler API, more user friendly, more "cpuset API style"
friendly:

    A read-only, per-cpuset Boolean flag that indicates whether the
    scheduler is fully load balancing across all 'cpus' of that cpuset.

    Internally, the kernel would answer this by seeing whether or
    not the cpuset's cpus_allowed cpumask was a subset of one of the
    members of the scheduler domains partition.

Call this per-cpuset flag something such as:

    sched_is_fully_load_balanced	# read-only Boolean

This goes along with having the control flag named something such as:

    sched_ok_not_to_load_balance	# read-write Boolean

If a task was in a cpuset that had 'sched_is_fully_load_balanced'
False, then it might not get load balanced.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] cpuset:  Explicit dynamic sched domain cpuset flag
  2006-10-31 14:43   ` Paul Jackson
  2006-10-31 23:58     ` Paul Jackson
@ 2006-11-08 10:38     ` Paul Jackson
  2006-11-08 18:23       ` Dinakar Guniguntala
  1 sibling, 1 reply; 7+ messages in thread
From: Paul Jackson @ 2006-11-08 10:38 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dino, nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

Dinakar,

Where do we stand on this patch?

Last I knew, as of a week ago:

  * I had (still have) a patch in *-mm to nuke the old connection
    between the cpu_exclusive flag and sched domain partitioning:
	cpuset-remove-sched-domain-hooks-from-cpusets.patch
  * and you have this patch posted on lkml, with some non-trivial
    comments from myself, to provide a new 'sched_domain' per-cpuset
    flag to control sched domain partitioning.

Ideally, we'd agree on this new 'sched_domain' (or whatever we call it)
flag, so that my patch to remove the old hooks could travel to 2.6.20
along with this present patch to provide new and improved hooks.

However ... I need to focus on some other stuff for roughly four
weeks, so can't focus on pushing this effort along right now.

My guess is that I will end up asking Andrew to hold the above
named "remove ... hooks" patch in *-mm until you and I get our
act together on the replacement, which most likely will mean he
holds it until we start work on what will become 2.6.21.

Do you see any better choices?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] cpuset:  Explicit dynamic sched domain cpuset flag
  2006-11-08 10:38     ` Paul Jackson
@ 2006-11-08 18:23       ` Dinakar Guniguntala
  2006-11-08 21:22         ` Paul Jackson
  0 siblings, 1 reply; 7+ messages in thread
From: Dinakar Guniguntala @ 2006-11-08 18:23 UTC (permalink / raw)
  To: Paul Jackson
  Cc: nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

On Wed, Nov 08, 2006 at 02:38:36AM -0800, Paul Jackson wrote:
> Dinakar,
> 
> Where do we stand on this patch?
> 
> Last I knew, as of a week ago:
> 
>   * I had (still have) a patch in *-mm to nuke the old connection
>     between the cpu_exclusive flag and sched domain partitioning:
> 	cpuset-remove-sched-domain-hooks-from-cpusets.patch
>   * and you have this patch posted on lkml, with some non-trivial
>     comments from myself, to provide a new 'sched_domain' per-cpuset
>     flag to control sched domain partitioning.
> 
> Ideally, we'd agree on this new 'sched_domain' (or whatever we call it)
> flag, so that my patch to remove the old hooks could travel to 2.6.20
> along with this present patch to provide new and improved hooks.
> 
> However ... I need to focus on some other stuff for roughly four
> weeks, so can't focus on pushing this effort along right now.
> 
> My guess is that I will end up asking Andrew to hold the above
> named "remove ... hooks" patch in *-mm until you and I get our
> act together on the replacement, which most likely will mean he
> holds it until we start work on what will become 2.6.21.
> 
> Do you see any better choices?

Paul, I got busy on my end too and hope to work on it next week.
I guess I'll work on it with your suggestions and post it as soon
as I can. You can take a look at them when you are free.
Thank you for your patience.

	-Dinakar


* Re: [RFC] cpuset:  Explicit dynamic sched domain cpuset flag
  2006-11-08 18:23       ` Dinakar Guniguntala
@ 2006-11-08 21:22         ` Paul Jackson
  0 siblings, 0 replies; 7+ messages in thread
From: Paul Jackson @ 2006-11-08 21:22 UTC (permalink / raw)
  To: dino
  Cc: nickpiggin, akpm, mbligh, menage, Simon.Derr, linux-kernel,
	rohitseth, holt, dipankar, suresh.b.siddha

> Paul, I got busy on my end too and hope to work on it next week.
> I guess I'll work on it with your suggestions and post it as soon
> as I can. You can take a look at them when you are free.
> Thank you for the patience

Sounds good.  Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
