[PATCH 1/1] cpusets/sched_domain reconciliation

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/1] cpusets/sched_domain reconciliation
@ 2007-03-20 19:14 Cliff Wickman
  2007-03-22 21:21 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Cliff Wickman @ 2007-03-20 19:14 UTC (permalink / raw)
  To: linux-kernel


From: Cliff Wickman <cpw@sgi.com>

This patch reconciles cpusets and sched_domains that get out of sync
due to disabling and re-enabling of cpu's.

Dinakar Guniguntala (IBM) is working on his own version of fixing this.
But as of this date that fix doesn't seem to be ready.

Here is an example of how the problem can occur:

   system of cpu's 0-31
   create cpuset /x  16-31
   create cpuset /x/y  16-23
   all cpu_exclusive

   disable cpu 17
     x is now    16,18-31
     x/y is now 16,18-23
   enable cpu 17
     x and x/y are unchanged

   to restore the cpusets:
     echo 16-31 > /dev/cpuset/x
     echo 16-23 > /dev/cpuset/x/y

   At the first echo, update_cpu_domains() is called for cpuset x/.

   The system is partitioned between:
	its parent, the root cpuset of 0-31, minus its
				    children (x/ is 16-31): 0-15
        and x/ (16-31), minus its children (x/y/ 16,18-23): 17,24-31

   The sched_domain's for parent 0-15 are updated.
   The sched_domain's for current 17,24-31 are updated.

   But 16 has been untouched.
   As a result, 17's SD points to sched_group_phys[17] which is the only
   sched_group_phys on 17's list.  It points to itself.
   But 16's SD points to sched_group_phys[16], which still points to
   sched_group_phys[17].
   When cpu 16 executes find_busiest_group() it will hang on the non-
   circular sched_group list.
           
This solution is to update the sched_domain's for the cpuset
whose cpu's were changed and, in addition, all its children.
The update_cpu_domains() will end with a (recursive) call to itself
for each child.
The extra sched_domain reconstruction is overhead, but only at the
frequency of administrative change to the cpusets.

This patch also includes checks in find_busiest_group() and
find_idlest_group() that break from their loops on a sched_group that
points to itself.  This is needed because other cpu's are going through
load balancing while the sched_domains are being reconstructed.

There seems to be no administrative procedural work-around.  In the
example above one could not reverse the two echo's and set x/y before
x/.  It is not logical, so not allowed (Permission denied).

Diffed against 2.6.20-rc6

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 kernel/cpuset.c |   11 +++++++++--
 kernel/sched.c  |   19 +++++++++++++++----
 2 files changed, 24 insertions(+), 6 deletions(-)

Index: morton.070205/kernel/sched.c
===================================================================
--- morton.070205.orig/kernel/sched.c
+++ morton.070205/kernel/sched.c
@@ -1201,11 +1201,14 @@ static inline unsigned long cpu_avg_load
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
-	struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *idlest = NULL, *this = sd->groups, *group = sd->groups;
+	struct sched_group *self, *prev;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
+	prev = group;
+	self = group;
 	do {
 		unsigned long load, avg_load;
 		int local_group;
@@ -1241,8 +1244,10 @@ find_idlest_group(struct sched_domain *s
 			idlest = group;
 		}
 nextgroup:
+		prev = self;
+		self = group;
 		group = group->next;
-	} while (group != sd->groups);
+	} while (group != sd->groups && group != self && group != prev);
 
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
@@ -2259,7 +2264,8 @@ find_busiest_group(struct sched_domain *
 		   unsigned long *imbalance, enum idle_type idle, int *sd_idle,
 		   cpumask_t *cpus, int *balance)
 {
-	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *busiest = NULL, *this = sd->groups, *group = sd->groups;
+	struct sched_group *self, *prev;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2282,6 +2288,8 @@ find_busiest_group(struct sched_domain *
 	else
 		load_idx = sd->idle_idx;
 
+	prev = group;
+	self = group;
 	do {
 		unsigned long load, group_capacity;
 		int local_group;
@@ -2410,8 +2418,11 @@ find_busiest_group(struct sched_domain *
 		}
 group_next:
 #endif
+		prev = self;
+		self = group;
 		group = group->next;
-	} while (group != sd->groups);
+		/* careful, a printk here can cause a spinlock hang */
+	} while (group != sd->groups && group != self && group != prev);
 
 	if (!busiest || this_load >= max_load || busiest_nr_running == 0)
 		goto out_balanced;
Index: morton.070205/kernel/cpuset.c
===================================================================
--- morton.070205.orig/kernel/cpuset.c
+++ morton.070205/kernel/cpuset.c
@@ -765,6 +765,8 @@ static int validate_change(const struct 
  * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
  * Must not be called holding callback_mutex, because we must
  * not call lock_cpu_hotplug() while holding callback_mutex.
+ *
+ * Recursive, on depth of cpuset subtree.
  */
 
 static void update_cpu_domains(struct cpuset *cur)
@@ -790,8 +792,8 @@ static void update_cpu_domains(struct cp
 			return;
 		cspan = CPU_MASK_NONE;
 	} else {
-		if (cpus_empty(pspan))
-			return;
+		/* parent may be empty, but update anyway */
+
 		cspan = cur->cpus_allowed;
 		/*
 		 * Get all cpus from current cpuset's cpus_allowed not part
@@ -806,6 +808,11 @@ static void update_cpu_domains(struct cp
 	lock_cpu_hotplug();
 	partition_sched_domains(&pspan, &cspan);
 	unlock_cpu_hotplug();
+
+	/* walk all its children to make sure it's all consistent */
+	list_for_each_entry(c, &cur->children, sibling) {
+		update_cpu_domains(c);
+	}
 }
 
 /*

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-03-20 19:14 Cliff Wickman
@ 2007-03-22 21:21 ` Andrew Morton
  2007-03-22 22:39   ` Cliff Wickman
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2007-03-22 21:21 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: linux-kernel

On Tue, 20 Mar 2007 13:14:35 -0600
cpw@sgi.com (Cliff Wickman) wrote:

> This patch reconciles cpusets and sched_domains that get out of sync
> due to disabling and re-enabling of cpu's.

I get three-out-of-three rejects in cpuset.c.  I could fix them, but I
wouldn't be very confident that the result works at runtime.  2.6.20-rc6 was
a long time ago - please, always raise patches against the latest mainline
kernel (the daily git snapshot suffices).

Recursion is a big no-no in kernel.  Is there any way in which it can be
avoided?  Is Dinakar's implementation also recursive?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-03-22 22:39   ` Cliff Wickman
@ 2007-03-22 22:32     ` Andrew Morton
  2007-03-23  5:07       ` Paul Jackson
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2007-03-22 22:32 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: dino, linux-kernel

On Thu, 22 Mar 2007 16:39:45 -0600
Cliff Wickman <cpw@sgi.com> wrote:

> Hello Andrew,
> 
> On Thu, Mar 22, 2007 at 02:21:52PM -0700, Andrew Morton wrote:
> > On Tue, 20 Mar 2007 13:14:35 -0600
> > cpw@sgi.com (Cliff Wickman) wrote:
> > 
> > > This patch reconciles cpusets and sched_domains that get out of sync
> > > due to disabling and re-enabling of cpu's.
> > 
> > I get three-out-of-three rejects in cpuset.c.  I could fix them, but I
> > wouldn't be very confident that the result works at runtime.  2.6.20-rc6 was
> > a long time ago - please, always raise patches against the latest mainline
> > kernel (the daily git snapshot suffices).
> 
> Will do.
>  
> > Recursion is a big no-no in kernel.  Is there any way in which it can be
> > avoided?  Is Dinakar's implementation also recursive?
> 
> I was a little reluctant to use recursion, but this use parallels another,
> existing such use in cpuset.c  The depth of the recursion is only the depth of
> the cpuset hierarchy, which is set up by an administrator, and which is
> logically limited by the number of cpus in the system.  e.g. it would be
> hard to even deliberately organize 16 cpus into a hierarchy greater
> than 16 layers deep, even if you wanted cpusets of single cpus.
> We've not run into such a problem on systems of hundreds of cpus.
> I would think it's safe.  What do you think?

It isn't very nice.  It probably won't crash, but it _can_ crash and when
it does, the results will be very obscure and people who will be affected
by the crash will be badly $impacted$ by it.

Perhaps as a middle-ground thing we could simply ban the creation of
cpusets hierarchies which are more than <mumble> layers deep.

Or, worse, we could take a peek at the depth of the tree before starting
the recursion, then just fail out if it exceeds <mumble>

Or, worse still, we could allow the recursion to proceed down <mumble>
levels, then refuse to apply the reconciliation any deeper.

Best would be to avoid the recursion ;) lib/radix-tree.c has a similar
problem, and has a possibly-conceptually-applicable solution.  It has a
fixed maximum depth so it uses a local array, but it could use kmalloc()
for the radix_tree_path.

Is there any sane way in which we can perform the recursion in userspace? 
Get the application to walk the hierarchy and do the fixups at each node? 
Probably not...

> Dinakar's solution is not written yet, as far as I know.  I'll copy him
> for his status.

Good idea.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-03-22 21:21 ` Andrew Morton
@ 2007-03-22 22:39   ` Cliff Wickman
  2007-03-22 22:32     ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Cliff Wickman @ 2007-03-22 22:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: dino, linux-kernel

Hello Andrew,

On Thu, Mar 22, 2007 at 02:21:52PM -0700, Andrew Morton wrote:
> On Tue, 20 Mar 2007 13:14:35 -0600
> cpw@sgi.com (Cliff Wickman) wrote:
> 
> > This patch reconciles cpusets and sched_domains that get out of sync
> > due to disabling and re-enabling of cpu's.
> 
> I get three-out-of-three rejects in cpuset.c.  I could fix them, but I
> wouldn't be very confident that the result works at runtime.  2.6.20-rc6 was
> a long time ago - please, always raise patches against the latest mainline
> kernel (the daily git snapshot suffices).

Will do.

> Recursion is a big no-no in kernel.  Is there any way in which it can be
> avoided?  Is Dinakar's implementation also recursive?

I was a little reluctant to use recursion, but this use parallels another,
existing such use in cpuset.c  The depth of the recursion is only the depth of
the cpuset hierarchy, which is set up by an administrator, and which is
logically limited by the number of cpus in the system.  e.g. it would be
hard to even deliberately organize 16 cpus into a hierarchy greater
than 16 layers deep, even if you wanted cpusets of single cpus.
We've not run into such a problem on systems of hundreds of cpus.
I would think it's safe.  What do you think?

Dinakar's solution is not written yet, as far as I know.  I'll copy him
for his status.

-- 
Cliff Wickman
Silicon Graphics, Inc.
cpw@sgi.com
(651) 683-3824

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-03-22 22:32     ` Andrew Morton
@ 2007-03-23  5:07       ` Paul Jackson
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Jackson @ 2007-03-23  5:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cpw, dino, linux-kernel

Andrew wrote:
> It isn't very nice.  It probably won't crash, but it _can_ crash and when

I guess I got lucky, Cliff, when I snuck in the recursion in the
other cpuset.c routines that you were using as an example here ;).

Since the current kernel/cpuset.c recursion seems only to be in code
paths for cpu hot unplug, it's not surprising that no one has actually
hit it enough times so far to motivate them to hunt me down and
exterminate me.  The intersection of the worlds heavy cpuset users
with the worlds heavy cpu unpluggers is a very small set indeed.

Maybe we should do something about this before that set of people grows
to include someone with violent tendencies.

I suppose it would work to set a hard coded limit to how deep one
could make the cpuset hierarchy, and perhaps provide a kernel tunable
for those rare cases where someone needed more than this limit.

We could avoid the recursion here, by converting it to its iterative
equivalent.  This equivlaent would have to keep track in a dynamically
allocator vector the cpuset pointers being worked, and if it got to the
end of that vector, reallocate a longer one.

It's not that much more code - and it's a fairly simple transformation
of a simple recursion on one variable to an iteration using a vector
of that variable.

Since the iterative code, using a dynamically sized vector, probably
doesn't add much more kernel text than the code to limit the depth and
provide for a kernel tunable to tweak the limit, and since the iterative
approach avoids imposing some arbitrary small limit on the user visible
API for what is really a small corner case, the iterative approach seems
like the better idea.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/1] cpusets/sched_domain reconciliation
@ 2007-08-23 22:05 Cliff Wickman
  0 siblings, 0 replies; 11+ messages in thread
From: Cliff Wickman @ 2007-08-23 22:05 UTC (permalink / raw)
  To: linux-mm

This patch reconciles cpusets and sched_domains that get out of sync
due to disabling and re-enabling cpu's.

This is still a problem in the 2.6.23-rc3 kernel.  (and also 2.6.22)

Here is an example of how the problem can occur:

   system of cpu's   0 1 2 3 4 5
   create cpuset /x      2 3 4 5 
   create cpuset /x/y    2 3
   all cpusets are cpu_exclusive

   disable cpu 3
     x is now            2   4 5
     x/y is now          2
   enable cpu 3
     cpusets x and x/y are unchanged

   to restore the cpusets:
     echo 2-5 > /dev/cpuset/x
     echo 2-3 > /dev/cpuset/x/y

   At the first echo, which restores 3 to cpuset x, update_cpu_domains() is
   called for cpuset x/. 
   system of cpu's   0 1 2 3 4 5
   x is now              2 3 4 5
   x/y is now            2

   The system is partitioned between:
	its parent, the root cpuset, minus its child (x/ is 2-5): 0-1
        and x/ (2-5) , minus its child (x/y/ 2): 3-5

   The sched_domain's for parent 0-1 are updated.
   The sched_domain's for current 3-5 are updated.

   But 2 has been untouched.
   As a result, 3's SD points to sched_group_phys[3] which is the only
   sched_group_phys on 3's list.  It points to itself.
   But 2's SD points to sched_group_phys[2], which still points to
   sched_group_phys[3].
   When cpu 2 executes find_busiest_group() it will hang on the non-
   circular sched_group list.
           
cpuset.c:

This solution is to update the sched_domain's for the cpuset
whose cpu's were changed and, in addition, all its children.
Instead of calling update_cpu_domains(), call update_cpu_domains_tree(),
which calls update_cpu_domains() for every node from the one specified
down to all its children.

The extra sched_domain reconstruction is overhead, but only at the
frequency of administrative change to the cpusets.

There seems to be no administrative procedural work-around.  In the
example above one could not reverse the two echo's and set x/y before
x/.  It is not logical, so not allowed (Permission denied).

Thus the patch to cpuset.c makes the sched_domain's correct.

sched.c:

The patch to sched.c prevents the cpu hangs that otherwise occur
until the sched_domain's are made correct.

It puts checks into find_busiest_group() and find_idlest_group()
that break from their loops on a sched_group that points to itself.
This is needed because cpu's are going through load balancing before all
sched_domains have been reconstructed (see the example above).

This is admittedly a kludge. I leave it to the scheduler gurus to recommend
a better way to keep cpus out of the sched_domains while they are
being reconstructed.

Diffed against 2.6.23-rc3

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 kernel/cpuset.c |   43 +++++++++++++++++++++++++++++++++++++++----
 kernel/sched.c  |   18 ++++++++++++++----
 2 files changed, 53 insertions(+), 8 deletions(-)

Index: linus.070821/kernel/sched.c
===================================================================
--- linus.070821.orig/kernel/sched.c
+++ linus.070821/kernel/sched.c
@@ -1196,11 +1196,14 @@ static inline unsigned long cpu_avg_load
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
-	struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *idlest = NULL, *this = sd->groups, *group = sd->groups;
+	struct sched_group *self, *prev;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
+	prev = group;
+	self = group;
 	do {
 		unsigned long load, avg_load;
 		int local_group;
@@ -1237,8 +1240,10 @@ find_idlest_group(struct sched_domain *s
 			idlest = group;
 		}
 nextgroup:
+		prev = self;
+		self = group;
 		group = group->next;
-	} while (group != sd->groups);
+	} while (group != sd->groups && group != self && group != prev);
 
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
@@ -2288,7 +2293,8 @@ find_busiest_group(struct sched_domain *
 		   unsigned long *imbalance, enum cpu_idle_type idle,
 		   int *sd_idle, cpumask_t *cpus, int *balance)
 {
-	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *busiest = NULL, *this = sd->groups, *group = sd->groups;
+	struct sched_group *self, *prev;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2311,6 +2317,8 @@ find_busiest_group(struct sched_domain *
 	else
 		load_idx = sd->idle_idx;
 
+	prev = group;
+	self = group;
 	do {
 		unsigned long load, group_capacity;
 		int local_group;
@@ -2443,8 +2451,10 @@ find_busiest_group(struct sched_domain *
 		}
 group_next:
 #endif
+		prev = self;
+		self = group;
 		group = group->next;
-	} while (group != sd->groups);
+	} while (group != sd->groups && group != self && group != prev);
 
 	if (!busiest || this_load >= max_load || busiest_nr_running == 0)
 		goto out_balanced;
Index: linus.070821/kernel/cpuset.c
===================================================================
--- linus.070821.orig/kernel/cpuset.c
+++ linus.070821/kernel/cpuset.c
@@ -52,6 +52,7 @@
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
 #include <linux/mutex.h>
+#include <linux/kfifo.h>
 
 #define CPUSET_SUPER_MAGIC		0x27e0eb
 
@@ -789,8 +790,8 @@ static void update_cpu_domains(struct cp
 			return;
 		cspan = CPU_MASK_NONE;
 	} else {
-		if (cpus_empty(pspan))
-			return;
+		/* parent may be empty, but update anyway */
+
 		cspan = cur->cpus_allowed;
 		/*
 		 * Get all cpus from current cpuset's cpus_allowed not part
@@ -808,6 +809,40 @@ static void update_cpu_domains(struct cp
 }
 
 /*
+ * Call update_cpu_domains for cpuset "cur", and for all of its children.
+ *
+ * This walk processes the tree from top to bottom, completing one layer
+ * before dropping down to the next.  It always processes a node before
+ * any of its children.
+ *
+ * Call with manage_mutex held.
+ * Must not be called holding callback_mutex, because we must
+ * not call lock_cpu_hotplug() while holding callback_mutex.
+ */
+static void
+update_cpu_domains_tree(struct cpuset *root)
+{
+	struct cpuset *cp;	/* scans cpusets being updated */
+	struct cpuset *child;	/* scans child cpusets of cp */
+	struct kfifo *queue;	/* fifo queue of cpusets to be updated */
+
+	queue = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
+	if (queue == ERR_PTR(-ENOMEM))
+		return;
+
+	__kfifo_put(queue, (unsigned char *)&root, sizeof(root));
+
+	while (__kfifo_get(queue, (unsigned char *)&cp, sizeof(cp))) {
+		list_for_each_entry(child, &cp->children, sibling)
+		    __kfifo_put(queue,(unsigned char *)&child,sizeof(child));
+		update_cpu_domains(cp);
+	}
+
+	kfifo_free(queue);
+	return;
+}
+
+/*
  * Call with manage_mutex held.  May take callback_mutex during call.
  */
 
@@ -846,7 +881,7 @@ static int update_cpumask(struct cpuset 
 	cs->cpus_allowed = trialcs.cpus_allowed;
 	mutex_unlock(&callback_mutex);
 	if (is_cpu_exclusive(cs) && !cpus_unchanged)
-		update_cpu_domains(cs);
+		update_cpu_domains_tree(cs);
 	return 0;
 }
 
@@ -1087,7 +1122,7 @@ static int update_flag(cpuset_flagbits_t
 	mutex_unlock(&callback_mutex);
 
 	if (cpu_exclusive_changed)
-                update_cpu_domains(cs);
+                update_cpu_domains_tree(cs);
 	return 0;
 }
 
-- 
Cliff Wickman
Silicon Graphics, Inc.
cpw@sgi.com
(651) 683-3824

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/1] cpusets/sched_domain reconciliation
@ 2007-09-07 21:07 Cliff Wickman
  2007-09-13 22:46 ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Cliff Wickman @ 2007-09-07 21:07 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm



Re-send of patch sent 8/23/2007, but refreshed for 2.6.23-rc5.

This patch reconciles cpusets and sched_domains that get out of sync
due to disabling and re-enabling cpu's.

This is still a problem in the 2.6.23-rc5 kernel.

Here is an example of how the problem can occur:

   system of cpu's   0 1 2 3 4 5
   create cpuset /x      2 3 4 5 
   create cpuset /x/y    2 3
   all cpusets are cpu_exclusive

   disable cpu 3
     x is now            2   4 5
     x/y is now          2
   enable cpu 3
     cpusets x and x/y are unchanged

   to restore the cpusets:
     echo 2-5 > /dev/cpuset/x
     echo 2-3 > /dev/cpuset/x/y

   At the first echo, which restores 3 to cpuset x, update_cpu_domains() is
   called for cpuset x/. 
   system of cpu's   0 1 2 3 4 5
   x is now              2 3 4 5
   x/y is now            2

   The system is partitioned between:
	its parent, the root cpuset, minus its child (x/ is 2-5): 0-1
        and x/ (2-5) , minus its child (x/y/ 2): 3-5

   The sched_domain's for parent 0-1 are updated.
   The sched_domain's for current 3-5 are updated.

   But 2 has been untouched.
   As a result, 3's SD points to sched_group_phys[3] which is the only
   sched_group_phys on 3's list.  It points to itself.
   But 2's SD points to sched_group_phys[2], which still points to
   sched_group_phys[3].
   When cpu 2 executes find_busiest_group() it will hang on the non-
   circular sched_group list.
           
cpuset.c:

This solution is to update the sched_domain's for the cpuset
whose cpu's were changed and, in addition, all its children.
Instead of calling update_cpu_domains(), call update_cpu_domains_tree(),
which calls update_cpu_domains() for every node from the one specified
down to all its children.

The extra sched_domain reconstruction is overhead, but only at the
frequency of administrative change to the cpusets.

There seems to be no administrative procedural work-around.  In the
example above one could not reverse the two echo's and set x/y before
x/.  It is not logical, so not allowed (Permission denied).

Thus the patch to cpuset.c makes the sched_domain's correct.

sched.c:

The patch to sched.c prevents the cpu hangs that otherwise occur
until the sched_domain's are made correct.

It puts checks into find_busiest_group() and find_idlest_group()
that break from their loops on a sched_group that points to itself.
This is needed because cpu's are going through load balancing before all
sched_domains have been reconstructed (see the example above).

This is admittedly a kludge. I leave it to the scheduler gurus to recommend
a better way update the sched_domains or to keep cpus out of the
sched_domains while they are being reconstructed.

Diffed against 2.6.23-rc5

Signed-off-by: Cliff Wickman <cpw@sgi.com>

---
 kernel/cpuset.c |   43 +++++++++++++++++++++++++++++++++++++++----
 kernel/sched.c  |   18 ++++++++++++++----
 2 files changed, 53 insertions(+), 8 deletions(-)

Index: p1/kernel/sched.c
===================================================================
--- p1.orig/kernel/sched.c
+++ p1/kernel/sched.c
@@ -1219,11 +1219,14 @@ static inline unsigned long cpu_avg_load
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 {
-	struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *idlest = NULL, *this = sd->groups, *group = sd->groups;
+	struct sched_group *self, *prev;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
+	prev = group;
+	self = group;
 	do {
 		unsigned long load, avg_load;
 		int local_group;
@@ -1260,8 +1263,10 @@ find_idlest_group(struct sched_domain *s
 			idlest = group;
 		}
 nextgroup:
+		prev = self;
+		self = group;
 		group = group->next;
-	} while (group != sd->groups);
+	} while (group != sd->groups && group != self && group != prev);
 
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
@@ -2306,7 +2311,8 @@ find_busiest_group(struct sched_domain *
 		   unsigned long *imbalance, enum cpu_idle_type idle,
 		   int *sd_idle, cpumask_t *cpus, int *balance)
 {
-	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
+	struct sched_group *busiest = NULL, *this = sd->groups, *group = sd->groups;
+	struct sched_group *self, *prev;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
@@ -2329,6 +2335,8 @@ find_busiest_group(struct sched_domain *
 	else
 		load_idx = sd->idle_idx;
 
+	prev = group;
+	self = group;
 	do {
 		unsigned long load, group_capacity;
 		int local_group;
@@ -2461,8 +2469,10 @@ find_busiest_group(struct sched_domain *
 		}
 group_next:
 #endif
+		prev = self;
+		self = group;
 		group = group->next;
-	} while (group != sd->groups);
+	} while (group != sd->groups && group != self && group != prev);
 
 	if (!busiest || this_load >= max_load || busiest_nr_running == 0)
 		goto out_balanced;
Index: p1/kernel/cpuset.c
===================================================================
--- p1.orig/kernel/cpuset.c
+++ p1/kernel/cpuset.c
@@ -52,6 +52,7 @@
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
 #include <linux/mutex.h>
+#include <linux/kfifo.h>
 
 #define CPUSET_SUPER_MAGIC		0x27e0eb
 
@@ -789,8 +790,8 @@ static void update_cpu_domains(struct cp
 			return;
 		cspan = CPU_MASK_NONE;
 	} else {
-		if (cpus_empty(pspan))
-			return;
+		/* parent may be empty, but update anyway */
+
 		cspan = cur->cpus_allowed;
 		/*
 		 * Get all cpus from current cpuset's cpus_allowed not part
@@ -808,6 +809,40 @@ static void update_cpu_domains(struct cp
 }
 
 /*
+ * Call update_cpu_domains for cpuset "cur", and for all of its children.
+ *
+ * This walk processes the tree from top to bottom, completing one layer
+ * before dropping down to the next.  It always processes a node before
+ * any of its children.
+ *
+ * Call with manage_mutex held.
+ * Must not be called holding callback_mutex, because we must
+ * not call lock_cpu_hotplug() while holding callback_mutex.
+ */
+static void
+update_cpu_domains_tree(struct cpuset *root)
+{
+	struct cpuset *cp;	/* scans cpusets being updated */
+	struct cpuset *child;	/* scans child cpusets of cp */
+	struct kfifo *queue;	/* fifo queue of cpusets to be updated */
+
+	queue = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
+	if (queue == ERR_PTR(-ENOMEM))
+		return;
+
+	__kfifo_put(queue, (unsigned char *)&root, sizeof(root));
+
+	while (__kfifo_get(queue, (unsigned char *)&cp, sizeof(cp))) {
+		list_for_each_entry(child, &cp->children, sibling)
+		    __kfifo_put(queue,(unsigned char *)&child,sizeof(child));
+		update_cpu_domains(cp);
+	}
+
+	kfifo_free(queue);
+	return;
+}
+
+/*
  * Call with manage_mutex held.  May take callback_mutex during call.
  */
 
@@ -846,7 +881,7 @@ static int update_cpumask(struct cpuset 
 	cs->cpus_allowed = trialcs.cpus_allowed;
 	mutex_unlock(&callback_mutex);
 	if (is_cpu_exclusive(cs) && !cpus_unchanged)
-		update_cpu_domains(cs);
+		update_cpu_domains_tree(cs);
 	return 0;
 }
 
@@ -1087,7 +1122,7 @@ static int update_flag(cpuset_flagbits_t
 	mutex_unlock(&callback_mutex);
 
 	if (cpu_exclusive_changed)
-                update_cpu_domains(cs);
+                update_cpu_domains_tree(cs);
 	return 0;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-09-13 22:46 ` Andrew Morton
@ 2007-09-13  7:39   ` Nick Piggin
  2007-09-13 22:59   ` Paul Mundt
  2007-09-19  6:00   ` Paul Jackson
  2 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2007-09-13  7:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Cliff Wickman, linux-mm, Paul Jackson, Ingo Molnar

On Friday 14 September 2007 08:46, Andrew Morton wrote:
> On Fri, 07 Sep 2007 16:07:04 -0500
>
> cpw@sgi.com (Cliff Wickman) wrote:
> > Re-send of patch sent 8/23/2007, but refreshed for 2.6.23-rc5.
> >
> > This patch reconciles cpusets and sched_domains that get out of sync
> > due to disabling and re-enabling cpu's.
> >
> > This is still a problem in the 2.6.23-rc5 kernel.
> >
> > Here is an example of how the problem can occur:
> >
> >    system of cpu's   0 1 2 3 4 5
> >    create cpuset /x      2 3 4 5
> >    create cpuset /x/y    2 3
> >    all cpusets are cpu_exclusive
> >
> >    disable cpu 3
> >      x is now            2   4 5
> >      x/y is now          2
> >    enable cpu 3
> >      cpusets x and x/y are unchanged
> >
> >    to restore the cpusets:
> >      echo 2-5 > /dev/cpuset/x
> >      echo 2-3 > /dev/cpuset/x/y
> >
> >    At the first echo, which restores 3 to cpuset x, update_cpu_domains()
> > is called for cpuset x/.
> >    system of cpu's   0 1 2 3 4 5
> >    x is now              2 3 4 5
> >    x/y is now            2
> >
> >    The system is partitioned between:
> > 	its parent, the root cpuset, minus its child (x/ is 2-5): 0-1
> >         and x/ (2-5) , minus its child (x/y/ 2): 3-5
> >
> >    The sched_domain's for parent 0-1 are updated.
> >    The sched_domain's for current 3-5 are updated.
> >
> >    But 2 has been untouched.
> >    As a result, 3's SD points to sched_group_phys[3] which is the only
> >    sched_group_phys on 3's list.  It points to itself.
> >    But 2's SD points to sched_group_phys[2], which still points to
> >    sched_group_phys[3].
> >    When cpu 2 executes find_busiest_group() it will hang on the non-
> >    circular sched_group list.
> >
> > cpuset.c:
> >
> > This solution is to update the sched_domain's for the cpuset
> > whose cpu's were changed and, in addition, all its children.
> > Instead of calling update_cpu_domains(), call update_cpu_domains_tree(),
> > which calls update_cpu_domains() for every node from the one specified
> > down to all its children.
> >
> > The extra sched_domain reconstruction is overhead, but only at the
> > frequency of administrative change to the cpusets.
> >
> > There seems to be no administrative procedural work-around.  In the
> > example above one could not reverse the two echo's and set x/y before
> > x/.  It is not logical, so not allowed (Permission denied).
> >
> > Thus the patch to cpuset.c makes the sched_domain's correct.
> >
> > sched.c:
> >
> > The patch to sched.c prevents the cpu hangs that otherwise occur
> > until the sched_domain's are made correct.
> >
> > It puts checks into find_busiest_group() and find_idlest_group()
> > that break from their loops on a sched_group that points to itself.
> > This is needed because cpu's are going through load balancing before all
> > sched_domains have been reconstructed (see the example above).
> >
> > This is admittedly a kludge. I leave it to the scheduler gurus to
> > recommend a better way update the sched_domains or to keep cpus out of
> > the sched_domains while they are being reconstructed.
>
> You should cc scheduler gurus when hoping things about them ;)
>
> I suspect your change is fundamentally incompatible with, and perhaps
> obsoleted by
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc4/2.
>6.23-rc4-mm1/broken-out/cpuset-remove-sched-domain-hooks-from-cpusets.patch
>
> Problem is, cpuset-remove-sched-domain-hooks-from-cpusets.patch has been
> hanging around in -mm for a year while Paul makes up his mind about it.
>
> Can we please get all this sorted out??

cpus_exclusive is supposed to partition the system (like a dynamically
configurable isolcpus=). However the exact semantics of it IIRC are
defined such that it is pretty well unusable.

IIRC I have a patch that converts it to sane semantics and does
opportunistic sched domains partitioning and of course fixes the hotplug
issue as well... but Paul didn't like it for some reason.

Anyway, until there is some very clear semantic for what cpus_exclusive
does, then it is silly to retain the broken sched-domains code. I don't know
why that patch isn't merged, but it should be... (I thought it was when this
last came up ages ago).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-09-07 21:07 [PATCH 1/1] cpusets/sched_domain reconciliation Cliff Wickman
@ 2007-09-13 22:46 ` Andrew Morton
  2007-09-13  7:39   ` Nick Piggin
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andrew Morton @ 2007-09-13 22:46 UTC (permalink / raw)
  To: Cliff Wickman; +Cc: linux-mm, Nick Piggin, Paul Jackson, Ingo Molnar

On Fri, 07 Sep 2007 16:07:04 -0500
cpw@sgi.com (Cliff Wickman) wrote:

> 
> 
> 
> Re-send of patch sent 8/23/2007, but refreshed for 2.6.23-rc5.
> 
> This patch reconciles cpusets and sched_domains that get out of sync
> due to disabling and re-enabling cpu's.
> 
> This is still a problem in the 2.6.23-rc5 kernel.
> 
> Here is an example of how the problem can occur:
> 
>    system of cpu's   0 1 2 3 4 5
>    create cpuset /x      2 3 4 5 
>    create cpuset /x/y    2 3
>    all cpusets are cpu_exclusive
> 
>    disable cpu 3
>      x is now            2   4 5
>      x/y is now          2
>    enable cpu 3
>      cpusets x and x/y are unchanged
> 
>    to restore the cpusets:
>      echo 2-5 > /dev/cpuset/x
>      echo 2-3 > /dev/cpuset/x/y
> 
>    At the first echo, which restores 3 to cpuset x, update_cpu_domains() is
>    called for cpuset x/. 
>    system of cpu's   0 1 2 3 4 5
>    x is now              2 3 4 5
>    x/y is now            2
> 
>    The system is partitioned between:
> 	its parent, the root cpuset, minus its child (x/ is 2-5): 0-1
>         and x/ (2-5) , minus its child (x/y/ 2): 3-5
> 
>    The sched_domain's for parent 0-1 are updated.
>    The sched_domain's for current 3-5 are updated.
> 
>    But 2 has been untouched.
>    As a result, 3's SD points to sched_group_phys[3] which is the only
>    sched_group_phys on 3's list.  It points to itself.
>    But 2's SD points to sched_group_phys[2], which still points to
>    sched_group_phys[3].
>    When cpu 2 executes find_busiest_group() it will hang on the non-
>    circular sched_group list.
>            
> cpuset.c:
> 
> This solution is to update the sched_domain's for the cpuset
> whose cpu's were changed and, in addition, all its children.
> Instead of calling update_cpu_domains(), call update_cpu_domains_tree(),
> which calls update_cpu_domains() for every node from the one specified
> down to all its children.
> 
> The extra sched_domain reconstruction is overhead, but only at the
> frequency of administrative change to the cpusets.
> 
> There seems to be no administrative procedural work-around.  In the
> example above one could not reverse the two echo's and set x/y before
> x/.  It is not logical, so not allowed (Permission denied).
> 
> Thus the patch to cpuset.c makes the sched_domain's correct.
> 
> sched.c:
> 
> The patch to sched.c prevents the cpu hangs that otherwise occur
> until the sched_domain's are made correct.
> 
> It puts checks into find_busiest_group() and find_idlest_group()
> that break from their loops on a sched_group that points to itself.
> This is needed because cpu's are going through load balancing before all
> sched_domains have been reconstructed (see the example above).
> 
> This is admittedly a kludge. I leave it to the scheduler gurus to recommend
> a better way update the sched_domains or to keep cpus out of the
> sched_domains while they are being reconstructed.
> 

You should cc scheduler gurus when hoping things about them ;)

I suspect your change is fundamentally incompatible with, and perhaps
obsoleted by
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc4/2.6.23-rc4-mm1/broken-out/cpuset-remove-sched-domain-hooks-from-cpusets.patch

Problem is, cpuset-remove-sched-domain-hooks-from-cpusets.patch has been
hanging around in -mm for a year while Paul makes up his mind about it.

Can we please get all this sorted out??

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-09-13 22:46 ` Andrew Morton
  2007-09-13  7:39   ` Nick Piggin
@ 2007-09-13 22:59   ` Paul Mundt
  2007-09-19  6:00   ` Paul Jackson
  2 siblings, 0 replies; 11+ messages in thread
From: Paul Mundt @ 2007-09-13 22:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Cliff Wickman, linux-mm, Nick Piggin, Paul Jackson, Ingo Molnar

On Thu, Sep 13, 2007 at 03:46:07PM -0700, Andrew Morton wrote:
> On Fri, 07 Sep 2007 16:07:04 -0500
> cpw@sgi.com (Cliff Wickman) wrote:
> > Thus the patch to cpuset.c makes the sched_domain's correct.
> 
> You should cc scheduler gurus when hoping things about them ;)
> 
> I suspect your change is fundamentally incompatible with, and perhaps
> obsoleted by
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc4/2.6.23-rc4-mm1/broken-out/cpuset-remove-sched-domain-hooks-from-cpusets.patch
> 
> Problem is, cpuset-remove-sched-domain-hooks-from-cpusets.patch has been
> hanging around in -mm for a year while Paul makes up his mind about it.
> 
> Can we please get all this sorted out??
> 
Note that removing the scheduler domain hooks also fixes up the build for
cpusets on UP NUMA. If this patch isn't going to go in, the other
alternative is simply to stub out scheduler domains there.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/1] cpusets/sched_domain reconciliation
  2007-09-13 22:46 ` Andrew Morton
  2007-09-13  7:39   ` Nick Piggin
  2007-09-13 22:59   ` Paul Mundt
@ 2007-09-19  6:00   ` Paul Jackson
  2 siblings, 0 replies; 11+ messages in thread
From: Paul Jackson @ 2007-09-19  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cpw, linux-mm, nickpiggin, mingo

Andrew replied to Cliff a few days ago:
> I suspect your change is fundamentally incompatible with, and perhaps
> obsoleted by ... cpuset-remove-sched-domain-hooks-from-cpusets.patch

I suspect the same thing.

> Problem is, cpuset-remove-sched-domain-hooks-from-cpusets.patch has been
> hanging around in -mm for a year while Paul makes up his mind about it.
> 
> Can we please get all this sorted out??

I'll see what I can do.  Cpuset support seems to have finally gotten
back to the front of my queue.

Sorry for the absurdly long hiatus.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-09-19  6:00 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-07 21:07 [PATCH 1/1] cpusets/sched_domain reconciliation Cliff Wickman
2007-09-13 22:46 ` Andrew Morton
2007-09-13  7:39   ` Nick Piggin
2007-09-13 22:59   ` Paul Mundt
2007-09-19  6:00   ` Paul Jackson
  -- strict thread matches above, loose matches on Subject: below --
2007-08-23 22:05 Cliff Wickman
2007-03-20 19:14 Cliff Wickman
2007-03-22 21:21 ` Andrew Morton
2007-03-22 22:39   ` Cliff Wickman
2007-03-22 22:32     ` Andrew Morton
2007-03-23  5:07       ` Paul Jackson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.