public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-30 10:44 [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
@ 2007-09-29 19:21 ` Nick Piggin
  2007-09-30 18:07   ` Paul Jackson
  2007-09-30 10:44 ` [PATCH] cpuset decrustify update and validate masks Paul Jackson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-09-29 19:21 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, Paul Menage, linux-kernel, Dinakar Guniguntala,
	cpw, Ingo Molnar

On Sunday 30 September 2007 20:44, Paul Jackson wrote:
> From: Paul Jackson <pj@sgi.com>
>
> Add a new per-cpuset flag called 'sched_load_balance'.
>
> When enabled in a cpuset (the default value) it tells the kernel
> scheduler that the scheduler should provide the normal load
> balancing on the CPUs in that cpuset, sometimes moving tasks
> from one CPU to a second CPU if the second CPU is less loaded
> and if that task is allowed to run there.

I don't like adding these funny special case sort of things like this.
The user should just be able to specify exactly the partitioning of
tasks required, and cpusets should ask the scheduler to do the best
job of load balancing possible.

I implemented that with my patches to do automatic discovery
of the largest set of disjoint cpusets.

>From there, the problem that cpusets has, is that it lacks a good
way to specify that the machine should be partitioned (IIRC because
stuff defaults to going into the root cpuset which covers all CPUs?).

Instead of adding these "I want the scheduler to do something a bit
vague but hopefully good albeit with some downsides" flags, there
should be a way to say "I want to partition the CPUs like so...", IMO.

Barring that (ie. maybe you always want a root cpuset to cover all
CPUs), then maybe we should retain the spanning sched domains
in order to balance the root cpuset, and add another set of domains
according to cpuset partitioning. This could be entirely transparent
to userspace, I think (using my patch).

Just my opinion. Good to see more thought going into this area, because
it is something that sched-domains can do really well but is underused.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-30 18:07   ` Paul Jackson
@ 2007-09-30  3:34     ` Nick Piggin
  2007-10-01  3:42       ` Paul Jackson
  2007-10-01 18:15       ` Paul Jackson
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2007-09-30  3:34 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Monday 01 October 2007 04:07, Paul Jackson wrote:
> Nick wrote:
> > The user should just be able to specify exactly the partitioning of
> > tasks required, and cpusets should ask the scheduler to do the best
> > job of load balancing possible.
>
> If the cpusets which have 'sched_load_balance' enabled are disjoint
> (their 'cpus' cpus_allowed masks don't overlap) then you get exactly
> what you're asking for.  In that case there is exactly one sched domain
> for the 'cpus' allowed by each cpuset that has sched_load_balanced
> enabled.

But you could do that just by having the current cpuset scheme able
to properly partition the system. You can't (easily) do this now because
you have so many tasks in the root cpuset that it is impossible to know
whether or not you actually want to load balance them.

You would do this by creating partitioning cpusets which carve up the
root cpuset (basically -- have multiple roots).


> But there is another case in which one does not want what you ask for.
>
> That case involves the situation where one is running a third part
> batch scheduler on part of ones big system, and doing other stuff
> (perhaps Ingo's realtime stuff) on another part of the system.

In this case the admin would simply not partition the system (they
would retain a single root cpuset).

Neither approach is really fundamentally more or less powerful than
the other, but what I object to in yours is adding these flags which
don't allow the admin to specify what they want, but to specify how they
want it done.

Moreover, sched_load_balance doesn't really sound like a good name
for asking for a partition. It's more like you're just asking to have better
load balancing over that set, which you could equally achieve by adding
a second set of sched domains (and the global domains could keep
globally balancing).

Basically: the admin doesn't know best when it comes to how the
scheduler should work; the admin knows best about how they intend
the system to be used.


> In that case, the system admin will be advised to turn off
> sched_load_balance on the top cpuset.  But in that case the system
> admin will -not- know from moment to moment what jobs the batch
> scheduler is running on the cpus assigned to its control.  Only the
> batch scheduler knows that.
>
> The batch scheduler is code that was written by someone else, in
> some other company, some other time.  That code does not get to
> control the overall sched domain partitioning of the entire system.
> The batch scheduler gets to say, in affect:
>
>   Here's where I need load balancing to occur, in the normal fashion,
>   and here's where I don't need it.

Rather than require the admin to know the intricate details about
how and why the scheduler load balancing gets broken, and when they
might or might not need to use this flag, they can just specify what they
want to be done, and the kernel can choose the optimal strategy.


> In short, you insisting that only a single administrative point of
> control determine the systems sched domains.  Sometimes that fits
> the way the system is managed, and my patch lets you do that.  But
> sometimes this is a shared responsibility, between a piece of third
> party software and the system admin, and my patch allows for that
> case as well.
>
> This is a typical sort of situation that arises from having hierarchical
> cpuset definitions, and highlights the reason (and the use case,
> involving third party batch schedulers) that I went with a hierarchical
> cpuset architecture in the first place.

No, I'm insisting that *no* single administrative point of control
determines the sched domains. Not directly. The kernel should.
cpusets API should be rich enough that the kernel can derive tihs
information from what the admin has intended.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH] cpuset and sched domains: sched_load_balance flag
@ 2007-09-30 10:44 Paul Jackson
  2007-09-29 19:21 ` Nick Piggin
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Paul Jackson @ 2007-09-30 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nickpiggin, Paul Menage, linux-kernel, Dinakar Guniguntala,
	Paul Jackson, cpw, Ingo Molnar

From: Paul Jackson <pj@sgi.com>

Add a new per-cpuset flag called 'sched_load_balance'.

When enabled in a cpuset (the default value) it tells the kernel
scheduler that the scheduler should provide the normal load
balancing on the CPUs in that cpuset, sometimes moving tasks
from one CPU to a second CPU if the second CPU is less loaded
and if that task is allowed to run there.

When disabled (write "0" to the file) then it tells the kernel
scheduler that load balancing is not required for the CPUs in
that cpuset.

Now even if this flag is disabled for some cpuset, the kernel
may still have to load balance some or all the CPUs in that
cpuset, if some overlapping cpuset has its sched_load_balance
flag enabled.

If there are some CPUs that are not in any cpuset whose
sched_load_balance flag is enabled, the kernel scheduler will
not load balance tasks to those CPUs.

Moreover the kernel will partition the 'sched domains'
(non-overlapping sets of CPUs over which load balancing is
attempted) into the finest granularity partition that it can
find, while still keeping any two CPUs that are in the same
shed_load_balance enabled cpuset in the same element of the
partition.

This serves two purposes:
 1) It provides a mechanism for real time isolation of some CPUs, and
 2) it can be used to improve performance on systems with many CPUs
    by supporting configurations in which load balancing is not done
    across all CPUs at once, but rather only done in several smaller
    disjoint sets of CPUs.

This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets

See further the Documentation and comments in the code itself.

Acked-by: Paul Jackson <pj@sgi.com>

---

Andrew - this patch goes right after your *-mm patch:
  task-containers-enable-containers-by-default-in-some-configs.patch
and before "add-containerstats-v3.patch"

 Documentation/cpusets.txt |  141 +++++++++++++++++++++++++
 include/linux/sched.h     |    2 
 kernel/cpuset.c           |  254 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched.c            |   72 ++++++++++---
 4 files changed, 450 insertions(+), 19 deletions(-)

--- 2.6.23-rc8-mm1.orig/Documentation/cpusets.txt	2007-09-29 23:56:40.987962675 -0700
+++ 2.6.23-rc8-mm1/Documentation/cpusets.txt	2007-09-29 23:57:51.092979187 -0700
@@ -19,7 +19,8 @@ CONTENTS:
   1.4 What are exclusive cpusets ?
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
-  1.7 How do I use cpusets ?
+  1.7 What is sched_load_balance ?
+  1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -359,8 +360,144 @@ policy, especially for jobs that might h
 data set, the memory allocation across the nodes in the jobs cpuset
 can become very uneven.
 
+1.7 What is sched_load_balance ?
+--------------------------------
+
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks.  If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced.  So the scheduler
+has support to  partition the systems CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpusets allowed 'cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwised pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "sched_load_balance" is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendent cpusets.  Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance" as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest.  Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding.  So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both.  We won't move
+a task to a CPU outside it cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration.  If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above.  In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.)  When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain.)
+
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can.  It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system. This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+
+This partition exactly defines what sched domains the scheduler should
+setup - one sched domain for each element (cpumask_t) in the partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
 
-1.7 How do I use cpusets ?
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
--- 2.6.23-rc8-mm1.orig/include/linux/sched.h	2007-09-29 23:56:40.987962675 -0700
+++ 2.6.23-rc8-mm1/include/linux/sched.h	2007-09-29 23:57:51.116979535 -0700
@@ -713,6 +713,8 @@ struct sched_domain {
 #endif
 };
 
+extern void partition_sched_domains(int ndoms_new, cpumask_t *doms_new);
+
 #endif	/* CONFIG_SMP */
 
 /*
--- 2.6.23-rc8-mm1.orig/kernel/cpuset.c	2007-09-29 23:56:40.987962675 -0700
+++ 2.6.23-rc8-mm1/kernel/cpuset.c	2007-09-29 23:57:51.148979999 -0700
@@ -4,7 +4,7 @@
  *  Processor and Memory placement constraints for sets of tasks.
  *
  *  Copyright (C) 2003 BULL SA.
- *  Copyright (C) 2004-2006 Silicon Graphics, Inc.
+ *  Copyright (C) 2004-2007 Silicon Graphics, Inc.
  *  Copyright (C) 2006 Google, Inc
  *
  *  Portions derived from Patrick Mochel's sysfs code.
@@ -54,6 +54,7 @@
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
 #include <linux/mutex.h>
+#include <linux/kfifo.h>
 
 /*
  * Tracks how many cpusets are currently defined in system.
@@ -91,6 +92,9 @@ struct cpuset {
 	int mems_generation;
 
 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	/* partition number for rebuild_sched_domains() */
+	int pn;
 };
 
 /* Retrieve the cpuset for a cgroup */
@@ -113,6 +117,7 @@ typedef enum {
 	CS_CPU_EXCLUSIVE,
 	CS_MEM_EXCLUSIVE,
 	CS_MEMORY_MIGRATE,
+	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
 } cpuset_flagbits_t;
@@ -128,6 +133,11 @@ static inline int is_mem_exclusive(const
 	return test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
 }
 
+static inline int is_sched_load_balance(const struct cpuset *cs)
+{
+	return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+}
+
 static inline int is_memory_migrate(const struct cpuset *cs)
 {
 	return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
@@ -482,6 +492,189 @@ static int validate_change(const struct 
 }
 
 /*
+ * Helper routine for rebuild_sched_domains().
+ * Do cpusets a, b have overlapping cpus_allowed masks?
+ */
+
+static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
+{
+	return cpus_intersects(a->cpus_allowed, b->cpus_allowed);
+}
+
+/*
+ * rebuild_sched_domains()
+ *
+ * If the flag 'sched_load_balance' of any cpuset with non-empty
+ * 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
+ * which has that flag enabled, or if any cpuset with a non-empty
+ * 'cpus' is removed, then call this routine to rebuild the
+ * scheduler's dynamic sched domains.
+ *
+ * This routine builds a partial partition of the systems CPUs
+ * (the set of non-overlappping cpumask_t's in the array 'part'
+ * below), and passes that partial partition to the kernel/sched.c
+ * partition_sched_domains() routine, which will rebuild the
+ * schedulers load balancing domains (sched domains) as specified
+ * by that partial partition.  A 'partial partition' is a set of
+ * non-overlapping subsets whose union is a subset of that set.
+ *
+ * See "What is sched_load_balance" in Documentation/cpusets.txt
+ * for a background explanation of this.
+ *
+ * Does not return errors, on the theory that the callers of this
+ * routine would rather not worry about failures to rebuild sched
+ * domains when operating in the severe memory shortage situations
+ * that could cause allocation failures below.
+ *
+ * Call with cgroup_mutex held.  May take callback_mutex during
+ * call due to the kfifo_alloc() and kmalloc() calls.  May nest
+ * a call to the lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
+ * Must not be called holding callback_mutex, because we must not
+ * call lock_cpu_hotplug() while holding callback_mutex.  Elsewhere
+ * the kernel nests callback_mutex inside lock_cpu_hotplug() calls.
+ * So the reverse nesting would risk an ABBA deadlock.
+ *
+ * The three key local variables below are:
+ *    q  - a kfifo queue of cpuset pointers, used to implement a
+ *	   top-down scan of all cpusets.  This scan loads a pointer
+ *	   to each cpuset marked is_sched_load_balance into the
+ *	   array 'csa'.  For our purposes, rebuilding the schedulers
+ *	   sched domains, we can ignore !is_sched_load_balance cpusets.
+ *  csa  - (for CpuSet Array) Array of pointers to all the cpusets
+ *	   that need to be load balanced, for convenient iterative
+ *	   access by the subsequent code that finds the best partition,
+ *	   i.e the set of domains (subsets) of CPUs such that the
+ *	   cpus_allowed of every cpuset marked is_sched_load_balance
+ *	   is a subset of one of these domains, while there are as
+ *	   many such domains as possible, each as small as possible.
+ * doms  - Conversion of 'csa' to an array of cpumasks, for passing to
+ *	   the kernel/sched.c routine partition_sched_domains() in a
+ *	   convenient format, that can be easily compared to the prior
+ *	   value to determine what partition elements (sched domains)
+ *	   were changed (added or removed.)
+ *
+ * Finding the best partition (set of domains):
+ *	The triple nested loops below over i, j, k scan over the
+ *	load balanced cpusets (using the array of cpuset pointers in
+ *	csa[]) looking for pairs of cpusets that have overlapping
+ *	cpus_allowed, but which don't have the same 'pn' partition
+ *	number and gives them in the same partition number.  It keeps
+ *	looping on the 'restart' label until it can no longer find
+ *	any such pairs.
+ *
+ *	The union of the cpus_allowed masks from the set of
+ *	all cpusets having the same 'pn' value then form the one
+ *	element of the partition (one sched domain) to be passed to
+ *	partition_sched_domains().
+ */
+
+static void rebuild_sched_domains(void)
+{
+	struct kfifo *q;	/* queue of cpusets to be scanned */
+	struct cpuset *cp;	/* scans q */
+	struct cpuset **csa;	/* array of all cpuset ptrs */
+	int csn;		/* how many cpuset ptrs in csa so far */
+	int i, j, k;		/* indices for partition finding loops */
+	cpumask_t *doms;	/* resulting partition; i.e. sched domains */
+	int ndoms;		/* number of sched domains in result */
+	int nslot;		/* next empty doms[] cpumask_t slot */
+
+	q = NULL; csa = NULL; doms = NULL;
+
+	/* Special case for the 99% of systems with one, full, sched domain */
+	if (is_sched_load_balance(&top_cpuset)) {
+		ndoms = 1;
+		doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
+		*doms = top_cpuset.cpus_allowed;
+		goto rebuild;
+	}
+
+	q = kfifo_alloc(number_of_cpusets * sizeof(cp), GFP_KERNEL, NULL);
+	if (IS_ERR(q))
+		goto done;
+	csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
+	if (!csa)
+		goto done;
+	csn = 0;
+
+	cp = &top_cpuset;
+	__kfifo_put(q, (void *)&cp, sizeof(cp));
+	while (__kfifo_get(q, (void *)&cp, sizeof(cp))) {
+		struct cgroup *cont;
+		struct cpuset *child;   /* scans child cpusets of cp */
+		if (is_sched_load_balance(cp))
+			csa[csn++] = cp;
+		list_for_each_entry(cont, &cp->css.cgroup->children, sibling)
+			child = cgroup_cs(cont);
+			__kfifo_put(q, (void *)&child, sizeof(cp));
+  	}
+
+	for (i = 0; i < csn; i++)
+		csa[i]->pn = i;
+	ndoms = csn;
+
+restart:
+	/* Find the best partition (set of sched domains) */
+	for (i = 0; i < csn; i++) {
+		struct cpuset *a = csa[i];
+
+		for (j = 0; j < csn; j++) {
+			struct cpuset *b = csa[j];
+
+			if (a->pn != b->pn && cpusets_overlap(a, b)) {
+				for (k = 0; k < csn; k++) {
+					struct cpuset *c = csa[k];
+
+					if (c->pn == b->pn)
+						c->pn = a->pn;
+				}
+				ndoms--;	/* one less element */
+				goto restart;
+			}
+		}
+	}
+
+	/* Convert <csn, csa> to <ndoms, doms> */
+	doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
+	if (!doms)
+		goto done;
+
+	for (nslot = 0, i = 0; i < csn; i++) {
+		struct cpuset *a = csa[i];
+		int apn = a->pn;
+
+		if (apn >= 0) {
+			cpumask_t *dp = doms + nslot;
+
+			cpus_clear(*dp);
+			for (j = i; j < csn; j++) {
+				struct cpuset *b = csa[j];
+
+				if (apn == b->pn) {
+					cpus_or(*dp, *dp, b->cpus_allowed);
+					b->pn = -1;
+				}
+			}
+			nslot++;
+		}
+	}
+	BUG_ON(nslot != ndoms);
+
+rebuild:
+	/* Have scheduler rebuild sched domains */
+	lock_cpu_hotplug();
+	partition_sched_domains(ndoms, doms);
+	unlock_cpu_hotplug();
+
+done:
+	if (q && !IS_ERR(q))
+		kfree(q);
+	if (csa)
+		kfree(csa);
+	/* Don't kfree(doms) -- partition_sched_domains() does that. */
+}
+
+/*
  * Call with manage_mutex held.  May take callback_mutex during call.
  */
 
@@ -489,6 +682,7 @@ static int update_cpumask(struct cpuset 
 {
 	struct cpuset trialcs;
 	int retval;
+	int cpus_changed, is_load_balanced;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
 	if (cs == &top_cpuset)
@@ -516,9 +710,17 @@ static int update_cpumask(struct cpuset 
 	retval = validate_change(cs, &trialcs);
 	if (retval < 0)
 		return retval;
+
+	cpus_changed = !cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed);
+	is_load_balanced = is_sched_load_balance(&trialcs);
+
 	mutex_lock(&callback_mutex);
 	cs->cpus_allowed = trialcs.cpus_allowed;
 	mutex_unlock(&callback_mutex);
+
+	if (cpus_changed && is_load_balanced)
+		rebuild_sched_domains();
+
 	return 0;
 }
 
@@ -752,6 +954,7 @@ static int update_memory_pressure_enable
 /*
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
+ *				CS_SCHED_LOAD_BALANCE,
  *				CS_NOTIFY_ON_RELEASE, CS_MEMORY_MIGRATE,
  *				CS_SPREAD_PAGE, CS_SPREAD_SLAB)
  * cs:	the cpuset to update
@@ -765,6 +968,7 @@ static int update_flag(cpuset_flagbits_t
 	int turning_on;
 	struct cpuset trialcs;
 	int err;
+	int cpus_nonempty, balance_flag_changed;
 
 	turning_on = (simple_strtoul(buf, NULL, 10) != 0);
 
@@ -777,10 +981,18 @@ static int update_flag(cpuset_flagbits_t
 	err = validate_change(cs, &trialcs);
 	if (err < 0)
 		return err;
+
+	cpus_nonempty = !cpus_empty(trialcs.cpus_allowed);
+	balance_flag_changed = (is_sched_load_balance(cs) !=
+		 			is_sched_load_balance(&trialcs));
+
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs.flags;
 	mutex_unlock(&callback_mutex);
 
+	if (cpus_nonempty && balance_flag_changed)
+		rebuild_sched_domains();
+
 	return 0;
 }
 
@@ -928,6 +1140,7 @@ typedef enum {
 	FILE_MEMLIST,
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
+	FILE_SCHED_LOAD_BALANCE,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
@@ -946,7 +1159,7 @@ static ssize_t cpuset_common_file_write(
 	int retval = 0;
 
 	/* Crude upper limit on largest legitimate cpulist user might write. */
-	if (nbytes > 100 + 6 * max(NR_CPUS, MAX_NUMNODES))
+	if (nbytes > 100U + 6 * max(NR_CPUS, MAX_NUMNODES))
 		return -E2BIG;
 
 	/* +1 for nul-terminator */
@@ -979,6 +1192,9 @@ static ssize_t cpuset_common_file_write(
 	case FILE_MEM_EXCLUSIVE:
 		retval = update_flag(CS_MEM_EXCLUSIVE, cs, buffer);
 		break;
+	case FILE_SCHED_LOAD_BALANCE:
+		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, buffer);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		retval = update_flag(CS_MEMORY_MIGRATE, cs, buffer);
 		break;
@@ -1074,6 +1290,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_MEM_EXCLUSIVE:
 		*s++ = is_mem_exclusive(cs) ? '1' : '0';
 		break;
+	case FILE_SCHED_LOAD_BALANCE:
+		*s++ = is_sched_load_balance(cs) ? '1' : '0';
+		break;
 	case FILE_MEMORY_MIGRATE:
 		*s++ = is_memory_migrate(cs) ? '1' : '0';
 		break;
@@ -1137,6 +1356,11 @@ static struct cftype cft_mem_exclusive =
 	.private = FILE_MEM_EXCLUSIVE,
 };
 
+static struct cftype cft_sched_load_balance = {
+	.name = "sched_load_balance",
+	.private = FILE_SCHED_LOAD_BALANCE,
+};
+
 static struct cftype cft_memory_migrate = {
 	.name = "memory_migrate",
 	.read = cpuset_common_file_read,
@@ -1186,6 +1410,8 @@ static int cpuset_populate(struct cgroup
 		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_memory_migrate)) < 0)
 		return err;
+	if ((err = cgroup_add_file(cont, ss, &cft_sched_load_balance)) < 0)
+		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_memory_pressure)) < 0)
 		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_spread_page)) < 0)
@@ -1267,6 +1493,7 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
+	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cs->cpus_allowed = CPU_MASK_NONE;
 	cs->mems_allowed = NODE_MASK_NONE;
 	cs->mems_generation = cpuset_mems_generation++;
@@ -1277,11 +1504,27 @@ static struct cgroup_subsys_state *cpuse
 	return &cs->css ;
 }
 
+/*
+ * Locking note on the strange update_flag() call below:
+ *
+ * If the cpuset being removed has its flag 'sched_load_balance'
+ * enabled, then simulate turning sched_load_balance off, which
+ * will call rebuild_sched_domains().  The lock_cpu_hotplug()
+ * call in rebuild_sched_domains() must not be made while holding
+ * callback_mutex.  Elsewhere the kernel nests callback_mutex inside
+ * lock_cpu_hotplug() calls.  So the reverse nesting would risk an
+ * ABBA deadlock.
+ */
+
 static void cpuset_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	struct cpuset *cs = cgroup_cs(cont);
 
 	cpuset_update_task_memory_state();
+
+	if (is_sched_load_balance(cs))
+		update_flag(CS_SCHED_LOAD_BALANCE, cs, "0");
+
 	number_of_cpusets--;
 	kfree(cs);
 }
@@ -1326,6 +1569,7 @@ int __init cpuset_init(void)
 
 	fmeter_init(&top_cpuset.fmeter);
 	top_cpuset.mems_generation = cpuset_mems_generation++;
+	set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
 
 	err = register_filesystem(&cpuset_fs_type);
 	if (err < 0)
@@ -1412,8 +1656,8 @@ static void common_cpu_mem_hotplug_unplu
  * cpu_online_map on each CPU hotplug (cpuhp) event.
  */
 
-static int cpuset_handle_cpuhp(struct notifier_block *nb,
-				unsigned long phase, void *cpu)
+static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
+				unsigned long phase, void *unused_cpu)
 {
 	if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
 		return NOTIFY_DONE;
@@ -1803,7 +2047,7 @@ void __cpuset_memory_pressure_bump(void)
  *    the_top_cpuset_hack in cpuset_exit(), which sets an exiting tasks
  *    cpuset to top_cpuset.
  */
-static int proc_cpuset_show(struct seq_file *m, void *v)
+static int proc_cpuset_show(struct seq_file *m, void *unused_v)
 {
 	struct pid *pid;
 	struct task_struct *tsk;
--- 2.6.23-rc8-mm1.orig/kernel/sched.c	2007-09-29 23:56:40.987962675 -0700
+++ 2.6.23-rc8-mm1/kernel/sched.c	2007-09-29 23:57:51.180980463 -0700
@@ -6321,24 +6321,22 @@ error:
 	return -ENOMEM;
 #endif
 }
+
+static cpumask_t *doms_cur;	/* current sched domains */
+static int ndoms_cur;		/* number of sched domains in 'doms_cur' */
+
 /*
  * Set up scheduler domains and groups.  Callers must hold the hotplug lock.
+ * For now this just excludes isolated cpus, but could be used to
+ * exclude other special cases in the future.
  */
 static int arch_init_sched_domains(const cpumask_t *cpu_map)
 {
-	cpumask_t cpu_default_map;
-	int err;
-
-	/*
-	 * Setup mask for cpus without special case scheduling requirements.
-	 * For now this just excludes isolated cpus, but could be used to
-	 * exclude other special cases in the future.
-	 */
-	cpus_andnot(cpu_default_map, *cpu_map, cpu_isolated_map);
+	ndoms_cur = 1;
+	doms_cur =  kmalloc(sizeof(cpumask_t), GFP_KERNEL);
+	cpus_andnot(*doms_cur, *cpu_map, cpu_isolated_map);
 
-	err = build_sched_domains(&cpu_default_map);
-
-	return err;
+	return build_sched_domains(doms_cur);
 }
 
 static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
@@ -6360,6 +6358,56 @@ static void detach_destroy_domains(const
 	arch_destroy_sched_domains(cpu_map);
 }
 
+/*
+ * Partition sched domains as specified by the 'ndoms_new'
+ * cpumasks in the array doms_new[] of cpumasks.  This compares
+ * doms_new[] to the current sched domain partitioning, doms_cur[].
+ * It destroys each deleted domain and builds each new domain.
+ *
+ * 'doms_new' is an array of cpumask_t's of length 'ndoms_new'.
+ * The masks don't intersect (don't overlap.)  We should setup one
+ * sched domain for each mask.  CPUs not in any of the cpumasks will
+ * not be load balanced.  If the same cpumask appears both in the
+ * current 'doms_cur' domains and in the new 'doms_new', we can leave
+ * it as it is.
+ *
+ * The passed in 'doms_new' must be kmalloc'd, and this routine takes
+ * ownership of it and will kfree it when done with it.
+ *
+ * Call with hotplug lock held
+ */
+void partition_sched_domains(int ndoms_new, cpumask_t *doms_new)
+{
+	int i, j;
+
+	/* Destroy deleted domains */
+	for (i = 0; i < ndoms_cur; i++) {
+		for (j = 0; j < ndoms_new; j++) {
+			if (cpus_equal(doms_cur[i], doms_new[j]))
+				goto match1;
+		}
+		/* no match - a current sched domain not in new doms_new[] */
+		detach_destroy_domains(doms_cur + i);
+match1:;
+	}
+
+	/* Build new domains */
+	for (i = 0; i < ndoms_new; i++) {
+		for (j = 0; j < ndoms_cur; j++) {
+			if (cpus_equal(doms_new[i], doms_cur[j]))
+				goto match2;
+		}
+		/* no match - add a new doms_new */
+		build_sched_domains(doms_new + i);
+match2:;
+	}
+
+	/* Remember the new sched domains */
+	kfree(doms_cur);
+	doms_cur = doms_new;
+	ndoms_cur = ndoms_new;
+}
+
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
 static int arch_reinit_sched_domains(void)
 {

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH] cpuset decrustify update and validate masks
  2007-09-30 10:44 [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
  2007-09-29 19:21 ` Nick Piggin
@ 2007-09-30 10:44 ` Paul Jackson
  2007-09-30 17:33 ` [PATCH] cpuset and sched domains: sched_load_balance flag Ingo Molnar
  2007-10-02 20:22 ` Randy Dunlap
  3 siblings, 0 replies; 38+ messages in thread
From: Paul Jackson @ 2007-09-30 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: nickpiggin, Paul Menage, linux-kernel, Dinakar Guniguntala,
	Paul Jackson, cpw, Ingo Molnar

From: Paul Jackson <pj@sgi.com>

The kernel/cpuset.c code handling the updating of a cpusets
'cpus' and 'mems' masks was starting to look a little bit
crufty to me.

So I rewrote it a little bit.  Other than subtle improvements
in the consistency of identifying white space at the beginning
and end of passed in masks, I don't see that it makes any
visible difference in behaviour.  But it's one or two hundred
kernel text bytes smaller, and to my eye, easier to understand.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

Andrew - this patch goes after:
  cpuset-and-sched-domains-sched_load_balance-flag

 kernel/cpuset.c |   50 ++++++++++++++++++++------------------------------
 1 file changed, 20 insertions(+), 30 deletions(-)

--- 2.6.23-rc8-mm1.orig/kernel/cpuset.c	2007-09-30 01:27:28.442825126 -0700
+++ 2.6.23-rc8-mm1/kernel/cpuset.c	2007-09-30 01:38:22.829256421 -0700
@@ -488,6 +488,14 @@ static int validate_change(const struct 
 			return -EINVAL;
 	}
 
+	/* Cpusets with tasks can't have empty cpus_allowed or mems_allowed */
+	if (cgroup_task_count(cur->css.cgroup)) {
+		if (cpus_empty(trial->cpus_allowed) ||
+	    	    nodes_empty(trial->mems_allowed)) {
+			return -ENOSPC;
+		}
+	}
+
 	return 0;
 }
 
@@ -691,11 +699,13 @@ static int update_cpumask(struct cpuset 
 	trialcs = *cs;
 
 	/*
-	 * We allow a cpuset's cpus_allowed to be empty; if it has attached
-	 * tasks, we'll catch it later when we validate the change and return
-	 * -ENOSPC.
+	 * An empty cpus_allowed is ok iff there are no tasks in the cpuset.
+	 * Since cpulist_parse() fails on an empty mask, we special case
+	 * that parsing.  The validate_change() call ensures that cpusets
+	 * with tasks have cpus.
 	 */
-	if (!buf[0] || (buf[0] == '\n' && !buf[1])) {
+	buf = strstrip(buf);
+	if (!*buf) {
 		cpus_clear(trialcs.cpus_allowed);
 	} else {
 		retval = cpulist_parse(buf, trialcs.cpus_allowed);
@@ -703,10 +713,6 @@ static int update_cpumask(struct cpuset 
 			return retval;
 	}
 	cpus_and(trialcs.cpus_allowed, trialcs.cpus_allowed, cpu_online_map);
-	/* cpus_allowed cannot be empty for a cpuset with attached tasks. */
-	if (cgroup_task_count(cs->css.cgroup) &&
-	    cpus_empty(trialcs.cpus_allowed))
-		return -ENOSPC;
 	retval = validate_change(cs, &trialcs);
 	if (retval < 0)
 		return retval;
@@ -811,29 +817,19 @@ static int update_nodemask(struct cpuset
 	trialcs = *cs;
 
 	/*
-	 * We allow a cpuset's mems_allowed to be empty; if it has attached
-	 * tasks, we'll catch it later when we validate the change and return
-	 * -ENOSPC.
+	 * An empty mems_allowed is ok iff there are no tasks in the cpuset.
+	 * Since nodelist_parse() fails on an empty mask, we special case
+	 * that parsing.  The validate_change() call ensures that cpusets
+	 * with tasks have memory.
 	 */
-	if (!buf[0] || (buf[0] == '\n' && !buf[1])) {
+	buf = strstrip(buf);
+	if (!*buf) {
 		nodes_clear(trialcs.mems_allowed);
 	} else {
 		retval = nodelist_parse(buf, trialcs.mems_allowed);
 		if (retval < 0)
 			goto done;
-		if (!nodes_intersects(trialcs.mems_allowed,
-						node_states[N_HIGH_MEMORY])) {
-			/*
-			 * error if only memoryless nodes specified.
-			 */
-			retval = -ENOSPC;
-			goto done;
-		}
 	}
-	/*
-	 * Exclude memoryless nodes.  We know that trialcs.mems_allowed
-	 * contains at least one node with memory.
-	 */
 	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed,
 						node_states[N_HIGH_MEMORY]);
 	oldmem = cs->mems_allowed;
@@ -841,12 +837,6 @@ static int update_nodemask(struct cpuset
 		retval = 0;		/* Too easy - nothing to do */
 		goto done;
 	}
-	/* mems_allowed cannot be empty for a cpuset with attached tasks. */
-	if (cgroup_task_count(cs->css.cgroup) &&
-	    nodes_empty(trialcs.mems_allowed)) {
-		retval = -ENOSPC;
-		goto done;
-	}
 	retval = validate_change(cs, &trialcs);
 	if (retval < 0)
 		goto done;

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-30 10:44 [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
  2007-09-29 19:21 ` Nick Piggin
  2007-09-30 10:44 ` [PATCH] cpuset decrustify update and validate masks Paul Jackson
@ 2007-09-30 17:33 ` Ingo Molnar
  2007-10-02 20:22 ` Randy Dunlap
  3 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-09-30 17:33 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, nickpiggin, Paul Menage, linux-kernel,
	Dinakar Guniguntala, cpw


* Paul Jackson <pj@sgi.com> wrote:

> Add a new per-cpuset flag called 'sched_load_balance'.
> 
> When enabled in a cpuset (the default value) it tells the kernel 
> scheduler that the scheduler should provide the normal load balancing 
> on the CPUs in that cpuset, sometimes moving tasks from one CPU to a 
> second CPU if the second CPU is less loaded and if that task is 
> allowed to run there.
> 
> When disabled (write "0" to the file) then it tells the kernel 
> scheduler that load balancing is not required for the CPUs in that 
> cpuset.

i like this, this feature would be quite useful for -rt and CPU 
shielding.

( a cpuset is a mandatory container for set_cpus_allowed(), so there is
  a material and app-visible difference between a 4-CPU cpuset that has 
  balancing disabled and 4x 1-CPU cpusets. )

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-29 19:21 ` Nick Piggin
@ 2007-09-30 18:07   ` Paul Jackson
  2007-09-30  3:34     ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-09-30 18:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> The user should just be able to specify exactly the partitioning of
> tasks required, and cpusets should ask the scheduler to do the best
> job of load balancing possible.

If the cpusets which have 'sched_load_balance' enabled are disjoint
(their 'cpus' cpus_allowed masks don't overlap) then you get exactly
what you're asking for.  In that case there is exactly one sched domain
for the 'cpus' allowed by each cpuset that has sched_load_balanced
enabled.

But there is another case in which one does not want what you ask for.

That case involves the situation where one is running a third part
batch scheduler on part of ones big system, and doing other stuff
(perhaps Ingo's realtime stuff) on another part of the system.

In that case, the system admin will be advised to turn off
sched_load_balance on the top cpuset.  But in that case the system
admin will -not- know from moment to moment what jobs the batch
scheduler is running on the cpus assigned to its control.  Only the
batch scheduler knows that.

The batch scheduler is code that was written by someone else, in
some other company, some other time.  That code does not get to
control the overall sched domain partitioning of the entire system.
The batch scheduler gets to say, in affect:

  Here's where I need load balancing to occur, in the normal fashion,
  and here's where I don't need it.

In short, you insisting that only a single administrative point of
control determine the systems sched domains.  Sometimes that fits
the way the system is managed, and my patch lets you do that.  But
sometimes this is a shared responsibility, between a piece of third
party software and the system admin, and my patch allows for that
case as well.

This is a typical sort of situation that arises from having hierarchical
cpuset definitions, and highlights the reason (and the use case,
involving third party batch schedulers) that I went with a hierarchical
cpuset architecture in the first place.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-30  3:34     ` Nick Piggin
@ 2007-10-01  3:42       ` Paul Jackson
  2007-10-02 13:05         ` Nick Piggin
  2007-10-01 18:15       ` Paul Jackson
  1 sibling, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-01  3:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> Moreover, sched_load_balance doesn't really sound like a good name
> for asking for a partition. 

Yup - it's not a good name for asking for a partition.

That's because it isn't asking for a partition.

It's asking for load balancing over the CPUs in the cpuset so marked.

> It's more like you're just asking to have better
> load balancing over that set,

Yup - it's asking for load balancing over that set.  That is why it is
called that.  There's no idea here of better or worse load balancing,
that's an internal kernel scheduler subtlety -- it's just a request that
load balancing be done.

That is what is visible to user space: whether or not tasks get moved
from overloaded CPUs to underloaded, though still allowed, CPUs.

This is visible to user space in two ways:
  1) as task movemement, which may or may not be what is desired, and
  2) as kernel CPU cycles spent, because load balancing costs CPU cycles
     that increase more than linearly with the number of CPUs being
     balanced.

The user doesn't give a hoot what a 'sched domain' is.  They care to
manage (1) whether their tasks might move under a load imbalance, and
(2) how many CPU cycles the kernel spends providing this service.

> You would do this by creating partitioning cpusets which carve up the
> root cpuset (basically -- have multiple roots).

You would do this with the current, single rooted cpuset (and now
cgroup) mechanism by having multiple immediate child cpusets of the
root cpuset, which partition the system CPUs.  There is no need to
invent some bastardized multiple root structure.

> You can't (easily) do this now because you have so many tasks in the
> root cpuset that it is impossible to know whether or not you
> actually want to load balance them.

I don't know what proposal you are reacting to here.  Clearly not this
patch that I have proposed, as it is trivially easy to indicate whether
you want to load balance the root cpuset - by setting or clearing the
'sched_load_balance' flag in the root cpuset. 

How could it possibly get any more direct that that?

> Neither approach is really fundamentally more or less powerful than
> the other, but what I object to in yours is adding these flags which
> don't allow the admin to specify what they want, but to specify how they
> want it done.

My approach doesn't do that - perhaps we aren't communicating.

We are in complete agreement that the admin should specify what they
want, and leave it to the kernel to figure out how to do it.

> Rather than require the admin to know the intricate details about
> how and why the scheduler load balancing gets broken, and when they
> might or might not need to use this flag, they can just specify what they
> want to be done, and the kernel can choose the optimal strategy.

Excellent -- I'm glad you like my approach </sarcasm>

> No, I'm insisting that *no* single administrative point of control
> determines the sched domains. Not directly. The kernel should.
> cpusets API should be rich enough that the kernel can derive tihs
> information from what the admin has intended.

We are in complete agreement in insisting on this.

In short:

    The kernel schedulers dynamic sched domains are --not-- the service
    being provided to the user.  "Sched domains" are just the kernel
    internal mechanism.

    The service being provided is dynamic load balancing of tasks from
    overloaded CPUs to underloaded CPUs.

    Some users will want to disable load balancing on some cpusets, because
    either:
      (1) it's too expensive to balance really large cpusets unless really
	  needed, or
      (2) real time users don't want to waste the CPU cycles doing
	  balancing even on small cpusets.

If you think I repeated everything two or three times above ... good,
you're right - I did.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-30  3:34     ` Nick Piggin
  2007-10-01  3:42       ` Paul Jackson
@ 2007-10-01 18:15       ` Paul Jackson
  2007-10-02 13:35         ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-01 18:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> which you could equally achieve by adding
> a second set of sched domains (and the global domains could keep
> globally balancing).

Hmmm ... this could be the key to this discussion.

Nick - can two sched domains overlap?  And if they do, what does that
mean on any user or application behaviour.

>From the cpuset side - this patch handles overlap by joining the 'cpus'
into one sched domain.  If two cpusets with overlapping 'cpus' are both
marked 'sched_load_balance', then this patch forms a single, combined
sched domain.

As best as I can tell, you and I are actually in agreement in the
case that there is no overlap.  If the several cpusets which have
'sched_load_balance' enabled have mutually disjoint 'cpus' (no
overlap), then my patch forms exactly one sched domain for each such
cpuset, having the same 'cpus'.

The issue is the overlapping cases - are overlapping sched domains
allowed, and if so, how do they affect user space?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-01  3:42       ` Paul Jackson
@ 2007-10-02 13:05         ` Nick Piggin
  2007-10-03  6:58           ` Paul Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 13:05 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Monday 01 October 2007 13:42, Paul Jackson wrote:
> Nick wrote:
> > Moreover, sched_load_balance doesn't really sound like a good name
> > for asking for a partition.
>
> Yup - it's not a good name for asking for a partition.
>
> That's because it isn't asking for a partition.
>
> It's asking for load balancing over the CPUs in the cpuset so marked.

Yeah yeah OK, you turn it off in the parent cpuset of the child cpusets
which you want the partitioning to occur in, and ensure there are no
other overlapping cpusets with that flag turned on in order to create a
hard partition. I don't think this makes the API anynicer.


> > It's more like you're just asking to have better
> > load balancing over that set,
>
> Yup - it's asking for load balancing over that set.  That is why it is
> called that.  There's no idea here of better or worse load balancing,
> that's an internal kernel scheduler subtlety -- it's just a request that
> load balancing be done.

OK, if it prohibits balancing when sched_load_balance is 0, then it is
slightly more useful.


> That is what is visible to user space: whether or not tasks get moved
> from overloaded CPUs to underloaded, though still allowed, CPUs.
>
> This is visible to user space in two ways:
>   1) as task movemement, which may or may not be what is desired, and
>   2) as kernel CPU cycles spent, because load balancing costs CPU cycles
>      that increase more than linearly with the number of CPUs being
>      balanced.
>
> The user doesn't give a hoot what a 'sched domain' is.  They care to
> manage (1) whether their tasks might move under a load imbalance, and
> (2) how many CPU cycles the kernel spends providing this service.

Yeah, but the interface is not very nice. As an interface for hard
partitioning, it doesn't work nicely because it is hierarchical.


> > You would do this by creating partitioning cpusets which carve up the
> > root cpuset (basically -- have multiple roots).
>
> You would do this with the current, single rooted cpuset (and now
> cgroup) mechanism by having multiple immediate child cpusets of the
> root cpuset, which partition the system CPUs.  There is no need to
> invent some bastardized multiple root structure.

What do you mean by bastardized? What's wrong with having a real
(and sane) representation of the requested hard-partitions in the system?


> > You can't (easily) do this now because you have so many tasks in the
> > root cpuset that it is impossible to know whether or not you
> > actually want to load balance them.
>
> I don't know what proposal you are reacting to here.  Clearly not this
> patch that I have proposed, as it is trivially easy to indicate whether
> you want to load balance the root cpuset - by setting or clearing the
> 'sched_load_balance' flag in the root cpuset.

Not your proposal, just the idea to have enough information to be able
to work out a more optimal set of sched-domains automatically. Actually
we can do most of it already automatically, but not hard partitioning.

[snip]

As I said, neither is really semantically more powerful than the other. So
yeah those things are possible to do with your API, but I don't like the API.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-01 18:15       ` Paul Jackson
@ 2007-10-02 13:35         ` Nick Piggin
  2007-10-03  6:22           ` [patch] sched: fix sched-domains partitioning by cpusets Ingo Molnar
  2007-10-03  7:25           ` [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 13:35 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Tuesday 02 October 2007 04:15, Paul Jackson wrote:
> Nick wrote:
> > which you could equally achieve by adding
> > a second set of sched domains (and the global domains could keep
> > globally balancing).
>
> Hmmm ... this could be the key to this discussion.
>
> Nick - can two sched domains overlap?  And if they do, what does that
> mean on any user or application behaviour.

Yes, sched domains can be completely arbitrary, and of course in the
current kernel, parent domains always overlap their children.

A sched domain usually means that the scheduler can move tasks
around among that group of CPUs, given the correct flags (but if
there are no flags, then it would be a superfluous domain and should
get trimmed away I think).

BTW. as far as the sched.c changes in your patch go, I much prefer
the partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85

The caller should manage everything itself, rather than
partition_sched_domains doing half of the memory allocation.


> From the cpuset side - this patch handles overlap by joining the 'cpus'
> into one sched domain.  If two cpusets with overlapping 'cpus' are both
> marked 'sched_load_balance', then this patch forms a single, combined
> sched domain.
>
> As best as I can tell, you and I are actually in agreement in the
> case that there is no overlap.  If the several cpusets which have
> 'sched_load_balance' enabled have mutually disjoint 'cpus' (no
> overlap), then my patch forms exactly one sched domain for each such
> cpuset, having the same 'cpus'.

OK, I don't think your patch actually does the wrong thing
technically (although admittedly your rebuild_sched_domains
isn't something I really applied my poor brain to).

> The issue is the overlapping cases - are overlapping sched domains
> allowed, and if so, how do they affect user space?

For hard partitions, you don't want them of course. And I think
we should come up with a cpusets solution for that first.

Afterwards, overlapping sched domains are allowed and could be
used to make balancing more efficient (rather than any real
affect on userspace). At the moment, the domain builder probably
wouldn't cope very well, though.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  6:56             ` Paul Jackson
@ 2007-10-02 15:46               ` Nick Piggin
  2007-10-03  9:21                 ` Paul Jackson
  2007-10-03  7:20               ` Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 15:46 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Ingo Molnar, akpm, menage, linux-kernel, dino, cpw

On Wednesday 03 October 2007 16:56, Paul Jackson wrote:

> I must NAQ this patch, and I'm surprised to see Nick propose it
> again, as I thought he had already agreed that it didn't suffice.

Sorry for the confusion: I only meant the sched.c part of that
patch, not the full thing.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03  6:58           ` Paul Jackson
@ 2007-10-02 16:09             ` Nick Piggin
  2007-10-03  9:55               ` Paul Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 16:09 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 16:58, Paul Jackson wrote:
> > > Yup - it's asking for load balancing over that set.  That is why it is
> > > called that.  There's no idea here of better or worse load balancing,
> > > that's an internal kernel scheduler subtlety -- it's just a request
> > > that load balancing be done.
> >
> > OK, if it prohibits balancing when sched_load_balance is 0, then it is
> > slightly more useful.
>
> It doesn't prohibit load balancing just because sched_load_balance is 0.
> Only if there are no overlapping cpusets still needing balancing does it
> prohibit balancing when 0.

Yeah, that's what I mean. Important point: prohibits, rather than
better/worse.


> > Yeah, but the interface is not very nice. As an interface for hard
> > partitioning, it doesn't work nicely because it is hierarchical.
>
> Yeah -- cpusets are hierarchical.  And some of the use cases for
> which cpusets are designed are hierarchical.

But partitioning isn't.


> > > > You would do this by creating partitioning cpusets which carve up the
> > > > root cpuset (basically -- have multiple roots).
> > >
> > > You would do this with the current, single rooted cpuset (and now
> > > cgroup) mechanism by having multiple immediate child cpusets of the
> > > root cpuset, which partition the system CPUs.  There is no need to
> > > invent some bastardized multiple root structure.
> >
> > What do you mean by bastardized?
>
> Changing cpusets from single root to multiple roots would be
> bastardizing it.

Well OK, if that's your definition. Not very helpful though.

Can I win this argument by defining sched_load_balance
to crapify it? :)


> My proposed sched_load_balance API is already quite capable of
> representing what you see the need for - hard partitioning.  It is also
> quite capable of representing some other situations, such as I've
> described in other replies, that you don't seem to see the need for.
>
> To repeat myself, in some cases, such as batch schedulers running in a
> subset of the CPUs on a large system, the code that knows some of the
> needs for load balancing does not have system wide control to mandate
> hard partitioning.  The batch scheduler can state where it is depending
> on load balancing being present, and the system administrator can choose
> or not to turn off load balancing in the top cpuset, thereby granting or
> not control over load balancing on the CPUs controlled by the batch
> scheduler to the batch scheduler.

Why isn't that possible with my approach?


> Hard partitioning is not the only use case here.
>
> If you don't appreciate the other cases, then fine ... but I don't think
> that gives you grounds to reject a patch just because it is not precisely
> the ideal, narrowly focused, API for the case you do appreciate.

What happens when you partition the system with your approach, and
you get kernel threads being spawned into the root cpuset and getting
unbalanced?

With my approach, these can get naturally balanced.


> >  What's wrong with having a real
> > (and sane) representation of the requested hard-partitions in the system?
>
> What's wrong with it is that 1) it doesn't cover all the use cases,

OK, you haven't explained why not.


> 2) it would require a new and different mechanism other than cpusets
> which are not multiple rooted, and do robustly support overlapping
> sets and hence are not a hard partitioning, and

Obviously any approach cannot retain the existing kernel API, true.


> 3) we'd still need 
> the cpuset based API to cover the remaining use cases.

I just still don't understand how those cases work... if you can spell it
out for me.


> Good grief -- I must be misunderstanding you here, Nick.  I can't
> imagine that you want to turn cpusets into a multiple rooted hard
> partition mechanism.  If you are, then "bastardized" is the right word.
>
> > Not your proposal, just the idea to have enough information to be able
> > to work out a more optimal set of sched-domains automatically.
>
> I can't figure out what the sentence is saying ... sorry.

You said:
> I don't know what proposal you are reacting to here.  Clearly not this
> patch that I have proposed, as it is trivially easy to indicate whether
> you want to load balance the root cpuset - by setting or clearing the
> 'sched_load_balance' flag in the root cpuset.

The above sentence was explaining which proposal I was talking
about.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03  7:25           ` [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
@ 2007-10-02 16:14             ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 16:14 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 17:25, Paul Jackson wrote:
> Nick wrote:
> > BTW. as far as the sched.c changes in your patch go, I much prefer
> > the partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85
> >
> > The caller should manage everything itself, rather than
> > partition_sched_domains doing half of the memory allocation.
>
> Please take a closer look at my partition_sched_domains() and its
> interface to the scheduler.
>
> You should recognize this API, once you look at it.  It simply passes
> the full flat, hard partition, in its entirety.  This is the
> partitioning that you speak of, I believe.  It's here; just not where
> you expected it.
>
> The portion of the code that is in kernel/sched.c is just a little bit
> of optimization.  It avoids rebuilding all the sched domains and
> reattaching every task to its sched domain; rather it determines which
> sched domains were added or removed and just rebuilds them.
>
> Once you take a closer look, I hope you will agree that this new
> interface between the cpuset and sched code provides a cleaner
> separation.

I don't know what you think I said that is incorrect and requires me to
look at again. I don't like your partition_sched_domains API because
of the allocation thing. So I prefer the existing (or better, the simplified
version in my patch referenced).

The caller should determine which domain to rebuild and reattach.
Simple.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  9:21                 ` Paul Jackson
@ 2007-10-02 17:23                   ` Nick Piggin
  2007-10-03 10:08                     ` Paul Jackson
  2007-10-03  9:35                   ` Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 17:23 UTC (permalink / raw)
  To: Paul Jackson; +Cc: mingo, akpm, menage, linux-kernel, dino, cpw

On Wednesday 03 October 2007 19:21, Paul Jackson wrote:
> Nick wrote:
> > Sorry for the confusion: I only meant the sched.c part of that
> > patch, not the full thing.
>
> Ah - ok.  We're getting closer then.  Good.
>
> Let me be sure I've got this right then.
>
> You prefer the interface from your proposed patch, by which the
> cpuset code passes sched domain requests to the scheduler code a single
> cpumask that will define a sched domain:
>
>     int partition_sched_domains(cpumask_t *partition)
>
> and I am suggesting instead a new and different interface:

Yep.

>
>     void partition_sched_domains(int ndoms_new, cpumask_t *doms_new)
>
> In the first API, one cpumask is passed in, and a single sched
> domain is formed, taking those CPUs from any sched domain they
> might have already been a member of, into this new sched domain.
>
> In the second API, the entire flat partitioning is passed in,
> giving an array of masks, one mask for each desired sched domain.
> The passed in masks do not overlap, but might not cover all CPUs.
>
> Question -- how does one turn off load balancing on some CPUs
> using the first API?
>
>     Does one do this by forming singleton sched domains of one
>     CPU each?  Is there any downside to doing this?

Yes, and no (it does the same thing as your version internally, so
no downside).


>     The simplest cpuset code to work with this would end up exposing
>     this method of disabling load balancing to user space, forcing
>     users to create cpusets with one CPU each to be able do disable
>     load balancing.
>
>     However a little bit of additional kernel cpuset code could hide
>     this detail from user space, by recognizing when the user had
>     asked to turn off load balancing on some larger cpuset, and by
>     then calling partition_sched_domains() multiple times, once for
>     each CPU in that cpuset.

Yeah: do all that in cpusets. It's already information you would have
to derive in order to make it work properly anyway. If you are not
passing in the singleton domains ATM, then they will not get properly
detached and isolated.


> There might be an even simpler way.  If the kernel/sched.c routines
> detach_destroy_domains() and build_sched_domains() were exposed as
> external routines, then the cpuset code could call them directly,
> removing the partition_sched_domains() routine from sched.c entirely.
> Would this be worth persuing?

detach_destroy and build are things that may get reimplemented to
suit different capabilities (eg. we might want to have multiple trees of
domains for each CPU). So I think it is important to expose the simple
partition API which should be unambiguous and stable.

It's not a huge deal, but I'd like to keep partition_sched_domains. After
my patch, it's really simple.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  9:39                     ` Paul Jackson
@ 2007-10-02 17:29                       ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 17:29 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Ingo Molnar, akpm, menage, linux-kernel, dino, cpw

On Wednesday 03 October 2007 19:39, Paul Jackson wrote:
> > in any case i'd like to see the externally visible API get in foremost -
> > and there now seems to be agreement about that. (yay!) Any internal
> > shaping of APIs can be done flexibly between cpusets and the scheduler.
>
> Yup - though Nick and I will have to agree to -some- internal interface
> between the cpuset and sched code, at least for the moment.
>
> At least, if we thrash about on this, we won't be changing the externally
> visible API around.  We'll just continue driving Andrew nuts, not our
> users - that's an improvement.

OK look, I don't want to hold up progress. I do like to ask these questions
and be difficult if I think there might be a better way and/or to make sure
you've thought about all the angles. I'm not volunteering to maintain
cpusets and I'm not as close to the customers who care as you.

Obviously what your patches do here is a lot closer to "the right thing"
than cpus_exclusive. And the worst problem they'll cause is to add
cruft to cpusets.c.

So I'll keep persuing the other subthread for the same reasons, but aside
from implementation nits, I don't know if it is worth holding up a merge.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03  9:55               ` Paul Jackson
@ 2007-10-02 17:56                 ` Nick Piggin
  2007-10-03 11:38                   ` Paul Jackson
  2007-10-03 12:17                   ` Paul Jackson
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 17:56 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 19:55, Paul Jackson wrote:
> > > Yeah -- cpusets are hierarchical.  And some of the use cases for
> > > which cpusets are designed are hierarchical.
> >
> > But partitioning isn't.
>
> Yup.  We've got a square peg and a round hole.  An impedance mismatch.
> That's the root cause of this entire wibbling session, in my view.

Basically, yeah.


> > > To repeat myself, in some cases, such as batch schedulers running in a
> > > subset of the CPUs on a large system, the code that knows some of the
> > > needs for load balancing does not have system wide control to mandate
> > > hard partitioning.  The batch scheduler can state where it is depending
> > > on load balancing being present, and the system administrator can
> > > choose or not to turn off load balancing in the top cpuset, thereby
> > > granting or not control over load balancing on the CPUs controlled by
> > > the batch scheduler to the batch scheduler.
> >
> > Why isn't that possible with my approach?
>
> If I understand your approach to the kernel-to-user interface correctly
> (sometimes I doubt I do) then your approach expected some user space code
> or person or semi-intelligent equivalent to define a flat partition,
> which will then be used to determine the sched domains.
>
> In the batch scheduler case, running on a large shared system used
> perhaps by several departments, no one entity can do that.  One person,
> perhaps the system admin, knows if they want to give complete control
> of some big chunk of CPUs to a batch scheduler.  The batch scheduler,
> written by someone else far away and long ago, knows which jobs are
> actively running on which subsets of the CPUs the batch scheduler is
> using.
>
> There is no single monolithic entity on such systems who knows all and
> can dictate all details of a single, flat, system-wide partitioning.

OK, so I don't exactly understand you either. To make it simple, can
you give a concrete example of a cpuset hierarchy that wouldn't
work?


> The partitioning has to be sythesized from the combined requests of
> several user space entities.  That's ok -- this is bread and butter
> work for cpusets.

OK, so to really do anything different (from a non-partitioned setup),
you would need to set sched_load_balance=0 for the root cpuset?
Suppose you do that to hard partition the machine, what happens to
newly created tasks like kernel threads or things that aren't in a
cpuset?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03 11:38                   ` Paul Jackson
@ 2007-10-02 19:25                     ` Nick Piggin
  2007-10-03 12:14                       ` Paul Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 19:25 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 21:38, Paul Jackson wrote:
> > OK, so to really do anything different (from a non-partitioned setup),
> > you would need to set sched_load_balance=0 for the root cpuset?

> > Suppose you do that to hard partition the machine, what happens to
> > newly created tasks like kernel threads or things that aren't in a
> > cpuset?
>
> Well ... --every-- task is in a cpuset, always.  Newly created tasks
> start in the cpuset of their parent.  Grep for 'the_top_cpuset_hack'
> in kernel/cpuset.c to see the lengths to which we go to ensure that
> current->cpuset always resolves somewhere.

OK, then non-balancing cpuset.


> The usual case on the big systems that I care about the most is
> that we move (almost) every task out of the top cpuset, into smaller
> cpusets, because we don't want some random thread intruding on the
> CPUs dedicated to a particular job.  The only threads left in the root
> cpuset are pinned kernel threads, such as for thread migration, per-cpu
> irq handlers and various per-cpu and per-node disk and file flushers
> and such.  These threads aren't going anywhere, regardless.  But no
> thread that is willing to run anywhere is left free to run anywhere.

These are what I'm worried about, and things like kswapd, pdflush,
could definitely use a huge amount of CPU.

If you are interested in hard partitioning the system, you most
definitely want these things to be balanced across the non-isolated
CPUs.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03 12:14                       ` Paul Jackson
@ 2007-10-02 19:53                         ` Nick Piggin
  2007-10-03 12:41                           ` Paul Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 19:53 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 22:14, Paul Jackson wrote:
> > These are what I'm worried about, and things like kswapd, pdflush,
> > could definitely use a huge amount of CPU.
> >
> > If you are interested in hard partitioning the system, you most
> > definitely want these things to be balanced across the non-isolated
> > CPUs.
>
> But these guys are pinned anyway (or else they would already be moved
> into a smaller load balanced cpuset), so why waste time load balancing
> what can't move?

They're not pinned (kswapds are pinned to a node, but still). pdflush
is not pinned at all and can be dynamically created and destroyed. Ditto
for kjournald, as well as many others.

Basically: it doesn't feel like a satisfactory solution to brush these under
the carpet.


> And on some of the systems I care about, we don't want to load balance
> these guys; rather we go to great lengths to see that they don't run at
> all when we don't want them to.

Most smaller realtime partitioned systems will want to, I'd expect.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-09-30 10:44 [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
                   ` (2 preceding siblings ...)
  2007-09-30 17:33 ` [PATCH] cpuset and sched domains: sched_load_balance flag Ingo Molnar
@ 2007-10-02 20:22 ` Randy Dunlap
  2007-10-02 20:57   ` Paul Jackson
  3 siblings, 1 reply; 38+ messages in thread
From: Randy Dunlap @ 2007-10-02 20:22 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Andrew Morton, nickpiggin, Paul Menage, linux-kernel,
	Dinakar Guniguntala, cpw, Ingo Molnar

On Sun, 30 Sep 2007 03:44:03 -0700 Paul Jackson wrote:

> From: Paul Jackson <pj@sgi.com>
> 
...
> 
> Acked-by: Paul Jackson <pj@sgi.com>

Are there some attributions missing, else S-O-B ?

> ---
> 
> Andrew - this patch goes right after your *-mm patch:
>   task-containers-enable-containers-by-default-in-some-configs.patch
> and before "add-containerstats-v3.patch"
> 
>  Documentation/cpusets.txt |  141 +++++++++++++++++++++++++
>  include/linux/sched.h     |    2 
>  kernel/cpuset.c           |  254 +++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched.c            |   72 ++++++++++---
>  4 files changed, 450 insertions(+), 19 deletions(-)

> --- 2.6.23-rc8-mm1.orig/kernel/cpuset.c	2007-09-29 23:56:40.987962675 -0700
> +++ 2.6.23-rc8-mm1/kernel/cpuset.c	2007-09-29 23:57:51.148979999 -0700

>  /*
> + * Helper routine for rebuild_sched_domains().
> + * Do cpusets a, b have overlapping cpus_allowed masks?
> + */
> +
> +static int cpusets_overlap(struct cpuset *a, struct cpuset *b)

inline ?

> +{
> +	return cpus_intersects(a->cpus_allowed, b->cpus_allowed);
> +}
> +
...
> +
> +static void rebuild_sched_domains(void)
> +{
> +	struct kfifo *q;	/* queue of cpusets to be scanned */
> +	struct cpuset *cp;	/* scans q */
> +	struct cpuset **csa;	/* array of all cpuset ptrs */
> +	int csn;		/* how many cpuset ptrs in csa so far */
> +	int i, j, k;		/* indices for partition finding loops */
> +	cpumask_t *doms;	/* resulting partition; i.e. sched domains */
> +	int ndoms;		/* number of sched domains in result */
> +	int nslot;		/* next empty doms[] cpumask_t slot */
> +
> +	q = NULL; csa = NULL; doms = NULL;

That's not kernel style.  Use either (Andrew would say the second one):

	q = csa = doms = NULL;

or
	q = NULL;
	csa = NULL;
	doms = NULL;

> +
> +	/* Special case for the 99% of systems with one, full, sched domain */
> +	if (is_sched_load_balance(&top_cpuset)) {
> +		ndoms = 1;
> +		doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
> +		*doms = top_cpuset.cpus_allowed;
> +		goto rebuild;
> +	}
> +
...

> +
> +rebuild:
> +	/* Have scheduler rebuild sched domains */
> +	lock_cpu_hotplug();
> +	partition_sched_domains(ndoms, doms);
> +	unlock_cpu_hotplug();
> +
> +done:
> +	if (q && !IS_ERR(q))
> +		kfree(q);
> +	if (csa)

Don't need the conditional: kfree(NULL) is OK.

> +		kfree(csa);
> +	/* Don't kfree(doms) -- partition_sched_domains() does that. */
> +}
> +
> +/*
>   * Call with manage_mutex held.  May take callback_mutex during call.
>   */

---
~Randy

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03 12:41                           ` Paul Jackson
@ 2007-10-02 20:30                             ` Nick Piggin
  2007-10-03 17:46                               ` Paul Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 20:30 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 22:41, Paul Jackson wrote:
> > pdflush
> > is not pinned at all and can be dynamically created and destroyed. Ditto
> > for kjournald, as well as many others.
>
> Whatever is not pinned is moved out of the top cpuset, on the kind of
> systems I'm most familiar with.  They are put in a smaller cpuset, with
> load balancing, that is sized for the workload they might present, but
> kept separate from the main jobs.

So if a new pdflush is spawned, it get's moved to some cpuset? That
probably isn't something these realtime systems want to do (ie. the
non-realtime portion probably doesn't want to have any sort of scheduler
or even worry about cpusets at all).


> > Basically: it doesn't feel like a satisfactory solution to brush
> > these under the carpet.
>
> We don't do a whole lot of brushing under the carpet on these kind of
> systems.  If I gave you the impression we do, then I misled you - sorry.

No, not on your systems. I'm worried about the smaller ones that don't
get so much attention (eg. hard partitioning for realtime).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-03 12:17                   ` Paul Jackson
@ 2007-10-02 20:31                     ` Nick Piggin
  2007-10-03 17:44                       ` Paul Jackson
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2007-10-02 20:31 UTC (permalink / raw)
  To: Paul Jackson; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

On Wednesday 03 October 2007 22:17, Paul Jackson wrote:
> Nick wrote:
> > OK, so I don't exactly understand you either. To make it simple, can
> > you give a concrete example of a cpuset hierarchy that wouldn't
> > work?
>
> It's more a matter of knowing how my third party batch scheduler
> coders think.  They will be off in some corner of their code with a
> cpuset in hand that they know is just being used to hold inactive
> (paused) tasks, and they can likely be persuaded to mark those cpusets
> as not being in need of any wasted CPU cycles load balancing them.

There won't be any CPU cycles used, if the tasks are paused (surely
they're not spin waiting).


> But these inactive cpusets will overlap in unknown (to them at
> the time, in that piece of code) ways with other cpusets holding
> active jobs, and there is no chance, unless it is a matter of major
> performance impact, that they will be in any position to comment on
> the proper partitioning of the sched domains on all the CPUs under the
> control of their batch scheduler, much less comment on the partitioning
> of the rest of the system.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 20:22 ` Randy Dunlap
@ 2007-10-02 20:57   ` Paul Jackson
  0 siblings, 0 replies; 38+ messages in thread
From: Paul Jackson @ 2007-10-02 20:57 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: akpm, nickpiggin, menage, linux-kernel, dino, cpw, mingo

Thanks for the review, Randy.  Good comments.

> > Acked-by: Paul Jackson <pj@sgi.com>
> 
> Are there some attributions missing, else S-O-B ?

Yup - I should have written this line as:

	Signed-off-by: Paul Jackson <pj@sgi.com>

> > +static int cpusets_overlap(struct cpuset *a, struct cpuset *b)
> 
> inline ?

It makes no difference to the code generated.  I tend to leave
out 'compiler optimization' hint words if I don't need them to
get the compiler to optimize.  In this case, of a single use
file static routine, the compiler inlines anyway.

> > +	q = NULL; csa = NULL; doms = NULL;
> 
> That's not kernel style.  Use either (Andrew would say the second one):
> 
> 	q = csa = doms = NULL;
> 
> or
> 	q = NULL;
> 	csa = NULL;
> 	doms = NULL;

You're right - and Andrew would be right as well, since the form:

	q = csa = doms = NULL;

generates a compiler warning, as not all three pointers are the
same type.

So three lines of code it must be.

> > +	if (q && !IS_ERR(q))
> > +		kfree(q);
> > +	if (csa)
> 
> Don't need the conditional: kfree(NULL) is OK.

Yup - you're right - about the 'csa' check.

However the if(q ...) check is needed, because I have another bug
here.  I allocated 'q' using kfifo_alloc(), so must free using
kfifo_free (or else leak the kfifo buffer memory.)  Calls to
kfifo_free() have to guard against NULL pointers before the call.

Thanks, Randy!

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-02 13:35         ` Nick Piggin
@ 2007-10-03  6:22           ` Ingo Molnar
  2007-10-03  6:56             ` Paul Jackson
  2007-10-03  7:25           ` [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
  1 sibling, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-10-03  6:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Paul Jackson, akpm, menage, linux-kernel, dino, cpw


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> BTW. as far as the sched.c changes in your patch go, I much prefer the 
> partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85
> 
> The caller should manage everything itself, rather than 
> partition_sched_domains doing half of the memory allocation.

i've merged your patch to my scheduler queue - see the patch below. (And 
could you send me your SoB line too?) Paul, if we went with the patch 
below, what else would be needed for your purposes?

	Ingo

--------------------------------->
Subject: sched: fix sched-domains partitioning by cpusets
From: Nick Piggin <nickpiggin@yahoo.com.au>

Fix sched-domains partitioning by cpusets. Walk the whole cpusets tree after
something interesting changes, and recreate all partitions.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/cpuset.h |    2 
 include/linux/sched.h  |    3 -
 kernel/cpuset.c        |  109 ++++++++++++++++++++++---------------------------
 kernel/sched.c         |   31 +++++++------
 4 files changed, 70 insertions(+), 75 deletions(-)

Index: linux/include/linux/cpuset.h
===================================================================
--- linux.orig/include/linux/cpuset.h
+++ linux/include/linux/cpuset.h
@@ -14,6 +14,8 @@
 
 #ifdef CONFIG_CPUSETS
 
+extern int cpuset_hotplug_update_sched_domains(void);
+
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 
 extern int cpuset_init_early(void);
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -798,8 +798,7 @@ struct sched_domain {
 #endif
 };
 
-extern int partition_sched_domains(cpumask_t *partition1,
-				    cpumask_t *partition2);
+extern int partition_sched_domains(cpumask_t *partition);
 
 #endif	/* CONFIG_SMP */
 
Index: linux/kernel/cpuset.c
===================================================================
--- linux.orig/kernel/cpuset.c
+++ linux/kernel/cpuset.c
@@ -752,6 +752,24 @@ static int validate_change(const struct 
 	return 0;
 }
 
+static void update_cpu_domains_children(struct cpuset *par,
+					cpumask_t *non_partitioned)
+{
+	struct cpuset *c;
+
+	list_for_each_entry(c, &par->children, sibling) {
+		if (cpus_empty(c->cpus_allowed))
+			continue;
+		if (is_cpu_exclusive(c)) {
+			if (!partition_sched_domains(&c->cpus_allowed)) {
+				cpus_andnot(*non_partitioned,
+					*non_partitioned, c->cpus_allowed);
+			}
+		} else
+			update_cpu_domains_children(c, non_partitioned);
+	}
+}
+
 /*
  * For a given cpuset cur, partition the system as follows
  * a. All cpus in the parent cpuset's cpus_allowed that are not part of any
@@ -761,53 +779,38 @@ static int validate_change(const struct 
  * Build these two partitions by calling partition_sched_domains
  *
  * Call with manage_mutex held.  May nest a call to the
- * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.
- * Must not be called holding callback_mutex, because we must
- * not call lock_cpu_hotplug() while holding callback_mutex.
+ * lock_cpu_hotplug()/unlock_cpu_hotplug() pair.  Must not be called holding
+ * callback_mutex, because we must not call lock_cpu_hotplug() while holding
+ * callback_mutex.
  */
 
-static void update_cpu_domains(struct cpuset *cur)
+static void update_cpu_domains(void)
 {
-	struct cpuset *c, *par = cur->parent;
-	cpumask_t pspan, cspan;
-
-	if (par == NULL || cpus_empty(cur->cpus_allowed))
-		return;
+	cpumask_t non_partitioned;
 
-	/*
-	 * Get all cpus from parent's cpus_allowed not part of exclusive
-	 * children
-	 */
-	pspan = par->cpus_allowed;
-	list_for_each_entry(c, &par->children, sibling) {
-		if (is_cpu_exclusive(c))
-			cpus_andnot(pspan, pspan, c->cpus_allowed);
-	}
-	if (!is_cpu_exclusive(cur)) {
-		cpus_or(pspan, pspan, cur->cpus_allowed);
-		if (cpus_equal(pspan, cur->cpus_allowed))
-			return;
-		cspan = CPU_MASK_NONE;
-	} else {
-		if (cpus_empty(pspan))
-			return;
-		cspan = cur->cpus_allowed;
-		/*
-		 * Get all cpus from current cpuset's cpus_allowed not part
-		 * of exclusive children
-		 */
-		list_for_each_entry(c, &cur->children, sibling) {
-			if (is_cpu_exclusive(c))
-				cpus_andnot(cspan, cspan, c->cpus_allowed);
-		}
-	}
+	BUG_ON(!mutex_is_locked(&manage_mutex));
 
 	lock_cpu_hotplug();
-	partition_sched_domains(&pspan, &cspan);
+	non_partitioned = top_cpuset.cpus_allowed;
+	update_cpu_domains_children(&top_cpuset, &non_partitioned);
+	partition_sched_domains(&non_partitioned);
 	unlock_cpu_hotplug();
 }
 
 /*
+ * Same as above except called with lock_cpu_hotplug and without manage_mutex.
+ */
+
+int cpuset_hotplug_update_sched_domains(void)
+{
+	cpumask_t non_partitioned;
+
+	non_partitioned = top_cpuset.cpus_allowed;
+	update_cpu_domains_children(&top_cpuset, &non_partitioned);
+	return partition_sched_domains(&non_partitioned);
+}
+
+/*
  * Call with manage_mutex held.  May take callback_mutex during call.
  */
 
@@ -845,8 +848,8 @@ static int update_cpumask(struct cpuset 
 	mutex_lock(&callback_mutex);
 	cs->cpus_allowed = trialcs.cpus_allowed;
 	mutex_unlock(&callback_mutex);
-	if (is_cpu_exclusive(cs) && !cpus_unchanged)
-		update_cpu_domains(cs);
+	if (!cpus_unchanged)
+		update_cpu_domains();
 	return 0;
 }
 
@@ -1087,7 +1090,7 @@ static int update_flag(cpuset_flagbits_t
 	mutex_unlock(&callback_mutex);
 
 	if (cpu_exclusive_changed)
-                update_cpu_domains(cs);
+                update_cpu_domains();
 	return 0;
 }
 
@@ -1947,19 +1950,9 @@ static int cpuset_mkdir(struct inode *di
 	return cpuset_create(c_parent, dentry->d_name.name, mode | S_IFDIR);
 }
 
-/*
- * Locking note on the strange update_flag() call below:
- *
- * If the cpuset being removed is marked cpu_exclusive, then simulate
- * turning cpu_exclusive off, which will call update_cpu_domains().
- * The lock_cpu_hotplug() call in update_cpu_domains() must not be
- * made while holding callback_mutex.  Elsewhere the kernel nests
- * callback_mutex inside lock_cpu_hotplug() calls.  So the reverse
- * nesting would risk an ABBA deadlock.
- */
-
 static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry)
 {
+	int is_exclusive;
 	struct cpuset *cs = dentry->d_fsdata;
 	struct dentry *d;
 	struct cpuset *parent;
@@ -1977,13 +1970,8 @@ static int cpuset_rmdir(struct inode *un
 		mutex_unlock(&manage_mutex);
 		return -EBUSY;
 	}
-	if (is_cpu_exclusive(cs)) {
-		int retval = update_flag(CS_CPU_EXCLUSIVE, cs, "0");
-		if (retval < 0) {
-			mutex_unlock(&manage_mutex);
-			return retval;
-		}
-	}
+	is_exclusive = is_cpu_exclusive(cs);
+
 	parent = cs->parent;
 	mutex_lock(&callback_mutex);
 	set_bit(CS_REMOVED, &cs->flags);
@@ -1998,8 +1986,13 @@ static int cpuset_rmdir(struct inode *un
 	mutex_unlock(&callback_mutex);
 	if (list_empty(&parent->children))
 		check_for_release(parent, &pathbuf);
+
+	if (is_exclusive)
+		update_cpu_domains();
+
 	mutex_unlock(&manage_mutex);
 	cpuset_release_agent(pathbuf);
+
 	return 0;
 }
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -6274,6 +6274,9 @@ error:
  */
 static int arch_init_sched_domains(const cpumask_t *cpu_map)
 {
+#ifdef CONFIG_CPUSETS
+	return cpuset_hotplug_update_sched_domains();
+#else
 	cpumask_t cpu_default_map;
 	int err;
 
@@ -6287,6 +6290,7 @@ static int arch_init_sched_domains(const
 	err = build_sched_domains(&cpu_default_map);
 
 	return err;
+#endif
 }
 
 static void arch_destroy_sched_domains(const cpumask_t *cpu_map)
@@ -6310,29 +6314,26 @@ static void detach_destroy_domains(const
 
 /*
  * Partition sched domains as specified by the cpumasks below.
- * This attaches all cpus from the cpumasks to the NULL domain,
+ * This attaches all cpus from the partition to the NULL domain,
  * waits for a RCU quiescent period, recalculates sched
- * domain information and then attaches them back to the
- * correct sched domains
- * Call with hotplug lock held
+ * domain information and then attaches them back to their own
+ * isolated partition.
+ *
+ * Called with hotplug lock held
+ *
+ * Returns 0 on success.
  */
-int partition_sched_domains(cpumask_t *partition1, cpumask_t *partition2)
+int partition_sched_domains(cpumask_t *partition)
 {
+	cpumask_t non_isolated_cpus;
 	cpumask_t change_map;
-	int err = 0;
 
-	cpus_and(*partition1, *partition1, cpu_online_map);
-	cpus_and(*partition2, *partition2, cpu_online_map);
-	cpus_or(change_map, *partition1, *partition2);
+	cpus_andnot(non_isolated_cpus, cpu_online_map, cpu_isolated_map);
+	cpus_and(change_map, *partition, non_isolated_cpus);
 
 	/* Detach sched domains from all of the affected cpus */
 	detach_destroy_domains(&change_map);
-	if (!cpus_empty(*partition1))
-		err = build_sched_domains(partition1);
-	if (!err && !cpus_empty(*partition2))
-		err = build_sched_domains(partition2);
-
-	return err;
+	return build_sched_domains(&change_map);
 }
 
 #if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  6:22           ` [patch] sched: fix sched-domains partitioning by cpusets Ingo Molnar
@ 2007-10-03  6:56             ` Paul Jackson
  2007-10-02 15:46               ` Nick Piggin
  2007-10-03  7:20               ` Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Paul Jackson @ 2007-10-03  6:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, akpm, menage, linux-kernel, dino, cpw

Ingo wrote:
> i've merged your patch to my scheduler queue - see the patch below. (And 
> could you send me your SoB line too?) Paul, if we went with the patch 
> below, what else would be needed for your purposes?

Nick and I already resolved that, when he first posted this patch
in October of 2006.  The cpu_exclusive flag doesn't work for this.

Here's a copy of the key message, from Nick, near the end of that
thread in which he earlier proposed this patch, also available at:
http://lkml.org/lkml/2006/10/21/12

====================================================
    Paul Jackson wrote:
    > Nick wrote:
    > 
    >>Or, another question, how does my patch hijack cpus_allowed? In
    >>what way does it change the semantics of cpus_allowed?
    > 
    > 
    > It limits load balancing for tasks in cpusets containing
    > a superset of that cpusets cpus.
    > 
    > There are always such cpusets - the top cpuset if no other.

    Ah OK, and there is my misunderstanding with cpusets. From the
    documentation it appears as though cpu_exclusive cpusets are
    made in order to do the partitioning thing.

    If you always have other domains overlapping them (regardless
    that it is a parent), then what actual use does cpu_exclusive
    flag have?
====================================================

A couple messages later in that thread, Nick wrote:
> But even the way cpu_exclusive semantics are defined makes it not
> quite compatible with partitioning anyway, unfortunately.

I agree with Nick on this conclusion, and with his other conclusion
that the 'cpu_exclusive' flag is pretty near useless.

Some per-cpuset flag other the 'cpu_exclusive' is required to
control sched domains from cpusets.

This has specific impact on one of the key users of cpusets, the
various developers of batch schedulers.  One by one, they have
determined that the cpu_exclusive flag is incompatible with the
way they set up cpusets, and have decided they should not enable
that flag on any cpuset under their control.  It gets in their way,
and serves no useful purpose for them.  However we need someway
for them to specify where they need load balancing, so that on
large systems, they can allow the admin to avoid the cost of load
balancing over the batch schedulers entire subset of the system at
once, but rather just load balance over the smaller sets where the
batch scheduler has active jobs running that might depend on load
balancing.

Batch schedulers need to be able to specify where they need load
balancing and where they don't, and they can't use the 'cpu_exclusive'
flag.  The defining characteristic of 'cpu_exclusive' is no overlap of
CPUs with sibling cpusets.  That is incompatible with their needs.

Therefore, they need a different flag.

I must NAQ this patch, and I'm surprised to see Nick propose it
again, as I thought he had already agreed that it didn't suffice.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 13:05         ` Nick Piggin
@ 2007-10-03  6:58           ` Paul Jackson
  2007-10-02 16:09             ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03  6:58 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

> > Yup - it's asking for load balancing over that set.  That is why it is
> > called that.  There's no idea here of better or worse load balancing,
> > that's an internal kernel scheduler subtlety -- it's just a request that
> > load balancing be done.
> 
> OK, if it prohibits balancing when sched_load_balance is 0, then it is
> slightly more useful.

It doesn't prohibit load balancing just because sched_load_balance is 0.
Only if there are no overlapping cpusets still needing balancing does it
prohibit balancing when 0.

> Yeah, but the interface is not very nice. As an interface for hard
> partitioning, it doesn't work nicely because it is hierarchical.

Yeah -- cpusets are hierarchical.  And some of the use cases for
which cpusets are designed are hierarchical.

> > > You would do this by creating partitioning cpusets which carve up the
> > > root cpuset (basically -- have multiple roots).
> >
> > You would do this with the current, single rooted cpuset (and now
> > cgroup) mechanism by having multiple immediate child cpusets of the
> > root cpuset, which partition the system CPUs.  There is no need to
> > invent some bastardized multiple root structure.
> 
> What do you mean by bastardized?

Changing cpusets from single root to multiple roots would be
bastardizing it.

My proposed sched_load_balance API is already quite capable of
representing what you see the need for - hard partitioning.  It is also
quite capable of representing some other situations, such as I've
described in other replies, that you don't seem to see the need for.

To repeat myself, in some cases, such as batch schedulers running in a
subset of the CPUs on a large system, the code that knows some of the
needs for load balancing does not have system wide control to mandate
hard partitioning.  The batch scheduler can state where it is depending
on load balancing being present, and the system administrator can choose
or not to turn off load balancing in the top cpuset, thereby granting or
not control over load balancing on the CPUs controlled by the batch
scheduler to the batch scheduler.

Hard partitioning is not the only use case here.

If you don't appreciate the other cases, then fine ... but I don't think
that gives you grounds to reject a patch just because it is not precisely
the ideal, narrowly focused, API for the case you do appreciate.

>  What's wrong with having a real
> (and sane) representation of the requested hard-partitions in the system?

What's wrong with it is that 1) it doesn't cover all the use cases,
2) it would require a new and different mechanism other than cpusets
which are not multiple rooted, and do robustly support overlapping
sets and hence are not a hard partitioning, and 3) we'd still need
the cpuset based API to cover the remaining use cases.

Good grief -- I must be misunderstanding you here, Nick.  I can't
imagine that you want to turn cpusets into a multiple rooted hard
partition mechanism.  If you are, then "bastardized" is the right word.

> Not your proposal, just the idea to have enough information to be able
> to work out a more optimal set of sched-domains automatically.

I can't figure out what the sentence is saying ... sorry.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  6:56             ` Paul Jackson
  2007-10-02 15:46               ` Nick Piggin
@ 2007-10-03  7:20               ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2007-10-03  7:20 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, akpm, menage, linux-kernel, dino, cpw


* Paul Jackson <pj@sgi.com> wrote:

> Batch schedulers need to be able to specify where they need load 
> balancing and where they don't, and they can't use the 'cpu_exclusive' 
> flag.  The defining characteristic of 'cpu_exclusive' is no overlap of 
> CPUs with sibling cpusets.  That is incompatible with their needs.
> 
> Therefore, they need a different flag.
> 
> I must NAQ this patch, and I'm surprised to see Nick propose it again, 
> as I thought he had already agreed that it didn't suffice.

ok. Then lets go back to the original plan: your two patches and the new 
flag. Nick?

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 13:35         ` Nick Piggin
  2007-10-03  6:22           ` [patch] sched: fix sched-domains partitioning by cpusets Ingo Molnar
@ 2007-10-03  7:25           ` Paul Jackson
  2007-10-02 16:14             ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03  7:25 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> BTW. as far as the sched.c changes in your patch go, I much prefer
> the partition_sched_domains API: http://lkml.org/lkml/2006/10/19/85
> 
> The caller should manage everything itself, rather than
> partition_sched_domains doing half of the memory allocation.

Please take a closer look at my partition_sched_domains() and its
interface to the scheduler.

You should recognize this API, once you look at it.  It simply passes
the full flat, hard partition, in its entirety.  This is the
partitioning that you speak of, I believe.  It's here; just not where
you expected it.

The portion of the code that is in kernel/sched.c is just a little bit
of optimization.  It avoids rebuilding all the sched domains and
reattaching every task to its sched domain; rather it determines which
sched domains were added or removed and just rebuilds them.

Once you take a closer look, I hope you will agree that this new
interface between the cpuset and sched code provides a cleaner
separation.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-02 15:46               ` Nick Piggin
@ 2007-10-03  9:21                 ` Paul Jackson
  2007-10-02 17:23                   ` Nick Piggin
  2007-10-03  9:35                   ` Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Paul Jackson @ 2007-10-03  9:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: mingo, akpm, menage, linux-kernel, dino, cpw

Nick wrote:
> Sorry for the confusion: I only meant the sched.c part of that
> patch, not the full thing.

Ah - ok.  We're getting closer then.  Good.

Let me be sure I've got this right then.

You prefer the interface from your proposed patch, by which the
cpuset code passes sched domain requests to the scheduler code a single
cpumask that will define a sched domain:

    int partition_sched_domains(cpumask_t *partition)

and I am suggesting instead a new and different interface:

    void partition_sched_domains(int ndoms_new, cpumask_t *doms_new)

In the first API, one cpumask is passed in, and a single sched
domain is formed, taking those CPUs from any sched domain they
might have already been a member of, into this new sched domain.

In the second API, the entire flat partitioning is passed in,
giving an array of masks, one mask for each desired sched domain.
The passed in masks do not overlap, but might not cover all CPUs.

Question -- how does one turn off load balancing on some CPUs
using the first API?

    Does one do this by forming singleton sched domains of one
    CPU each?  Is there any downside to doing this?

    The simplest cpuset code to work with this would end up exposing
    this method of disabling load balancing to user space, forcing
    users to create cpusets with one CPU each to be able do disable
    load balancing.

    However a little bit of additional kernel cpuset code could hide
    this detail from user space, by recognizing when the user had
    asked to turn off load balancing on some larger cpuset, and by
    then calling partition_sched_domains() multiple times, once for
    each CPU in that cpuset.

There might be an even simpler way.  If the kernel/sched.c routines
detach_destroy_domains() and build_sched_domains() were exposed as
external routines, then the cpuset code could call them directly,
removing the partition_sched_domains() routine from sched.c entirely.
Would this be worth persuing?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  9:21                 ` Paul Jackson
  2007-10-02 17:23                   ` Nick Piggin
@ 2007-10-03  9:35                   ` Ingo Molnar
  2007-10-03  9:39                     ` Paul Jackson
  1 sibling, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2007-10-03  9:35 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Nick Piggin, akpm, menage, linux-kernel, dino, cpw


* Paul Jackson <pj@sgi.com> wrote:

> There might be an even simpler way.  If the kernel/sched.c routines 
> detach_destroy_domains() and build_sched_domains() were exposed as 
> external routines, then the cpuset code could call them directly, 
> removing the partition_sched_domains() routine from sched.c entirely. 
> Would this be worth persuing?

in any case i'd like to see the externally visible API get in foremost - 
and there now seems to be agreement about that. (yay!) Any internal 
shaping of APIs can be done flexibly between cpusets and the scheduler.

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-03  9:35                   ` Ingo Molnar
@ 2007-10-03  9:39                     ` Paul Jackson
  2007-10-02 17:29                       ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03  9:39 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: nickpiggin, akpm, menage, linux-kernel, dino, cpw

> in any case i'd like to see the externally visible API get in foremost - 
> and there now seems to be agreement about that. (yay!) Any internal 
> shaping of APIs can be done flexibly between cpusets and the scheduler.

Yup - though Nick and I will have to agree to -some- internal interface
between the cpuset and sched code, at least for the moment.

At least, if we thrash about on this, we won't be changing the externally
visible API around.  We'll just continue driving Andrew nuts, not our
users - that's an improvement.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 16:09             ` Nick Piggin
@ 2007-10-03  9:55               ` Paul Jackson
  2007-10-02 17:56                 ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03  9:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

> > Yeah -- cpusets are hierarchical.  And some of the use cases for
> > which cpusets are designed are hierarchical.
> 
> But partitioning isn't.

Yup.  We've got a square peg and a round hole.  An impedance mismatch.
That's the root cause of this entire wibbling session, in my view.

The essential role of cpusets, cgroups and much other such work of
recent, in my view, is pounding this square peg into that round hole.
In essence, it is fitting the hierarchical structure of the
organizations (corporations, universities and governments) who own big
systems to the flat, system-wide mandates needed to manage a given
computer system.

> > Changing cpusets from single root to multiple roots would be
> > bastardizing it.
> 
> Well OK, if that's your definition. Not very helpful though.

Well, such a change would be rather substantial and undesired,
if those terms help you more.

> > To repeat myself, in some cases, such as batch schedulers running in a
> > subset of the CPUs on a large system, the code that knows some of the
> > needs for load balancing does not have system wide control to mandate
> > hard partitioning.  The batch scheduler can state where it is depending
> > on load balancing being present, and the system administrator can choose
> > or not to turn off load balancing in the top cpuset, thereby granting or
> > not control over load balancing on the CPUs controlled by the batch
> > scheduler to the batch scheduler.
> 
> Why isn't that possible with my approach?

If I understand your approach to the kernel-to-user interface correctly
(sometimes I doubt I do) then your approach expected some user space code
or person or semi-intelligent equivalent to define a flat partition,
which will then be used to determine the sched domains.

In the batch scheduler case, running on a large shared system used
perhaps by several departments, no one entity can do that.  One person,
perhaps the system admin, knows if they want to give complete control
of some big chunk of CPUs to a batch scheduler.  The batch scheduler,
written by someone else far away and long ago, knows which jobs are
actively running on which subsets of the CPUs the batch scheduler is
using.

There is no single monolithic entity on such systems who knows all and
can dictate all details of a single, flat, system-wide partitioning.

The partitioning has to be sythesized from the combined requests of
several user space entities.  That's ok -- this is bread and butter
work for cpusets.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [patch] sched: fix sched-domains partitioning by cpusets
  2007-10-02 17:23                   ` Nick Piggin
@ 2007-10-03 10:08                     ` Paul Jackson
  0 siblings, 0 replies; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 10:08 UTC (permalink / raw)
  To: Nick Piggin; +Cc: mingo, akpm, menage, linux-kernel, dino, cpw

Nick, responding to pj, wrote:
> >     However a little bit of additional kernel cpuset code could hide
> >     this detail from user space, by recognizing when the user had
> >     asked to turn off load balancing on some larger cpuset, and by
> >     then calling partition_sched_domains() multiple times, once for
> >     each CPU in that cpuset.
> 
> Yeah: do all that in cpusets. It's already information you would have
> to derive in order to make it work properly anyway. If you are not
> passing in the singleton domains ATM, then they will not get properly
> detached and isolated.

ok

> It's not a huge deal, but I'd like to keep partition_sched_domains. After
> my patch, it's really simple.

ok

It's a deal.

I've got a couple of brown paper bag bug fixes almost ready to send
out, for the patch I sent Andrew a few days ago:

  cpuset and sched domains: sched_load_balance flag

I'll send these in, and then get some sleep and code up these changes
to the partition_sched_domains, along the lines you have recommended.

Thanks, Nick and Ingo.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 17:56                 ` Nick Piggin
@ 2007-10-03 11:38                   ` Paul Jackson
  2007-10-02 19:25                     ` Nick Piggin
  2007-10-03 12:17                   ` Paul Jackson
  1 sibling, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 11:38 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

> OK, so to really do anything different (from a non-partitioned setup),
> you would need to set sched_load_balance=0 for the root cpuset?

Yup - exactly.  In fact one code fragment in my patch highlights this:

        /* Special case for the 99% of systems with one, full, sched domain */
        if (is_sched_load_balance(&top_cpuset)) {
                ndoms = 1;
                doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
                *doms = top_cpuset.cpus_allowed;
                goto rebuild;
        }

This code says: if the top cpuset is load balanced, you've got one
big fat sched domain covering all (nonisolated) CPUs - end of story.
None of the other 'sched_load_balance' flags matter in this case.

Logically, the above code fragment is not needed.  Without it, the
code would still do the same thing, just wasting more CPU cycles doing
it.

> Suppose you do that to hard partition the machine, what happens to
> newly created tasks like kernel threads or things that aren't in a
> cpuset?

Well ... --every-- task is in a cpuset, always.  Newly created tasks
start in the cpuset of their parent.  Grep for 'the_top_cpuset_hack'
in kernel/cpuset.c to see the lengths to which we go to ensure that
current->cpuset always resolves somewhere.

The usual case on the big systems that I care about the most is
that we move (almost) every task out of the top cpuset, into smaller
cpusets, because we don't want some random thread intruding on the
CPUs dedicated to a particular job.  The only threads left in the root
cpuset are pinned kernel threads, such as for thread migration, per-cpu
irq handlers and various per-cpu and per-node disk and file flushers
and such.  These threads aren't going anywhere, regardless.  But no
thread that is willing to run anywhere is left free to run anywhere.

I will advise my third party batch scheduler developers to turn off
sched_load_balance on their main cpuset, and on any big "holding tank"
cpusets they have which hold only inactive jobs.  This way, on big
systems that are managed to optimize for this, the kernel scheduler
won't waste time load balancing the batch schedulers big cpusets that
don't need it.  With the 'sched_load_balance' flag defined the way
it is, the batch scheduler won't have to make system-wide decisions
as to sched domain partitioning.  They can just make local 'advisory'
markings on particular cpusets that (1) are or might be big, and (2)
don't hold any active tasks that might need load balancing.  The system
will take it from there, providing the finest granularity sched domain
partitioning that will accomplish that.

I will advise the system admins of bigger systems to turn off
sched_load_balance on the top cpuset, as part of the above work
routinely done to get all non-pinned tasks out of the top cpuset.

I will advise the real time developers using cpusets to: (1) turn off
sched_load_balance on their real time cpusets, and (2) insist that
the sys admins using their products turn off sched_load_balance on
the top cpuset, to ensure the expected realtime performance is obtained.

Most systems, even medium size ones (for some definition of medium,
perhaps dozens of CPUs?) so long as they aren't running realtime on
some CPUs, can just run with the default - one big fat load balanced
sched domain ... unless of course they have some other need not
considered above.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 19:25                     ` Nick Piggin
@ 2007-10-03 12:14                       ` Paul Jackson
  2007-10-02 19:53                         ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 12:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

> These are what I'm worried about, and things like kswapd, pdflush,
> could definitely use a huge amount of CPU.
> 
> If you are interested in hard partitioning the system, you most
> definitely want these things to be balanced across the non-isolated
> CPUs.

But these guys are pinned anyway (or else they would already be moved
into a smaller load balanced cpuset), so why waste time load balancing
what can't move?

And on some of the systems I care about, we don't want to load balance
these guys; rather we go to great lengths to see that they don't run at
all when we don't want them to.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 17:56                 ` Nick Piggin
  2007-10-03 11:38                   ` Paul Jackson
@ 2007-10-03 12:17                   ` Paul Jackson
  2007-10-02 20:31                     ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 12:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> OK, so I don't exactly understand you either. To make it simple, can
> you give a concrete example of a cpuset hierarchy that wouldn't
> work?

It's more a matter of knowing how my third party batch scheduler
coders think.  They will be off in some corner of their code with a
cpuset in hand that they know is just being used to hold inactive
(paused) tasks, and they can likely be persuaded to mark those cpusets
as not being in need of any wasted CPU cycles load balancing them.

But these inactive cpusets will overlap in unknown (to them at
the time, in that piece of code) ways with other cpusets holding
active jobs, and there is no chance, unless it is a matter of major
performance impact, that they will be in any position to comment on
the proper partitioning of the sched domains on all the CPUs under the
control of their batch scheduler, much less comment on the partitioning
of the rest of the system.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 19:53                         ` Nick Piggin
@ 2007-10-03 12:41                           ` Paul Jackson
  2007-10-02 20:30                             ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 12:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

> pdflush
> is not pinned at all and can be dynamically created and destroyed. Ditto
> for kjournald, as well as many others.

Whatever is not pinned is moved out of the top cpuset, on the kind of
systems I'm most familiar with.  They are put in a smaller cpuset, with
load balancing, that is sized for the workload they might present, but
kept separate from the main jobs.

> Basically: it doesn't feel like a satisfactory solution to brush
> these under the carpet.

We don't do a whole lot of brushing under the carpet on these kind of
systems.  If I gave you the impression we do, then I misled you - sorry.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 20:31                     ` Nick Piggin
@ 2007-10-03 17:44                       ` Paul Jackson
  0 siblings, 0 replies; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 17:44 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> There won't be any CPU cycles used, if the tasks are paused (surely
> they're not spin waiting).

Consider the case when there are two, smaller, non-overlapping cpusets
with active jobs, and one larger cpuset, covering both those smaller
ones, with only paused tasks.

If we realize we don't need to balance the larger cpuset, then we can
have two smaller sched domains rather than one larger one.

Since the CPU cycle cost of load balancing increases more than linearly
with the size of the sched domain, therefore it will save CPU cycles to
have the two smaller ones, rather than the one larger one.

If user space can just tell us that the larger cpuset doesn't need
balancing, then the kernel has enough information to perform this
optimization.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] cpuset and sched domains: sched_load_balance flag
  2007-10-02 20:30                             ` Nick Piggin
@ 2007-10-03 17:46                               ` Paul Jackson
  0 siblings, 0 replies; 38+ messages in thread
From: Paul Jackson @ 2007-10-03 17:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: akpm, menage, linux-kernel, dino, cpw, mingo

Nick wrote:
> So if a new pdflush is spawned, it get's moved to some cpuset? That
> probably isn't something these realtime systems want to do (ie. the
> non-realtime portion probably doesn't want to have any sort of scheduler
> or even worry about cpusets at all).

No - the new pdflush is put in the same cpuset as its parent, with a
patch that I sent in early this year.  See the following code in
mm/pdflush.c:

        /*
         * Some configs put our parent kthread in a limited cpuset,
         * which kthread() overrides, forcing cpus_allowed == CPU_MASK_ALL.
         * Our needs are more modest - cut back to our cpusets cpus_allowed.
         * This is needed as pdflush's are dynamically created and destroyed.
         * The boottime pdflush's are easily placed w/o these 2 lines.
         */
        cpus_allowed = cpuset_cpus_allowed(current);
        set_cpus_allowed(current, cpus_allowed);

        return __pdflush(&my_work);

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2007-10-03 17:46 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-30 10:44 [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
2007-09-29 19:21 ` Nick Piggin
2007-09-30 18:07   ` Paul Jackson
2007-09-30  3:34     ` Nick Piggin
2007-10-01  3:42       ` Paul Jackson
2007-10-02 13:05         ` Nick Piggin
2007-10-03  6:58           ` Paul Jackson
2007-10-02 16:09             ` Nick Piggin
2007-10-03  9:55               ` Paul Jackson
2007-10-02 17:56                 ` Nick Piggin
2007-10-03 11:38                   ` Paul Jackson
2007-10-02 19:25                     ` Nick Piggin
2007-10-03 12:14                       ` Paul Jackson
2007-10-02 19:53                         ` Nick Piggin
2007-10-03 12:41                           ` Paul Jackson
2007-10-02 20:30                             ` Nick Piggin
2007-10-03 17:46                               ` Paul Jackson
2007-10-03 12:17                   ` Paul Jackson
2007-10-02 20:31                     ` Nick Piggin
2007-10-03 17:44                       ` Paul Jackson
2007-10-01 18:15       ` Paul Jackson
2007-10-02 13:35         ` Nick Piggin
2007-10-03  6:22           ` [patch] sched: fix sched-domains partitioning by cpusets Ingo Molnar
2007-10-03  6:56             ` Paul Jackson
2007-10-02 15:46               ` Nick Piggin
2007-10-03  9:21                 ` Paul Jackson
2007-10-02 17:23                   ` Nick Piggin
2007-10-03 10:08                     ` Paul Jackson
2007-10-03  9:35                   ` Ingo Molnar
2007-10-03  9:39                     ` Paul Jackson
2007-10-02 17:29                       ` Nick Piggin
2007-10-03  7:20               ` Ingo Molnar
2007-10-03  7:25           ` [PATCH] cpuset and sched domains: sched_load_balance flag Paul Jackson
2007-10-02 16:14             ` Nick Piggin
2007-09-30 10:44 ` [PATCH] cpuset decrustify update and validate masks Paul Jackson
2007-09-30 17:33 ` [PATCH] cpuset and sched domains: sched_load_balance flag Ingo Molnar
2007-10-02 20:22 ` Randy Dunlap
2007-10-02 20:57   ` Paul Jackson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox