public inbox for linux-kernel@vger.kernel.org
* Re: a quest for a better scheduler
  2001-04-04 13:43 a quest for a better scheduler Hubertus Franke
@ 2001-04-04 13:25 ` Ingo Molnar
  2001-04-04 13:34 ` Ingo Molnar
  2001-04-04 15:44 ` Khalid Aziz
  2 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2001-04-04 13:25 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Mike Kravetz, Fabio Riccardi, Linux Kernel List, lse-tech


On Wed, 4 Apr 2001, Hubertus Franke wrote:

> It is not clear that yielding the same decision as the current
> scheduler is the ultimate goal to shoot for, but it allows
> comparison.

obviously the current scheduler is not cast into stone; it never was,
never will be.

but determining whether the current behavior is possible in a different
scheduler design is surely a good metric of how flexible that different
scheduler design is.

	Ingo


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: a quest for a better scheduler
  2001-04-04 13:43 a quest for a better scheduler Hubertus Franke
  2001-04-04 13:25 ` Ingo Molnar
@ 2001-04-04 13:34 ` Ingo Molnar
  2001-04-04 15:08   ` Andrea Arcangeli
  2001-04-04 16:39   ` Kanoj Sarcar
  2001-04-04 15:44 ` Khalid Aziz
  2 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2001-04-04 13:34 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Mike Kravetz, Fabio Riccardi, Linux Kernel List, lse-tech


On Wed, 4 Apr 2001, Hubertus Franke wrote:

> Another point to raise is that the current scheduler does an exhaustive
> search for the "best" task to run. It touches every process in the
> runqueue. This is ok if the runqueue length is limited to a very small
> multiple of the #cpus. [...]

indeed. The current scheduler handles UP and SMP systems, up to 32
(perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a different
approach anyway in many other subsystems too; Kanoj is doing some
scheduler work in that area.

but the original claim was that the scheduling of thousands of runnable
processes (which is not equal to having thousands of sleeping processes)
must perform well - which is a completely different issue.

	Ingo



* Re: a quest for a better scheduler
@ 2001-04-04 13:43 Hubertus Franke
  2001-04-04 13:25 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Hubertus Franke @ 2001-04-04 13:43 UTC (permalink / raw)
  To: Ingo Molnar, Mike Kravetz; +Cc: Fabio Riccardi, Linux Kernel List, lse-tech



This is an important point that Mike is raising, and it also addresses a
critique that Ingo issued yesterday, namely interactivity and fairness.
The HP scheduler completely separates the per-CPU runqueues and does
not take preemption goodness or the like into account. This can lead to
unfair apportioning of CPU cycles, strong priority inversion and a
potential lack of interactivity.

Our MQ scheduler does yield the same decision in most cases
(other than where defined by some race condition on locks and counter
members).

It is not clear that yielding the same decision as the current scheduler
is the ultimate goal to shoot for, but it allows comparison.

Another point to raise is that the current scheduler does an exhaustive
search for the "best" task to run. It touches every process in the
runqueue. This is ok if the runqueue length is limited to a very small
multiple of the #cpus. But that is not what high-end server systems
encounter.
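For readers who don't have the 2.4-era code in front of them, here is a rough user-space sketch of that exhaustive scan. It is an illustration, not the kernel's code: the real goodness() also weighs nice values, mm sharing and realtime policy, and `struct task`, `goodness` and `pick_next` here are simplified stand-ins.

```c
#include <stddef.h>

/* Simplified runnable task: counter is the remaining timeslice,
 * nice_bonus stands in for the static-priority contribution. */
struct task {
    int counter;
    int nice_bonus;
    int last_cpu;
};

/* Simplified goodness(): prefer tasks with timeslice left and give a
 * small boost for cache affinity with the deciding CPU. */
static int goodness(const struct task *t, int this_cpu)
{
    int weight = t->counter;
    if (weight) {
        weight += t->nice_bonus;
        if (t->last_cpu == this_cpu)
            weight += 1;        /* stands in for PROC_CHANGE_PENALTY */
    }
    return weight;
}

/* The O(nr_running) scan under discussion: every runnable task is
 * touched once per scheduling decision, under the runqueue lock. */
static struct task *pick_next(struct task *rq, size_t n, int this_cpu)
{
    struct task *best = NULL;
    int best_weight = -1;
    for (size_t i = 0; i < n; i++) {
        int w = goodness(&rq[i], this_cpu);
        if (w > best_weight) {
            best_weight = w;
            best = &rq[i];
        }
    }
    return best;
}
```

With a handful of runnable tasks this loop is cheap; with thousands, every schedule() pays for all of them.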

With the rising number of processors, lock contention can quickly become
a bottleneck. If we assume that load (#running-tasks) increases somewhat
linearly with #cpus, the problem gets even worse, because not only have I
increased the number of clients but also the lock hold time.

Clinging to the assumption that #running-tasks ~ #cpus of course saves
you from that dilemma, but not everybody is signing on to this limitation.

MQ and priority lists help in two ways.

MQ reduces the average lock hold time because on average it only inspects
#running-tasks/#cpus tasks to make a local decision. It then goes on to
inspect (#cpus-1) data structures representing the next best tasks to run
on those remote cpus, all without holding the lock, thus substantially
reducing lock contention. Note we still search the entire set of runnable
tasks; however, we do it in a distributed, collaborative manner.
The only time we deviate from the current scheduler decision is in the
case when two cpus have identified the same remote task as a target for
remote stealing. In that case one will win and the other cpu will continue
looking somewhere else, although there might have been another task
on that cpu to steal.
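The two-phase decision described above can be sketched as follows. This is illustrative user-space C, not the actual MQ patch code; the `NCPUS` constant, the `top_weight` hint field and the function names are assumptions of the sketch.

```c
#include <stddef.h>

#define NCPUS 4

struct mq_task {
    int weight;             /* precomputed goodness-style weight */
};

/* Per-CPU runqueue plus a "best local candidate" hint. In the real
 * scheme the hint is updated under the local lock but read by remote
 * CPUs without taking any lock. */
struct cpu_rq {
    struct mq_task *tasks;
    size_t nr;
    int top_weight;         /* weight of the best runnable task here */
};

/* Phase 1: exhaustive scan, but only of the local queue
 * (~#running-tasks/#cpus entries on average), under the local lock. */
static int local_best(const struct cpu_rq *rq)
{
    int best = -1;
    for (size_t i = 0; i < rq->nr; i++)
        if (rq->tasks[i].weight > best)
            best = rq->tasks[i].weight;
    return best;
}

/* Phase 2: compare against the (NCPUS-1) remote hints without taking
 * any remote lock; return the CPU to steal from, or -1 to run local. */
static int choose_steal_cpu(struct cpu_rq rqs[NCPUS], int this_cpu)
{
    int best = local_best(&rqs[this_cpu]);
    int victim = -1;
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (cpu == this_cpu)
            continue;
        if (rqs[cpu].top_weight > best) {
            best = rqs[cpu].top_weight;
            victim = cpu;
        }
    }
    return victim;
}
```

Because phase 2 reads stale hints lock-free, two CPUs can pick the same victim; that is exactly the race the text describes, resolved by letting one winner take the task.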

Priority-list schedulers (PRS) only help in reducing lock hold time,
which can result in some relief wrt lock contention, but not a whole lot.
PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY.
It also keeps zero-counter tasks in a list that is never inspected. One can
even go further and put YIELD tasks in a separate list, given that
sys_sched_yield() already does some optimizations.
The older version (12/00) posted on LSE is functionally equivalent to the
current scheduler.
I will put up another version this week that is based on a bitmask and
which is a bit more aggressive in that it ignores the MM goodness boost of
1, which in my book is merely a tie breaker between two tasks of equal
goodness.
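A generic sketch of the bitmask idea (not the posted PRS code; the structure and names below are invented for illustration): keep one list per priority plus a bitmap of non-empty lists, so picking the next task is "find highest set bit" instead of walking every runnable task.

```c
#include <stdint.h>

#define NR_PRIO 32

/* One list head per priority plus a bitmap of non-empty lists. */
struct prio_rq {
    uint32_t bitmap;        /* bit p set => list p is non-empty */
    int      head[NR_PRIO]; /* task id at the head of each list */
};

static void prio_enqueue(struct prio_rq *rq, int prio, int task_id)
{
    rq->head[prio] = task_id;          /* sketch: one task per list */
    rq->bitmap |= UINT32_C(1) << prio;
}

/* Highest priority wins: a couple of bit operations, independent of
 * how many tasks are runnable. __builtin_clz is a GCC/Clang builtin
 * (undefined for 0, hence the empty-bitmap guard). */
static int prio_pick(const struct prio_rq *rq)
{
    if (!rq->bitmap)
        return -1;                     /* nothing runnable */
    int prio = 31 - __builtin_clz(rq->bitmap);
    return rq->head[prio];
}
```

The search cost becomes O(1) in the number of runnable tasks, which is the whole point of replacing the list walk.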

Beyond that, we have done some work on cpu pooling, which is to identify
a set of cpus that form a scheduling set. We still consider all cpus for
preemption in reschedule_idle, but in schedule() it is sufficient
to only schedule within one's own set. That again can limit lock hold time
with MQ, and we have seen some improvements. We also deploy load balancing.

To summarize, we have taken great care to try to preserve the current
scheduler semantics, and have laid out a path to relax some of the
semantics for further improvements.
I don't believe that the HP scheduler is sufficient, since it lacks
load balancing, which naturally occurs in our MQ scheduler, and it lacks
the interactivity requirements that Ingo pointed out.

Most of these things are discussed in great detail in the writeups under
lse.sourceforge.net/scheduling. I also believe we show there that the
MQ performance for low thread counts matches the vanilla case.

I further don't understand the obsession with keeping the scheduler simple.
If there are improvements, and I don't believe the MQ is all that
complicated (and it is well documented), why not put it in?


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Mike Kravetz <mkravetz@sequent.com> on 04/03/2001 10:47:00 PM

To:   Fabio Riccardi <fabio@chromium.com>
cc:   Mike Kravetz <mkravetz@sequent.com>, Ingo Molnar <mingo@elte.hu>,
      Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List
      <linux-kernel@vger.kernel.org>, Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject:  Re: a quest for a better scheduler



On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
>
> I have measured the HP and not the "scalability" patch because the two do
> more or less the same thing and give me the same performance advantages,
> but the former is a lot simpler and I could port it with no effort on any
> recent kernel.

Actually, there is a significant difference between the HP patch and
the one I developed.  In the HP patch, if there is a schedulable task
on the 'local' (current CPU) runqueue it will ignore runnable tasks on
other (remote) runqueues.  In the multi-queue patch I developed, the
scheduler always attempts to make the same global scheduling decisions
as the current scheduler.

--
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center





* Re: a quest for a better scheduler
  2001-04-04 13:34 ` Ingo Molnar
@ 2001-04-04 15:08   ` Andrea Arcangeli
  2001-04-04 16:50     ` [Lse-tech] " Kanoj Sarcar
  2001-04-04 16:39   ` Kanoj Sarcar
  1 sibling, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2001-04-04 15:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Hubertus Franke, Mike Kravetz, Fabio Riccardi, Linux Kernel List,
	lse-tech

On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
> 
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
> 
> > Another point to raise is that the current scheduler does an exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. This is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
> 
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a different
> approach anyway in many other subsystems too; Kanoj is doing some
> scheduler work in that area.

I didn't see anything from Kanoj, but I did something myself for the wildfire:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

this is mostly a userspace issue, not really intended as a kernel optimization
(however it's also partly a kernel optimization). Basically it splits the load
of the NUMA machine into per-node load; there can be unbalanced load across the
nodes, but fairness is guaranteed inside each node. It's not extremely well
tested, but benchmarks were ok and it is at least certainly stable.

However, Ingo, consider that on a 32-way, if you don't have at least 32 tasks
running all the time _always_, you're really stupid paying such big money for
nothing ;). So the fact that the scheduler is optimized for 1-2 tasks running
all the time is not nearly enough for those machines (and of course the
scheduling rate also automatically increases linearly with the number of
cpus). Now it's perfectly fine that we don't ask the embedded and
desktop guys to pay for that, but a kernel configuration option to select an
algorithm that scales would be a good idea IMHO. The above patch just adds a
CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be
hurt by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea


* Re: a quest for a better scheduler
  2001-04-04 13:43 a quest for a better scheduler Hubertus Franke
  2001-04-04 13:25 ` Ingo Molnar
  2001-04-04 13:34 ` Ingo Molnar
@ 2001-04-04 15:44 ` Khalid Aziz
  2001-04-04 15:55   ` [Lse-tech] " Christoph Hellwig
  2 siblings, 1 reply; 18+ messages in thread
From: Khalid Aziz @ 2001-04-04 15:44 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: Ingo Molnar, Mike Kravetz, Fabio Riccardi, Linux Kernel List,
	lse-tech

Hubertus Franke wrote:
> 
> This is an important point that Mike is raising and it also addresses a
> critique that Ingo issued yesterday, namely interactivity and fairness.
> The HP scheduler completely separates the per-CPU runqueues and does
> not take preemption goodness or alike into account. This can lead to
> unfair proportionment of CPU cycles, strong priority inversion and a
> potential
> lack of interactivity.
> 
> Our MQ scheduler does yield the same decision in most cases
> (other than defined by some race condition on locks and counter members)
> 

Let me stress that the HP scheduler is not meant to be a replacement for
the current scheduler. The HP scheduler patch allows the current scheduler
to be replaced by another scheduler by loading a module in special cases.
HP is providing three different loadable scheduler modules -
processor sets, constant-time scheduler, and multi-runqueue scheduler.
Each one of these is geared towards a specific requirement. I would not
suggest using any of these for a generalized case. The processor-sets
scheduler is designed to make scheduling decisions on a per-cpu basis
and not a global basis.

All we are trying to do is to make the current scheduler modular, so we
CAN load an alternate scheduling policy module in cases where the process
mix, or the site policy, requires a different scheduling policy. An
example of a specific site processor allocation policy could be a site
that runs a database server on an MP machine along with a few other
processes, where the administrator wants to guarantee that the database
server process always gets x percent of processing time, or that one CPU
be dedicated to just the database server. A policy like this is not meant
to be fair and, of course, is not a policy we want to impose upon others.

The only HP changes I would put in the kernel sources for general release
would be the changes to the scheduler that allow such alternate (not
necessarily fair, or the fastest for benchmarks, a general process mix or
1000's of processes) policies to be loaded. When a policy module is not
loaded, the scheduler works exactly like the current scheduler even after
the HP patches. There are people who could benefit from being able to load
alternate policy schedulers; Fabio Riccardi happens to be one of them :-)
Anyone who does not want to even allow an alternate scheduler module to be
loaded can simply compile the alternate scheduler support out, and that is
how I would expect most kernels to be compiled, especially the ones that
ship with various distributions. I would like the decision to include
support for an alternate scheduler to be made by sysadmins themselves for
their individual cases.

--
Khalid
 
====================================================================
Khalid Aziz                             Linux Development Laboratory
(970)898-9214                                        Hewlett-Packard
khalid@fc.hp.com                                    Fort Collins, CO


* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 15:44 ` Khalid Aziz
@ 2001-04-04 15:55   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2001-04-04 15:55 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: Hubertus Franke, Ingo Molnar, Mike Kravetz, Fabio Riccardi,
	Linux Kernel List, lse-tech

On Wed, Apr 04, 2001 at 09:44:22AM -0600, Khalid Aziz wrote:
> Let me stress that HP scheduler is not meant to be a replacement for the
> current scheduler. The HP scheduler patch allows the current scheduler
> to be replaced by another scheduler by loading a module in special
> cases.

HP also has a simple mq patch that is _not_ integrated into the pluggable
scheduler framework, I have used it myself.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.


* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 13:34 ` Ingo Molnar
  2001-04-04 15:08   ` Andrea Arcangeli
@ 2001-04-04 16:39   ` Kanoj Sarcar
  2001-04-04 17:00     ` Andrea Arcangeli
  1 sibling, 1 reply; 18+ messages in thread
From: Kanoj Sarcar @ 2001-04-04 16:39 UTC (permalink / raw)
  To: mingo
  Cc: Hubertus Franke, Mike Kravetz, Fabio Riccardi, Linux Kernel List,
	lse-tech

> 
> 
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
> 
> > Another point to raise is that the current scheduler does an exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. This is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
> 
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a different
> approach anyway in many other subsystems too; Kanoj is doing some
> scheduler work in that area.

Actually, not _much_ work has been done in this area. Along with a bunch
of other people, I have some ideas about what needs to be done. For
example, for NUMA, we need to try hard to schedule a thread on the
node that has most of its memory (for no reason other than to decrease
memory latency). Independently, some NUMA machines build in multilevel
caches and local snoops, which also means that specific processors on
the same node as the last_processor are also good candidates to run
the process next.

To handle a single layer of shared caches, I have tried certain simple
things, mostly as hacks, but am not pleased with the results yet. More
testing is needed.

Kanoj

> 
> but the original claim was that the scheduling of thousands of runnable
> processes (which is not equal to having thousands of sleeping processes)
> must perform well - which is a completely different issue.
> 
> 	Ingo
> 
> 
> _______________________________________________
> Lse-tech mailing list
> Lse-tech@lists.sourceforge.net
> http://lists.sourceforge.net/lists/listinfo/lse-tech
> 



* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 15:08   ` Andrea Arcangeli
@ 2001-04-04 16:50     ` Kanoj Sarcar
  2001-04-04 17:16       ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Kanoj Sarcar @ 2001-04-04 16:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Hubertus Franke, Mike Kravetz, Fabio Riccardi,
	Linux Kernel List, lse-tech

> 
> I didn't see anything from Kanoj but I did something myself for the wildfire:
> 
> 	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
> 
> this is mostly a userspace issue, not really intended as a kernel optimization
> (however it's also partly a kernel optimization). Basically it splits the load
> of the numa machine into per-node load, there can be unbalanced load across the
> nodes but fairness is guaranteed inside each node. It's not extremely well
> tested but benchmarks were ok and it is at least certainly stable.
>

Just a quick comment. Andrea, unless your machine has some hardware
that implies per-node runqueues will help (node-level caches etc.), I fail
to understand how this is helping you ... here's a simple theory though.
If your system is lightly loaded, your per-node queues are actually
implementing some sort of affinity, making sure processes stick to
cpus on nodes where they have allocated most of their memory. I am
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj


* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 16:39   ` Kanoj Sarcar
@ 2001-04-04 17:00     ` Andrea Arcangeli
  0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2001-04-04 17:00 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: mingo, Hubertus Franke, Mike Kravetz, Fabio Riccardi,
	Linux Kernel List, lse-tech

On Wed, Apr 04, 2001 at 09:39:23AM -0700, Kanoj Sarcar wrote:
> example, for NUMA, we need to try hard to schedule a thread on the 
> node that has most of its memory (for no reason other than to decrease
> memory latency). Independently, some NUMA machines build in multilevel 
> caches and local snoops that also means that specific processors on
> the same node as the last_processor are also good candidates to run 
> the process next.

yes. That will probably need to be optional and chosen by the architecture
at compile time too. Probably the most important factor to consider is the
penalty of accessing remote memory. I think I can say that on all recent and
future machines with a small difference between local and remote memory (and
possibly, as you say, with a decent cache protocol able to snoop cacheline
data from the other cpus even if it's not dirty), it's much better to always
try to keep the task in its last node. My patch actually assumes recent
machines: it keeps the task in its last node if not in the last cpu, keeps
doing memory allocation from there, and forgets about the original node the
task started allocating its memory from. This provided the best performance
during userspace CPU-bound load as far as I can tell, and it also
distributes the load better.

Kanoj, could you also have a look at the NUMA-related common-code MM fixes I
did in this patch? I'd like to get them integrated (just skip the arch/alpha/*
and include/asm-alpha/* stuff while reading the patch; it's totally
orthogonal).

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/00_alpha-numa-1

If you prefer, I can extract them into a finer-grained patch, just dropping
the alpha stuff by hand.

Andrea


* Re: [Lse-tech] Re: a quest for a better scheduler
@ 2001-04-04 17:03 Hubertus Franke
  2001-04-04 17:14 ` Kanoj Sarcar
  0 siblings, 1 reply; 18+ messages in thread
From: Hubertus Franke @ 2001-04-04 17:03 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Linux Kernel List, lse-tech



Kanoj, our cpu pooling + load balancing allows you to do that.
The system administrator can specify at runtime, through a
/proc filesystem interface, the cpu-pool-size and whether load balancing
should take place.
We can put the limiting to the local cpu-set during reschedule_idle
back into the code, to make it complete and compatible with
the approach that Andrea has taken.
This way, one can fully isolate or combine cpu-sets.
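As a toy illustration of the pooling idea (the identifiers below are invented for this sketch and are not taken from the MQpool patch): with a pool size of 4 on an 8-way box, cpus 0-3 and 4-7 form two isolated scheduling sets, and schedule() only scans runqueues whose cpu bit is set in the caller's own pool mask.

```c
#include <stdint.h>

#define NCPUS 8

/* Illustrative only: pools as contiguous CPU bitmasks, derived from a
 * single pool_size knob (which the /proc interface would set). */
static uint32_t pool_mask_for(int cpu, int pool_size)
{
    int first = (cpu / pool_size) * pool_size;
    uint32_t mask = 0;
    for (int i = first; i < first + pool_size && i < NCPUS; i++)
        mask |= UINT32_C(1) << i;
    return mask;
}

/* schedule() would then scan only runqueues of CPUs in the caller's
 * pool; reschedule_idle() may still consider all CPUs. */
static int same_pool(int a, int b, int pool_size)
{
    return (pool_mask_for(a, pool_size) >> b) & 1u;
}
```

With pool_size == NCPUS this degenerates to one global set, i.e. the current behavior, which is how "fully isolate or combine" falls out of one knob.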

here is the code for the pooling:
http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool

load balancing and the /proc interface are combined in this module:
http://lse.sourceforge.net/scheduling/LB/loadbalance.c

a writeup explaining this concept is available under
http://lse.sourceforge.net/scheduling/LB/poolMQ.html

a prerequisite is the MQ scheduler:
http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched

We need to update these for 2.4.3 .... (coming)

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar <kanoj@google.engr.sgi.com>@lists.sourceforge.net on
04/04/2001 12:50:58 PM

Sent by:  lse-tech-admin@lists.sourceforge.net


To:   andrea@suse.de (Andrea Arcangeli)
cc:   mingo@elte.hu (Ingo Molnar), Hubertus Franke/Watson/IBM@IBMUS,
      mkravetz@sequent.com (Mike Kravetz), fabio@chromium.com (Fabio
      Riccardi), linux-kernel@vger.kernel.org (Linux Kernel List),
      lse-tech@lists.sourceforge.net
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler



>
> I didn't see anything from Kanoj but I did something myself for the wildfire:
>
> 	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
>
> this is mostly a userspace issue, not really intended as a kernel optimization
> (however it's also partly a kernel optimization). Basically it splits the load
> of the numa machine into per-node load, there can be unbalanced load across the
> nodes but fairness is guaranteed inside each node. It's not extremely well
> tested but benchmarks were ok and it is at least certainly stable.
>

Just a quick comment. Andrea, unless your machine has some hardware
that implies per-node runqueues will help (node-level caches etc.), I fail
to understand how this is helping you ... here's a simple theory though.
If your system is lightly loaded, your per-node queues are actually
implementing some sort of affinity, making sure processes stick to
cpus on nodes where they have allocated most of their memory. I am
not sure what the situation will be under huge loads though.

As I have mentioned to some people before, percpu/pernode/percpuset/global
runqueues probably all have their advantages and disadvantages, and their
own sweet spots. Wouldn't it be really neat if a system administrator
or performance expert could pick and choose what scheduler behavior he
wants, based on how the system is going to be used?

Kanoj

_______________________________________________
Lse-tech mailing list
Lse-tech@lists.sourceforge.net
http://lists.sourceforge.net/lists/listinfo/lse-tech





* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 17:03 Hubertus Franke
@ 2001-04-04 17:14 ` Kanoj Sarcar
  0 siblings, 0 replies; 18+ messages in thread
From: Kanoj Sarcar @ 2001-04-04 17:14 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Linux Kernel List, lse-tech

> 
> 
> 
> Kanoj, our cpu-pooling + loadbalancing allows you to do that.
> The system administrator can specify at runtime through a
> /proc filesystem interface the cpu-pool-size and whether load balancing
> should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants. 

Secondly, from the kernel standpoint, there need to be per-arch
hooks, to be able to utilize node-level/multilevel caches, NUMA
aspects etc.

Kanoj


* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 16:50     ` [Lse-tech] " Kanoj Sarcar
@ 2001-04-04 17:16       ` Andrea Arcangeli
  2001-04-04 17:49         ` Kanoj Sarcar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2001-04-04 17:16 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Ingo Molnar, Hubertus Franke, Mike Kravetz, Fabio Riccardi,
	Linux Kernel List, lse-tech

On Wed, Apr 04, 2001 at 09:50:58AM -0700, Kanoj Sarcar wrote:
> > 
> > I didn't see anything from Kanoj but I did something myself for the wildfire:
> > 
> > 	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
> > 
> > this is mostly a userspace issue, not really intended as a kernel optimization
> > (however it's also partly a kernel optimization). Basically it splits the load
> > of the numa machine into per-node load, there can be unbalanced load across the
> > nodes but fairness is guaranteed inside each node. It's not extremely well
> > tested but benchmarks were ok and it is at least certainly stable.
> >
> 
> Just a quick comment. Andrea, unless your machine has some hardware
> that imply pernode runqueues will help (nodelevel caches etc), I fail 
> to understand how this is helping you ... here's a simple theory though. 

It helps by keeping the task in the same node if it cannot keep it in
the same cpu anymore.

Assume task A is sleeping and last ran on cpu 8 of node 2. It gets a wakeup,
it gets running, and for some reason cpu 8 is busy while there are other
cpus idle in the system. With the current scheduler it can be moved to any
cpu in the system; with the numa sched applied, we will first try to
reschedule it on the idle cpus of node 2, for example. The per-node
runqueues are mainly necessary to implement this heuristic.
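The heuristic can be sketched in a few lines of user-space C. This is an illustrative model, not the 10_numa-sched code: the CPUS_PER_NODE layout and the idle[] array are assumptions of the sketch.

```c
#define NCPUS 8
#define CPUS_PER_NODE 4          /* cpu / CPUS_PER_NODE = node id */

/* idle[cpu] == 1 means that cpu is idle. Prefer the last CPU, then
 * any idle CPU on the last node, then any idle CPU elsewhere. */
static int numa_wakeup_cpu(const int idle[NCPUS], int last_cpu)
{
    if (idle[last_cpu])
        return last_cpu;         /* best case: same cpu */

    int node = last_cpu / CPUS_PER_NODE;
    int first = node * CPUS_PER_NODE;
    for (int cpu = first; cpu < first + CPUS_PER_NODE; cpu++)
        if (idle[cpu])
            return cpu;          /* stay on the home node */

    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (idle[cpu])
            return cpu;          /* cross-node only as a last resort */
    return -1;                   /* nobody idle: stay queued */
}
```

The middle loop is what the per-node runqueues make cheap: the set of candidate CPUs to check first is exactly the home node's.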

> cpus on nodes where they have allocated most of their memory on. I am 
> not sure what the situation will be under huge loads though.

after all cpus are busy we try to reschedule only on the cpus of the local
node; that's why it can generate some imbalance, yes, but it will tend to
rebalance over time, because the less loaded node will be the first to end
up with all of its tasks at zero counter, and it will then start pulling
tasks with has_cpu == 0 out of the runqueues of the other nodes.

You may want to give it a try on your machines and see what difference it
makes; I'd be curious to know, of course.

Andrea


* Re: [Lse-tech] Re: a quest for a better scheduler
@ 2001-04-04 17:34 Hubertus Franke
  0 siblings, 0 replies; 18+ messages in thread
From: Hubertus Franke @ 2001-04-04 17:34 UTC (permalink / raw)
  To: Kanoj Sarcar; +Cc: Linux Kernel List, lse-tech


Correct, that's true.

Our patch does various things:
(a) limits the search for a task to an admin-specified set of cpus
    during schedule();
(b) limits the search for a preemptable task to another set of cpus
    during reschedule_idle()
    <need to reactivate this functionality; ~10 lines of code>;
(c) load balancing, i.e. moving tasks from queue to queue.
    Currently we balance within a set and across sets.

Obviously in NUMA one could specify
     (a) such that multiple sets fall into the same node, with
         no node crossings;
     (b) this set to at least span a node;
     (c) some intelligent moving based on memory maps etc.

I guess (c) would be the first instance where to plug in
architecture-dependent information, e.g. how much memory footprint a task
has on a particular node and how much the moving would cost.
The load balancing we provide is a simple skeleton to tickle your mind,
not a solution. Nevertheless, one can see it can have some impact.

See results for various combinations of pool sizes and balancings:
http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Kanoj Sarcar <kanoj@google.engr.sgi.com> on 04/04/2001 01:14:28 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   linux-kernel@vger.kernel.org (Linux Kernel List),
      lse-tech@lists.sourceforge.net
Subject:  Re: [Lse-tech] Re: a quest for a better scheduler



>
>
>
> Kanoj, our cpu-pooling + loadbalancing allows you to do that.
> The system administrator can specify at runtime through a
> /proc filesystem interface the cpu-pool-size and whether load balancing
> should take place.

Yes, I think this approach can support the various requirements
put on the scheduler.

I think there are two degrees of freedom that are needed in the
scheduler. One, as you say, for the sysadmin to be able to specify
what overall scheduler behavior he wants.

Secondly, from the kernel standpoint, there needs to be perarch
hooks, to be able to utilize nodelevel/multilevel caches, NUMA
aspects etc.

Kanoj





* Re: [Lse-tech] Re: a quest for a better scheduler
@ 2001-04-04 17:40 Paul McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul McKenney @ 2001-04-04 17:40 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Andrea Arcangeli, Fabio Riccardi, Hubertus Franke,
	Linux Kernel List, lse-tech, lse-tech-admin, Ingo Molnar,
	Mike Kravetz


> Just a quick comment. Andrea, unless your machine has some hardware
> that implies per-node runqueues will help (node-level caches etc.), I fail
> to understand how this is helping you ... here's a simple theory though.
> If your system is lightly loaded, your per-node queues are actually
> implementing some sort of affinity, making sure processes stick to
> cpus on nodes where they have allocated most of their memory. I am
> not sure what the situation will be under huge loads though.

Exactly.  If a given task has run on a particular node for a while,
its memory will tend to be allocated on that node.  So preferentially
running it on another CPU on that same node should get you better
memory latency than running it on some other node's CPUs.

In addition, continuing to run the task on a particular node means
that more of that task's memory is from that node, which again means
good memory latency.  In contrast, if you move a task back and forth
between nodes, it can end up with its memory spread over many nodes,
which means that it does not get good memory latency no matter where
you run it.
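The effect is easy to put into numbers. With purely hypothetical latencies (100ns local, 300ns remote; illustrative, not measurements of any machine), the average access latency is just the locality-weighted mix:

```c
/* Back-of-the-envelope model for the argument above: a task whose
 * pages are mostly node-local sees a far better average latency than
 * one whose pages are spread evenly across nodes. */
static double avg_latency_ns(double local_fraction,
                             double local_ns, double remote_ns)
{
    return local_fraction * local_ns
         + (1.0 - local_fraction) * remote_ns;
}
```

For example, 50% local memory averages 200ns per access in this model, while 25% local (memory spread over four nodes) averages 250ns, so bouncing a task between nodes costs it even when it eventually settles somewhere.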

                              Thanx, Paul

> As I have mentioned to some people before, percpu/pernode/percpuset/global
> runqueues probably all have their advantages and disadvantages, and their
> own sweet spots. Wouldn't it be really neat if a system administrator
> or performance expert could pick and choose what scheduler behavior he
> wants, based on how the system is going to be used?
>
> Kanoj


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 17:16       ` Andrea Arcangeli
@ 2001-04-04 17:49         ` Kanoj Sarcar
  2001-04-04 18:00           ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Kanoj Sarcar @ 2001-04-04 17:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Ingo Molnar, Hubertus Franke, Mike Kravetz, Fabio Riccardi,
	Linux Kernel List, lse-tech

> 
> It helps by keeping the task in the same node if it cannot keep it in
> the same cpu anymore.
> 
> Assume task A is sleeping and it last ran on cpu 8 node 2. It gets a wakeup
> and it gets running, and for some reason cpu 8 is busy and there are other
> cpus idle in the system. Now with the current scheduler it can be moved to any
> cpu in the system; with the numa sched applied we will first try to reschedule
> it on the idle cpus of node 2, for example. The per-node runqueues are mainly
> necessary to implement the heuristic.
>

Yes. But this is not the best solution, if I can add on to the example
and make some assumptions.

Imagine that most of the program's memory is on node 1, it was scheduled
on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other
higher priority processes took over other cpus on node 1, etc). 

Then, your patch will try to keep the process on node 2, which is not
necessarily the best solution. Of course, as I mentioned before, if
you have a node local cache on node 2, that cache might have been warmed
enough to make scheduling on node 2 a good option. 

I am not saying there is a wrong or right answer, there are so many
possibilities, everything probably works and breaks under different
circumstances. 

Btw, while we are swapping patches, the patch at

	http://oss.sgi.com/projects/numa/download/sched242.patch

tries to implement per-arch scheduling. The current scheduler behavior
lives in smp_arch_goodness() and smp_pick_cpu(), and the patch allows
a specific platform to replace them with something else.

Linus has seen this patch, and agrees to it in principle. He does not
consider this 2.4 material though. Of course, I am completely open to
Ingo (or someone else) coming up with a different way of providing the
same freedom to arch specific code.

Kanoj

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 17:49         ` Kanoj Sarcar
@ 2001-04-04 18:00           ` Andrea Arcangeli
  2001-04-05 11:13             ` Zdenek Kabelac
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2001-04-04 18:00 UTC (permalink / raw)
  To: Kanoj Sarcar
  Cc: Ingo Molnar, Hubertus Franke, Mike Kravetz, Fabio Riccardi,
	Linux Kernel List, lse-tech

On Wed, Apr 04, 2001 at 10:49:04AM -0700, Kanoj Sarcar wrote:
> Imagine that most of the program's memory is on node 1, it was scheduled
> on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other
> higher priority processes took over other cpus on node 1, etc). 
> 
> Then, your patch will try to keep the process on node 2, which is not
> necessarily the best solution. Of course, as I mentioned before, if
> you have a node local cache on node 2, that cache might have been warmed
> enough to make scheduling on node 2 a good option. 

Exactly, it made it a good option, and more importantly this heuristic
can only improve performance compared to the mainline scheduler.

In fact I tried rescheduling the task back to its original node and it
dropped performance. Anyway, I cannot claim to have done extensive
research on that; I believe that if we kept some more history of the
node migrations we could tell the right time to push a task back to its
original node, but that was not obvious, and I wanted a simple solution
to boost performance under CPU-bound load to start with.

Andrea

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lse-tech] Re: a quest for a better scheduler
  2001-04-04 18:00           ` Andrea Arcangeli
@ 2001-04-05 11:13             ` Zdenek Kabelac
  0 siblings, 0 replies; 18+ messages in thread
From: Zdenek Kabelac @ 2001-04-05 11:13 UTC (permalink / raw)
  To: Andrea Arcangeli


Hello

Just a dumb idea - why not make the scheduler switchable with modules,
so users could select any scheduler they want?

This should not be that hard, and it would make it easy to replace the
scheduler at runtime, so everyone could easily try what's best for
him/her.

kabi@i.am


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Lse-tech] Re: a quest for a better scheduler
@ 2001-04-05 11:14 alad
  0 siblings, 0 replies; 18+ messages in thread
From: alad @ 2001-04-05 11:14 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: Andrea Arcangeli, linux-kernel



I think this concept is used in Solaris, as it has dynamically loadable
schedulers.




Zdenek Kabelac <kabi@i.am> on 04/05/2001 05:43:15 PM

To:   Andrea Arcangeli <andrea@suse.de>
cc:    (bcc: Amol Lad/HSS)

Subject:  Re: [Lse-tech] Re: a quest for a better scheduler





Hello

Just a dumb idea - why not make the scheduler switchable with modules,
so users could select any scheduler they want?

This should not be that hard, and it would make it easy to replace the
scheduler at runtime, so everyone could easily try what's best for
him/her.

kabi@i.am

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/





^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2001-04-05 11:22 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-04-04 13:43 a quest for a better scheduler Hubertus Franke
2001-04-04 13:25 ` Ingo Molnar
2001-04-04 13:34 ` Ingo Molnar
2001-04-04 15:08   ` Andrea Arcangeli
2001-04-04 16:50     ` [Lse-tech] " Kanoj Sarcar
2001-04-04 17:16       ` Andrea Arcangeli
2001-04-04 17:49         ` Kanoj Sarcar
2001-04-04 18:00           ` Andrea Arcangeli
2001-04-05 11:13             ` Zdenek Kabelac
2001-04-04 16:39   ` Kanoj Sarcar
2001-04-04 17:00     ` Andrea Arcangeli
2001-04-04 15:44 ` Khalid Aziz
2001-04-04 15:55   ` [Lse-tech] " Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2001-04-04 17:03 Hubertus Franke
2001-04-04 17:14 ` Kanoj Sarcar
2001-04-04 17:34 Hubertus Franke
2001-04-04 17:40 Paul McKenney
2001-04-05 11:14 alad

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox