public inbox for linux-kernel@vger.kernel.org
* Re: a quest for a better scheduler
@ 2001-04-04 14:03 Hubertus Franke
  2001-04-04 13:23 ` Ingo Molnar
  2001-04-04 15:12 ` Andrea Arcangeli
  0 siblings, 2 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-04 14:03 UTC (permalink / raw)
  To: mingo; +Cc: Mike Kravetz, Fabio Riccardi, Linux Kernel List


I grant you that the code is not as clean as the current scheduler,
so maybe you missed that part.

For the priority scheduler:
Yes, task_to_qid assumes NO affinity (no cpu, no mm) to determine
the list index to which the task is enqueued.
However, if you wander down to the search_range section of the scheduler,
you see that we do add the "+1" for an equal mm. All I did here was take
the goodness() function apart for search_range in order to insert some
extra code that speeds the search up.
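
For illustration, the tie-breakers in question look roughly like this
(a simplified user-space model; field names only mimic the 2.4 kernel
and this is not the patch code itself):

struct task {
        int counter;            /* remaining timeslice */
        int nice;
        int processor;          /* cpu the task last ran on */
        void *mm;               /* address space */
};

#define PROC_CHANGE_PENALTY 15  /* affinity bonus; value is illustrative */

static int model_goodness(const struct task *p, int this_cpu, void *this_mm)
{
        int weight = p->counter;

        if (weight) {
                weight += 20 - p->nice;
                if (p->processor == this_cpu)
                        weight += PROC_CHANGE_PENALTY;
                if (p->mm == this_mm)
                        weight += 1;    /* the "+1" for an equal mm */
        }
        return weight;
}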

Again, I don't want to lean too hard on the priority scheduler, because
we only did this to have a reference point regarding lock contention and
lock hold time. This is stuff that many operating systems did years ago,
and most of them have since moved on to add MQ on top of it. That is what
the LSE scheduling team is pushing.

I understand the dilemma that the Linux scheduler is in, namely satisfying
the low end at all cost. But I believe that in order for Linux to push
into the high end we need to address the scalability of the current
scheduler.
Simply handwaving and declaring that lots of running tasks is a stupid
idea doesn't get us there.
If you indeed assume that #running-tasks ~ #cpus, then each task gets a
reasonable amount of work and any small overhead incurred should be
amortizable. However, if, as we contend, #running-tasks >> #cpus is a
situation we need to deal with, then living with the current lock
contention can really drag us down.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Ingo Molnar <mingo@elte.hu> on 04/04/2001 02:28:31 AM

Please respond to <mingo@elte.hu>

Sent by:  <mingo@elte.hu>


To:   Mike Kravetz <mkravetz@sequent.com>
cc:   Hubertus Franke/Watson/IBM@IBMUS, Fabio Riccardi
      <fabio@chromium.com>, Linux Kernel List
      <linux-kernel@vger.kernel.org>
Subject:  Re: a quest for a better scheduler




On Tue, 3 Apr 2001, Mike Kravetz wrote:

> Our 'priority queue' implementation uses almost the same goodness
> function as the current scheduler.  The main difference between our
> 'priority queue' scheduler and the current scheduler is the structure
> of the runqueue.  We break up the single runqueue into a set of
> priority based runqueues. The 'goodness' of a task determines what
> sub-queue a task is placed in.  Tasks with higher goodness values are
> placed in higher priority queues than tasks with lower goodness
> values. [...]

we are talking about the same thing, re-read my mail. this design assumes
that 'goodness' is constant in the sense that it does not depend on the
previous process.

and no, you are not using the same goodness() function, you omitted the
prev->mm == next->mm component to goodness(), due to this design
limitation.

     Ingo






* Re: a quest for a better scheduler
@ 2001-04-18 14:50 Yoav Etsion
  0 siblings, 0 replies; 44+ messages in thread
From: Yoav Etsion @ 2001-04-18 14:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tsafrir Dan


I don't want to open any old wounds, but I just got a summary from a
colleague of mine, Dan Tsafrir, who measured the context switch overhead 
on Linux with multiple processes. 

You can find the document at:
http://www.cs.huji.ac.il/~dants/linux-2.2.18-context-switch.ps

The measurements were taken on a quad Pentium III 550MHz IBM NetFinity
server with 1GB RAM. 

Even though this was done on the older 2.2.18 kernel, it is quite
interesting with regard to the current scheduler discussion.


Yoav Etsion


* RE: a quest for a better scheduler
@ 2001-04-06 13:15 Hubertus Franke
  0 siblings, 0 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-06 13:15 UTC (permalink / raw)
  To: Torrey Hoffman; +Cc: 'Timothy D. Witham', Linux Kernel List




Torrey, nothing to worry about (or at least not yet).

The new scheduler only replaces the current SMP scheduler, not the single
cpu scheduler. So the devices that you refer to are not affected at
all by these changes.

The desire for interactivity etc. led us to stick to the currently
proposed MQ scheduler semantics, namely not to completely isolate the
runqueues from each other (as e.g. the HP-MQ does), but to do some cross
checking to ensure that high priority tasks are run when they need to be.

First, you get the same (similarly good or bad) scheduler semantics as the
current one, but at a significantly lower cost for medium to high end
loads (defined as #runnable-tasks > #cpus).

Here is some simple approximate arithmetic. Assume the following:
- Let X be the fixed overhead to get through the scheduler (cur or MQ) once.
- Let Y be the overhead of touching a task to inspect its goodness.
- Let N be the number of runnable tasks.
- Let C be the number of cpus.

The cost for the current scheduler is:  X(cur) + N*Y
The cost for the MQ scheduler is:       X(MQ)  + (N/C)*Y + C*Y
(I assumed equal runqueue lengths here.)

This ballpark math shows that if X(cur) ~ X(MQ), MQ is expected to be
better when (N/C) + C < N, i.e. when  N > C*C/(C-1):

C=2  N>4
C=3  N>4
C=4  N>5
C=5  N>6
C=6  N>7
C=7  N>8
C=8  N>9

It turns out that for C>2 this amounts to: MQ will do better when N > C+1.
And this is in the absence of runqueue lock contention; once runqueue lock
contention surfaces, the balance shifts further in favor of MQ.
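
For the record, here is a tiny user-space program that grinds out that
table; it only restates the ballpark math above and assumes X(cur) ~ X(MQ):

#include <stdio.h>

int main(void)
{
        int c;

        for (c = 2; c <= 8; c++) {
                /* smallest integer N strictly greater than C*C/(C-1) */
                int n = (c * c) / (c - 1) + 1;
                printf("C=%d  MQ cheaper for N >= %d\n", c, n);
        }
        return 0;
}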




Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Torrey Hoffman <torrey.hoffman@myrio.com> on 04/05/2001 07:53:27 PM

Sent by:  linux-kernel-owner@vger.kernel.org


To:   "'Timothy D. Witham'" <wookie@osdlab.org>, Linux Kernel List
      <linux-kernel@vger.kernel.org>
cc:
Subject:  RE: a quest for a better scheduler



Timothy D. Witham wrote :
[...]
> I propose that we work on setting up a straight forward test harness
> that allows developers to quickly test a kernel patch against
> various performance yardsticks.

[...
(proposed large server testbeds)
...]

I like this idea, but could the testbeds also include some
resource-constrained "embedded" or appliance-style systems, and
include performance yardsticks for latency and responsiveness?

It would be unfortunate if work on a revised scheduler resulted
in Linux being a monster web server on 4-way systems, but having
lousy interactive performance on web pads, hand helds, and set top
boxes.

How about a 120 MHz Pentium with 32 MB of RAM and a flash disk, or a
200 MHz PowerPC with 64 MB?  Maybe a Transmeta web pad?

For the process load, something that tests responsiveness and
latency - how about a set of processes doing multicast network
transfers, CPU-intensive MPEG video and audio encode / decode,
and a test app that measures "user response", i.e. if X Windows
was running, would the mouse pointer move smoothly or jerkily?

The "better" scheduler for these applications would make sure that
multicast packets were never dropped, the MPEG decode never dropped
frames, and the "user interface" stayed responsive, etc.

Also, I would not mind a bit if the kernel had tuning options, either
in configuration or through /proc to adjust for these very different
situations.

Torrey Hoffman
Torrey.Hoffman@myrio.com





* RE: a quest for a better scheduler
@ 2001-04-05 23:53 Torrey Hoffman
  0 siblings, 0 replies; 44+ messages in thread
From: Torrey Hoffman @ 2001-04-05 23:53 UTC (permalink / raw)
  To: 'Timothy D. Witham', Linux Kernel List

Timothy D. Witham wrote :
[...] 
> I propose that we work on setting up a straight forward test harness 
> that allows developers to quickly test a kernel patch against 
> various performance yardsticks.

[...
(proposed large server testbeds)
...]

I like this idea, but could the testbeds also include some 
resource-constrained "embedded" or appliance-style systems, and
include performance yardsticks for latency and responsiveness?

It would be unfortunate if work on a revised scheduler resulted
in Linux being a monster web server on 4-way systems, but having
lousy interactive performance on web pads, hand helds, and set top
boxes.  

How about a 120 MHz Pentium with 32 MB of RAM and a flash disk, or a
200 MHz PowerPC with 64 MB?  Maybe a Transmeta web pad?

For the process load, something that tests responsiveness and 
latency - how about a set of processes doing multicast network 
transfers, CPU-intensive MPEG video and audio encode / decode,
and a test app that measures "user response", i.e. if X Windows 
was running, would the mouse pointer move smoothly or jerkily?

The "better" scheduler for these applications would make sure that 
multicast packets were never dropped, the MPEG decode never dropped 
frames, and the "user interface" stayed responsive, etc.

Also, I would not mind a bit if the kernel had tuning options, either 
in configuration or through /proc to adjust for these very different
situations.

Torrey Hoffman
Torrey.Hoffman@myrio.com


* Re: a quest for a better scheduler
@ 2001-04-05 23:01 Hubertus Franke
  0 siblings, 0 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-05 23:01 UTC (permalink / raw)
  To: Timothy D. Witham; +Cc: Linux Kernel List


Excellent idea.

Assuming that you have set up these environments already,
what would be a real treat is to get the average runqueue
length at a given time, for instance every second or so, while
running some of these more sophisticated server oriented applications
that you mention.

From that we can see which of the various assumptions regarding
runqueue length is holding up when the runqueue is not empty.
This would help the current discussion tremendously, as we don't
seem to be able to even agree on this.

A simple bincount would do. If you want a kernel patch that counts on
every scheduling cycle, I'll write it.
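
In the meantime, a crude histogram can be had from user space by sampling
the running/total field of /proc/loadavg once a second; it is coarse
(1-second resolution, instantaneous values), but enough to see whether
#runnable stays near #cpus. Something along these lines:

#include <stdio.h>
#include <unistd.h>

#define MAXBIN 64

int main(void)
{
        long bins[MAXBIN + 1] = { 0 };
        long samples = 0;

        for (;;) {
                double l1, l5, l15;
                int running, total, bin, i;
                FILE *f = fopen("/proc/loadavg", "r");

                if (f && fscanf(f, "%lf %lf %lf %d/%d",
                                &l1, &l5, &l15, &running, &total) == 5) {
                        bin = running > MAXBIN ? MAXBIN : running;
                        bins[bin]++;
                        if (++samples % 60 == 0) {   /* dump once a minute */
                                for (i = 0; i <= MAXBIN; i++)
                                        if (bins[i])
                                                printf("runnable=%2d  count=%ld\n",
                                                       i, bins[i]);
                                fflush(stdout);
                        }
                }
                if (f)
                        fclose(f);
                sleep(1);
        }
}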


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



"Timothy D. Witham" <wookie@osdlab.org>@vger.kernel.org on 04/05/2001
06:38:41 PM

Sent by:  linux-kernel-owner@vger.kernel.org


To:   Linux Kernel List <linux-kernel@vger.kernel.org>
cc:   wookie@osdlab.org
Subject:  Re: a quest for a better scheduler




  I have been following this thread, and I think everybody has some truth
in what they are saying, but in the absence of a repeatable test
environment there really isn't a way of arriving at a data-driven
decision.

Given the following conditions:

1) The diversity of the problem sets that developers are working on
   results in a real concern that another developer's solution to their
   performance issue might result in a worsening of my performance
   situation.

2) Most developers have a given set of tests that they use when they make
   changes, to determine if their change did what they wanted.

3) The Open Source Development Lab has the facilities for providing a
   large-scale testing environment on several different configurations
   that remain the same over time.

  I propose that we work on setting up a straightforward test harness that
allows developers to quickly test a kernel patch against various
performance yardsticks.
If we use the same set of hardware for the generation of the performance
data from patch to patch, there will be a correlation between the runs,
allowing for a real comparison of the different changes.

The following should be taken only as a starting point.

  As for the base hardware configurations I propose:
     2 way with 1 GB main memory and 2 IDE drives
     4 way with 4 GB main memory and 16 disk drives
     8 way with 8 GB main memory and 24 disk drives

  The types of applications that I have heard people express concern are:

     Web infrastructure:
          Apache
          TUX
          Jabber

     Corporate infrastructure:
          NFS
          raw TCP/IP performance
          Samba

     Database performance:
          Raw storage I/O performance
           OLTP workload

     General usage:
          compile speed (usually measured by kernel compile)

   The OSDL also has the ability to make some of the "for fee" benchmarks
available for testing that is done onsite (network access to OSDL
machines counts as onsite), so that people don't have to purchase
individual copies.  Things like SPECmail2001, SPECsfs2.0 and SPECweb99
come to mind.

Comments, suggestions, volunteers?

--
Timothy D. Witham - Lab Director - wookie@osdlab.org
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office)    (503)-702-2871     (cell)
(503)-626-2455     (fax)

On Wed, Apr 04, 2001 at 03:54:54PM -0700, Christopher Smith wrote:
> --On Wednesday, April 04, 2001 15:16:32 -0700 Tim Wright <timw@splhi.com>
> wrote:
> > On Wed, Apr 04, 2001 at 03:23:34PM +0200, Ingo Molnar wrote:
> >> nope. The goal is to satisfy runnable processes in the range of
> >> NR_CPUS. You are playing word games by suggesting that the current
> >> behavior prefers 'low end'. 'thousands of runnable processes' is not
> >> 'high end' at all, it's 'broken end'. Thousands of runnable processes
> >> are the sign of a broken application design, and 'fixing' the
> >> scheduler to perform better in that case is just fixing the symptom.
> >> [changing the scheduler to perform better in such situations is
> >> possible too, but all solutions proposed so far had strings attached.]
> >
> > Ingo, you continue to assert this without giving much evidence to back
> > it up. All the world is not a web server. If I'm running a large OLTP
> > database with thousands of clients, it's not at all unreasonable to
> > expect periods where several hundred (forget the thousands) want to be
> > serviced by the database engine. That sounds like hundreds of
> > schedulable entities, be they processes or threads or whatever. This
> > sort of load is regularly run on machines with 16-64 CPUs.
>
> Actually, it's not just OLTP; anytime you are doing time sharing between
> hundreds of users (something POSIX systems are supposed to be good at)
> this will happen.
>
> > Now I will admit that it is conceivable that you can design an
> > application that finds out how many CPUs are available, creates threads
> > to match that number and tries to divvy up the work between them using
> > some combination of polling and asynchronous I/O etc. There are,
> > however, a number of problems with this approach:
>
> Actually, one way to semi-support this approach is to implement
> many-to-many threads as per the Solaris approach. This also requires
> significant hacking of both the kernel and the runtime, and certainly is
> significantly more error prone than trying to write a flexible scheduler.
>
> One problem you didn't highlight, which even the above case does not
> happily address, is that for security reasons you may very well need each
> user's requests to take place in a different process. If you don't, then
> you have to implement a very well tested and secure user-level security
> mechanism to ensure things like privacy (above and beyond the
> time-sharing).
>
> The world is filled with a wide variety of types of applications, and
> unless you know two programming approaches are functionally equivalent
> (and event driven/polling I/O vs. tons of running processes are NOT), you
> shouldn't say one approach is "broken". You could say it's a "broken"
> approach to building web servers. Unfortunately, things like kernels and
> standard libraries should work well in the general case.
>
> --Chris

--
Timothy D. Witham - Lab Director - wookie@osdlab.org
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office)    (503)-702-2871     (cell)
(503)-626-2455     (fax)





* Re: a quest for a better scheduler
@ 2001-04-04 19:06 Hubertus Franke
  0 siblings, 0 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-04 19:06 UTC (permalink / raw)
  To: Mark Hahn; +Cc: Linux Kernel List, lse-tech


I give you a concrete example:

Running DB2 on an SMP system.

In DB2 there is a process/thread pool that is sized based on memory and
the number of cpus. People tell me that the size of this pool is on the
order of 100s for an 8-way system with a reasonably sized database. This
<maxagents> setting determines the number of agents that can
simultaneously execute an SQL statement.

Requests are flying in for transactions (e.g. driven by TPC-W like
applications). Agents are grabbed from the pool and concurrently fire off
the SQL transactions.
Assuming that there is enough concurrency in the database, there is no
reason to believe that the majority of those active agents is not
effectively running. TPC-W loads have been observed with 100s of active
transactions at a time.

Of course, limiting the number of agents would reduce the number of
concurrently running tasks, but it would also limit the responsiveness of
the system. Implementing a database in the kernel a la TUX doesn't seem to
be the right approach either (complexity, fault containment, ...).

Hope that is one example people accept.

I can dig up some information on WebSphere Applications.

I'd love to hear from some other applications that fall into a similar
category as the above and substantiate a bit the need for 100s of running
processes, without claiming that the application is broken.

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Mark Hahn <hahn@coffee.psychology.mcmaster.ca> on 04/04/2001 02:28:42 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject:  Re: a quest for a better scheduler



> ok if the runqueue length is limited to a very small multiple of the
> #cpus. But that is not what high end server systems encounter.

do you have some example of this in mind?  so far, no one has
actually produced an example of a "high end" server that has
long runqueues.





* Re: a quest for a better scheduler
@ 2001-04-04 17:17 Hubertus Franke
  0 siblings, 0 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-04 17:17 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel List


Well put, this is how we can eliminate searching all bins or lists, and
that's how we do it in
http://lse.sourceforge.net/scheduling/2.4.1-pre8-prioSched.

If you have a list per priority level, then you can even pick the first
one you find if it's on the same level. That's what I tried in a more
recent implementation, combined with using a bitmask to represent the
non-empty lists.
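
Roughly, the picture I have in mind (a simplified user-space sketch, not
the code from the patch above):

#define NR_LEVELS 32

struct task_list {
        struct task_list *next;
        /* task fields omitted in this sketch */
};

struct prio_rq {
        unsigned long bitmap;               /* bit i set => level i non-empty */
        struct task_list *level[NR_LEVELS]; /* one list per goodness level */
};

static struct task_list *pick_next(struct prio_rq *rq)
{
        struct task_list *p;
        int lvl;

        if (!rq->bitmap)
                return 0;                   /* nothing runnable */

        /* highest set bit = highest non-empty priority level */
        for (lvl = NR_LEVELS - 1; lvl >= 0; lvl--)
                if (rq->bitmap & (1UL << lvl))
                        break;

        /* any task on this level has the same goodness level, so take
         * the head instead of scanning the whole runqueue */
        p = rq->level[lvl];
        rq->level[lvl] = p->next;
        if (!rq->level[lvl])
                rq->bitmap &= ~(1UL << lvl);
        return p;
}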

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Davide Libenzi <davidel@xmailserver.org> on 04/04/2001 12:12:54 PM

Sent by:  davidel@ewok.dev.mycio.com


To:   Ingo Molnar <mingo@elte.hu>
cc:   Linus Torvalds <torvalds@transmeta.com>, Alan Cox
      <alan@lxorguk.ukuu.org.uk>, Linux Kernel List
      <linux-kernel@vger.kernel.org>, Hubertus Franke/Watson/IBM@IBMUS,
      Mike Kravetz <mkravetz@sequent.com>, Fabio Riccardi
      <fabio@chromium.com>
Subject:  Re: a quest for a better scheduler




On 04-Apr-2001 Ingo Molnar wrote:
>
> On Tue, 3 Apr 2001, Fabio Riccardi wrote:
>
>> I've spent my afternoon running some benchmarks to see if MQ patches
>> would degrade performance in the "normal case".
>
> no doubt priority-queue can run almost as fast as the current scheduler.
> What i'm worried about is the restriction of the 'priority' of processes,
> it cannot depend on previous processes (and other 'current state')
> anymore.
>
> so we have two separate issues:
>
>#1: priority-queue: has the fundamental goodness() design limitation.
>
>#2: per-CPU-runqueues: changes semantics, makes scheduler less
>     effective due to nonglobal decisions.
>
> about #1: while right now the prev->mm rule appears to be a tiny issue
> (it might not affect performance significantly), forbidding it by
> hardcoding the assumption into data structures is a limitation of
> *future* goodness() functions. Eg. with the possible emergence of
> CPU-level threading and other, new multiprocessing technologies, this
> could be a *big* mistake.

This is not correct, Ingo. I haven't seen the HP code, but if You store
processes in slots S:

S = FS( goodness(p, p->processor, p->mm) )

and You start scanning from the higher slots, then as soon as You find a
task with a goodness G' that is equal to the max goodness of the slot, You
can choose that process to run.
Again, if You haven't found such a goodness during the slot scan, but
You've found a task with a goodness G':

G' >= SG - DD

where:

SG = max slot goodness
DD = SG(i) - SG(i - 1)

You can select that task as the next to spin.
This was the behaviour implemented in my scheduler patch about 2 years ago.
Besides this, I think that with such loads we have more serious problems
to face inside the kernel.
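
To make that concrete, here is a rough model of the slot-scan rule (the
names are mine, not code from that old patch):

#define NSLOTS 32

struct slot_task {
        struct slot_task *next;
        int goodness;
};

/* sg[i] is the max goodness of slot i; dd = sg[i] - sg[i-1] */
static struct slot_task *pick_from_slots(struct slot_task *slot[NSLOTS],
                                         const int sg[NSLOTS], int dd)
{
        int i;

        for (i = NSLOTS - 1; i >= 0; i--) {
                struct slot_task *p, *best = 0;

                for (p = slot[i]; p; p = p->next) {
                        if (p->goodness == sg[i])
                                return p;   /* can't do better in this slot */
                        if (p->goodness >= sg[i] - dd &&
                            (!best || p->goodness > best->goodness))
                                best = p;
                }
                if (best)
                        return best;        /* within DD of the slot maximum */
        }
        return 0;                           /* nothing runnable */
}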



- Davide





* Re: a quest for a better scheduler
@ 2001-04-04 15:36 Hubertus Franke
  0 siblings, 0 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-04 15:36 UTC (permalink / raw)
  To: mingo; +Cc: Linux Kernel List, lse-tech



You imply that high end means thousands of processes, simply because we
have shown that in our graphs as an asymptotic end.
No, it could mean 5*#cpus, and that is not at all absurd.
This could happen with a spike in demand.

TUX is not the greatest example to use, because it does static webpage
serving and is hence tied into the file cache. If you move up the food
chain, where middleware is active and things are a bit more complicated
than sending stuff out of the filecache, having a bunch of threads hanging
around to deal with a spike in demand is common practice, although you
think it's bad.

Now coming again to MQ (forget about priority list from
now on).

When we scan either the local or the realtime queues we do use the
goodness value. So we have the same flexibility as the current scheduler
when it comes to goodness() flexibility and future improvements.

For remote stealing we use na_goodness to compare for a better process to
steal. Hence we would lose the "+1" information here; nevertheless, once a
decision has been made, we still use preemption verification with
goodness.
Either way, being off by "+1" for regular tasks once in a while is no big
deal, because this problem already exists today.
While walking the runqueue, another processor can either update the
counter value of a task (ok, that's covered by can_schedule) or can run
recalculate, which makes the decision one is about to make wrong from the
point of view of always running the best. But that's ok, because counter,
nice etc. are approximations anyway. Throw in PROC_CHANGE_PENALTY and you
have a few knobs that are used to control interactivity and throughput.

If you look at some of the results with our reflex benchmark: for low
thread counts we basically show improved performance on the 2, 4, 5, 6, 7
and 8 way systems; with #threads < #cpus they all show improvements. Look
at the following numbers from the reflex benchmark; the leftmost column is
the number of threads, with typically 1/2 of them runnable.
You can clearly see that the priority list suffers from overhead, but MQ
is beating vanilla pretty much everywhere. Again, this is due to rapid
scheduler invocation and the resulting lock contention.

          2-way
     2.4.1    2.4.1-mq1  2.4.1-prbit
4    6.725    4.691      7.387
8    6.326    4.766      6.421
12   6.838    5.233      6.431
16   7.13     5.415      7.29

          4-way
     2.4.1    2.4.1-mq1  2.4.1-prbit
4    9.42     7.873      10.592
8    8.143    3.799      8.691
12   7.877    3.537      8.101
16   7.688    2.953      7.144

          6-way
     2.4.1    2.4.1-mq1  2.4.1-prbit
4    9.595    7.88       10.358
8    9.703    7.278      10.523
12   10.016   4.652      10.985
16   9.882    3.629      10.525

          8-way
     2.4.1    2.4.1-mq1  2.4.1-prbit
4    9.804    8.033      10.548
8    10.436   5.783      11.475
12   10.925   6.787      11.646
16   11.426   5.048      11.877
20   11.438   3.895      11.633
24   11.457   3.304      11.347
28   11.495   3.073      11.09
32   11.53    2.944      10.898


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Ingo Molnar <mingo@elte.hu> on 04/04/2001 09:23:34 AM

Please respond to <mingo@elte.hu>

Sent by:  <mingo@elte.hu>


To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   Mike Kravetz <mkravetz@sequent.com>, Fabio Riccardi
      <fabio@chromium.com>, Linux Kernel List
      <linux-kernel@vger.kernel.org>
Subject:  Re: a quest for a better scheduler




On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS.
You are playing word games by suggesting that the current behavior prefers
'low end'. 'thousands of runnable processes' is not 'high end' at all,
it's 'broken end'. Thousands of runnable processes are the sign of a
broken application design, and 'fixing' the scheduler to perform better in
that case is just fixing the symptom. [changing the scheduler to perform
better in such situations is possible too, but all solutions proposed so
far had strings attached.]

     Ingo





* Re: a quest for a better scheduler
@ 2001-04-04 15:28 Hubertus Franke
  0 siblings, 0 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-04 15:28 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux Kernel List, lse-tech


Yes, Andrea.

We actually already went a step further. We treat the scheduler
as a single entity, rather than splitting it up.

Based on the MQ scheduler we do the balancing across all nodes
at reschedule_idle time. We experimented to see whether only looking
for idle tasks remotely is a good idea, but it bloats the code.
Local scheduling decisions are limited to a set of cpus, which
could coincide with the cpus of one node, or if desirable on
smaller granularities.

In addition we implemented a global load balancing scheme that
ensures that load is equally distributed across all run queues.
This is made a loadable module, so you can plug in whatever you want.

I grant that in NUMA it might actually be desirable to separate the
schedulers completely (we can do that trivially in reschedule_idle),
but you need load balancing at some point.

Here is the list of patches:

MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched
Pooling Extension:    http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool
LoadBalancing:
http://lse.sourceforge.net/scheduling/LB/loadbalance.c


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Andrea Arcangeli <andrea@suse.de> on 04/04/2001 11:08:47 AM

To:   Ingo Molnar <mingo@elte.hu>
cc:   Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz
      <mkravetz@sequent.com>, Fabio Riccardi <fabio@chromium.com>, Linux
      Kernel List <linux-kernel@vger.kernel.org>,
      lse-tech@lists.sourceforge.net
Subject:  Re: a quest for a better scheduler



On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does a exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. this is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs efficiently. Agressively NUMA systems need a different
> approach anyway in many other subsystems too, Kanoj is doing some
> scheduler work in that area.

I didn't see anything from Kanoj, but I did something myself for the
wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

This is mostly a userspace issue, not really intended as a kernel
optimization (however it's also partly a kernel optimization). Basically
it splits the load of the numa machine into per-node load; there can be
unbalanced load across the nodes, but fairness is guaranteed inside each
node. It's not extremely well tested, but benchmarks were ok and it is at
least certainly stable.

However Ingo, consider that on a 32-way, if you don't have at least 32
tasks running all the time _always_, you're really stupid paying such big
money for nothing ;). So the fact that the scheduler is optimized for 1/2
tasks running all the time is not nearly enough for those machines (and of
course the scheduling rate also automatically increases linearly with the
number of cpus). Now it's perfectly fine that we don't ask the embedded
and desktop guys to pay for that, but a kernel configuration option to
select an algorithm that scales would be a good idea IMHO. The above patch
just adds a CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and
nobody will be hurt by that (CONFIG_NUMA_SCHED cannot even be selected by
x86 compiles).

Andrea




* Re: a quest for a better scheduler
@ 2001-04-04 13:43 Hubertus Franke
  2001-04-04 13:25 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Hubertus Franke @ 2001-04-04 13:43 UTC (permalink / raw)
  To: Ingo Molnar, Mike Kravetz; +Cc: Fabio Riccardi, Linux Kernel List, lse-tech



This is an important point that Mike is raising, and it also addresses a
critique that Ingo issued yesterday, namely interactivity and fairness.
The HP scheduler completely separates the per-CPU runqueues and does not
take preemption goodness or the like into account. This can lead to unfair
apportionment of CPU cycles, strong priority inversion and a potential
lack of interactivity.

Our MQ scheduler does yield the same decision in most cases
(other than as determined by some race condition on locks and counter
members).
It is not clear that yielding the same decision as the current scheduler
is the ultimate goal to shoot for, but it allows comparison.

Another point to raise is that the current scheduler does an exhaustive
search for the "best" task to run. It touches every process in the
runqueue. This is ok if the runqueue length is limited to a very small
multiple of the #cpus. But that is not what high end server systems
encounter.

With the rising number of processors, lock contention can quickly become
a bottleneck. If we assume that load (#running-tasks) increases somewhat
linearly with #cpus, the problem gets even worse, because not only have I
increased the number of clients but also the lock hold time.

Clinging to the statement that #running-tasks ~ #cpus of course saves
you from that dilemma, but not everybody is signing on to this limitation.

MQ and priority-list help in 2 ways.

MQ reduces the average lock hold time because on average it only inspects
#running-tasks/#cpus tasks to make a local decision. It then goes on to
inspect (#cpus-1) data structures representing the next best tasks to run
on those remote cpus, all without holding the lock, thus substantially
reducing lock contention. Note that we still search the entire set of
runnable tasks, but we do it in a distributed, collaborative manner.
The only time we deviate from the current scheduler decision is in the
case when two cpus have identified the same remote task as a target for
remote stealing. In that case one will win and the other cpu will continue
looking somewhere else, although there might have been other tasks on
that cpu to steal.
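
As a rough sketch of that decision path (illustrative only; the names are
mine and the real code is in the MQ patch):

#define NR_CPUS 8

struct cpu_rq {
        /* each cpu's lock protects only its own queue */
        int best_na_goodness;   /* advertised "next best" on that cpu */
        /* runqueue lists omitted in this sketch */
};

static struct cpu_rq runqueues[NR_CPUS];

static int pick_cpu_to_steal_from(int this_cpu, int local_best)
{
        int cpu, victim = -1, best = local_best;

        /* read the remote values without taking the remote locks; they
         * may be slightly stale, which is the "off by +1 once in a
         * while" case discussed above */
        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (cpu == this_cpu)
                        continue;
                if (runqueues[cpu].best_na_goodness > best) {
                        best = runqueues[cpu].best_na_goodness;
                        victim = cpu;
                }
        }
        return victim;          /* -1 means the local choice stands */
}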

Priority list schedulers (PRS) only help in reducing lock hold time,
which can give some relief with respect to lock contention, but not a
whole lot. PRS can limit the lists it has to search based on the
PROC_CHANGE_PENALTY. It also keeps 0-counter tasks in a list that is never
inspected. One can even go further and put YIELD tasks in a separate list,
given that sys_sched_yield already does some optimizations.
The older version (12/00) posted on LSE is functionally equivalent to the
current scheduler.
I will put up another version this week that is based on a bitmask and
which is a bit more aggressive in that it ignores the MM goodness boost of
1, which in my book is merely a tie breaker between two tasks of equal
goodness.

Beyond that, we have done some work on cpu pooling, which is to identify
a set of cpus that form a scheduling set. We still consider all cpus for
preemption in reschedule_idle, but in schedule it is sufficient to
schedule only within the own set. That again can limit lock hold time
with MQ, and we have seen some improvements. We also deploy load
balancing.

To summarize, we have taken great care to preserve the current scheduler
semantics and have laid out a path to relax some of those semantics for
further improvements.
I don't believe that the HP scheduler is sufficient, since it lacks load
balancing, which naturally occurs in our MQ scheduler, and it does not
address the interactivity requirements that Ingo pointed out.

Most of these things are discussed in great detail in the writeups under
lse.sourceforge.net/scheduling. I also believe we show there that the
MQ performance for low thread counts matches the vanilla case.

I further don't understand the obsession with keeping the scheduler
simple. If there are improvements, and I don't believe the MQ is all that
complicated and it's well documented, why not put it in?


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Mike Kravetz <mkravetz@sequent.com> on 04/03/2001 10:47:00 PM

To:   Fabio Riccardi <fabio@chromium.com>
cc:   Mike Kravetz <mkravetz@sequent.com>, Ingo Molnar <mingo@elte.hu>,
      Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List
      <linux-kernel@vger.kernel.org>, Alan Cox <alan@lxorguk.ukuu.org.uk>
Subject:  Re: a quest for a better scheduler



On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
>
> I have measured the HP and not the "scalability" patch because the two do
> more or less the same thing and give me the same performance advantages,
> but the former is a lot simpler and I could port it with no effort on any
> recent kernel.

Actually, there is a significant difference between the HP patch and
the one I developed.  In the HP patch, if there is a schedulable task
on the 'local' (current CPU) runqueue it will ignore runnable tasks on
other (remote) runqueues.  In the multi-queue patch I developed, the
scheduler always attempts to make the same global scheduling decisions
as the current scheduler.

--
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center




* Re: a quest for a better scheduler
@ 2001-04-04  6:36 alad
  0 siblings, 0 replies; 44+ messages in thread
From: alad @ 2001-04-04  6:36 UTC (permalink / raw)
  To: Fabio Riccardi; +Cc: Alan Cox, linux-kernel



If we are facing these problems in the "normal case", then one hopes that
Solaris is handling it!

Amol





Fabio Riccardi <fabio@chromium.com> on 04/04/2001 07:03:57 AM

To:   Alan Cox <alan@lxorguk.ukuu.org.uk>
cc:   linux-kernel@vger.kernel.org (bcc: Amol Lad/HSS)

Subject:  Re: a quest for a better scheduler




Alan,

for the "normal case" performance see my other message.

I agree that a better threading model would surely help in a web server,
but to me this is not an excuse to put up with a broken scheduler.

The X15 server I'm working on now is a sort of user-space TUX, it uses only 8
threads per CPU and it achieves the same level of performance of the kernel
space TUX. Even in this case the performance advantage of the multiqueue
scheduler is remarkable, especially on a multi-CPU (> 2) architecture.

To achieve some decent disk/CPU/network overlapping with the current linux
blocking disk IO limitations there is no way to avoid a "bunch of server
threads". I've (experimentally) figured out that 8-16 threads per CPU can
assure some reasonable overlapping, depending on the memory size and disk
subsystem speed. On a 8-way machine this means 64-128 active tasks, a total
disaster with the current scheduler.

Unless we want to maintain the position that the only way to achieve good
performance is to embed server applications in the kernel, some minimal
help should be provided to goodwilling user applications :)

TIA, ciao,

 - Fabio

Alan Cox wrote:

> > Is there any special reason why any of those patches didn't make it to
> > the mainstream kernel code?
>
> All of them are worse for the normal case. Also 1500 running apache's isnt
> a remotely useful situation, you are thrashing the cache even if you are now
> not thrashing the scheduler. Use an httpd designed for that situation. Then
> you can also downgrade to a cheap pentium class box for the task ;)






* a quest for a better scheduler
@ 2001-04-03  2:23 Fabio Riccardi
  2001-04-03  8:55 ` Ingo Molnar
  2001-04-03 12:31 ` Alan Cox
  0 siblings, 2 replies; 44+ messages in thread
From: Fabio Riccardi @ 2001-04-03  2:23 UTC (permalink / raw)
  To: linux-kernel

Hello,

I sent a message a few days ago about some limitations I found in the
linux scheduler.

In servers like Apache, where a large (> 1000) number of processes can be
running at the same time and where many of them are runnable at the same
time, the default Linux scheduler just starts thrashing and the machine
becomes very rapidly unusable.

Performance degradation is quite noticeable on a two-way SMP machine
(20-30% of the CPU gets lost) and is even more glaring on a multi-cpu
machine. As an example, an 8-way Compaq Proliant just crawls with Linux.

From the feedback I received I realized that there are at least two
possible solutions to the problem:

    http://lse.sourceforge.net/scheduling/

    http://resourcemanagement.unixsolutions.hp.com/WaRM/schedpolicy.html

Indeed I've tried the patches available on the sites for the multi-queue
scheduler and I was amazed by the performance improvement that I got.
Both patches allow me to get to a 100% real CPU utilization on a two way
machine running ~1500 processes.

What those patches do is quite simple, instead of having the single
global process queue present in the normal Linux scheduler, they add
multiple queues (one per CPU). In this way the scheduling decision can
be greatly simplified and almost made local to each CPU. No hotspots, no
global locks (well, almost).

Although some scalability problems are still there (there still is a
global decision to make), the performance improvement obtained and the
simplicity of the solution are remarkable.

The HP patch is probably the most interesting, since it consists of
really a few lines of code and it gets (for what I could measure) the
same kind of performance improvement of the more elaborate (but still
quite simple) sourceforge patch.

Is there any special reason why any of those patches didn't make it to
the mainstream kernel code?

TIA, ciao,

 - Fabio




Thread overview: 44+ messages
2001-04-04 14:03 a quest for a better scheduler Hubertus Franke
2001-04-04 13:23 ` Ingo Molnar
2001-04-04 22:16   ` Tim Wright
2001-04-04 22:54     ` Christopher Smith
2001-04-05 22:38       ` Timothy D. Witham
2001-04-06  3:27         ` Christopher Smith
2001-04-06 18:06         ` Timothy D. Witham
2001-04-06 21:08           ` Michael Peddemors
2001-04-06 22:33           ` Nathan Straz
2001-04-04 15:12 ` Andrea Arcangeli
2001-04-04 15:49   ` Khalid Aziz
  -- strict thread matches above, loose matches on Subject: below --
2001-04-18 14:50 Yoav Etsion
2001-04-06 13:15 Hubertus Franke
2001-04-05 23:53 Torrey Hoffman
2001-04-05 23:01 Hubertus Franke
2001-04-04 19:06 Hubertus Franke
2001-04-04 17:17 Hubertus Franke
2001-04-04 15:36 Hubertus Franke
2001-04-04 15:28 Hubertus Franke
2001-04-04 13:43 Hubertus Franke
2001-04-04 13:25 ` Ingo Molnar
2001-04-04 13:34 ` Ingo Molnar
2001-04-04 15:08   ` Andrea Arcangeli
2001-04-04 15:44 ` Khalid Aziz
2001-04-04  6:36 alad
2001-04-03  2:23 Fabio Riccardi
2001-04-03  8:55 ` Ingo Molnar
2001-04-03 19:13   ` Mike Kravetz
2001-04-03 18:47     ` Ingo Molnar
2001-04-03 22:43       ` Mike Kravetz
2001-04-04  0:18         ` Fabio Riccardi
2001-04-04  2:47           ` Mike Kravetz
2001-04-04  4:21             ` Fabio Riccardi
2001-04-04 17:27               ` Mike Kravetz
2001-04-04  6:53           ` Ingo Molnar
2001-04-04 16:12             ` Davide Libenzi
2001-04-04  6:28         ` Ingo Molnar
2001-04-03 12:31 ` Alan Cox
2001-04-04  0:33   ` Fabio Riccardi
2001-04-04  0:35     ` Alan Cox
2001-04-04  1:17       ` Fabio Riccardi
2001-04-04  1:50         ` Christopher Smith
2001-04-04 11:57       ` Ingo Molnar
2001-04-04 11:51     ` Ingo Molnar
