[ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler

All of lore.kernel.org
 help / color / mirror / Atom feed

* [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
@ 2007-03-04  7:00 Con Kolivas
  2007-03-04  7:45 ` [ck] " Con Kolivas
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Con Kolivas @ 2007-03-04  7:00 UTC (permalink / raw)
  To: linux kernel mailing list, ck list

This message is to announce the first general public release of the "Rotating 
Staircase DeadLine" cpu scheduler. 

Based on previous work from the staircase cpu scheduler I set out to design, 
from scratch, a new scheduling policy design which satisfies every 
requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task management.

Available for download are:

 A full rollup of the patch for 2.6.20:
http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch

 Split patches for 2.6.20(which will follow this email):
http://ck.kolivas.org/patches/staircase-deadline/split-out/

 The readme (which will also constitute the rest of this email):
http://ck.kolivas.org/patches/staircase-deadline/rsdl_scheduler.readme

The following readme is also included as documentation in 
Documentation/sched-design.txt

Rotating Staircase Deadline cpu scheduler policy
================================================

Design summary
==============

A novel design which incorporates a foreground-background descending priority
system (the staircase) with runqueue managed minor and major epochs (rotation
and deadline).

Features
========

A starvation free, strict fairness O(1) scalable design with interactivity
as good as the above restrictions can provide. There is no interactivity
estimator, no sleep/run measurements and only simple fixed accounting.
The design has strict enough a design and accounting that task behaviour
can be modelled and maximum scheduling latencies can be predicted by
the virtual deadline mechanism that manages runqueues. The prime concern
in this design is to maintain fairness at all costs determined by nice level,
yet to maintain as good interactivity as can be allowed within the
constraints of strict fairness.

Design description
==================

RSDL works off the principle of providing each task a quota of runtime that
it is allowed to run at each priority level equal to its static priority
(ie. its nice level) and every priority below that. When each task is queued,
the cpu that it is queued onto also keeps a record of that quota. If the
task uses up its quota it is decremented one priority level. Also, if the cpu
notices a quota full has been used for that priority level, it pushes
everything remaining at that priority level to the next lowest priority
level. Once every runtime quota has been consumed of every priority level,
a task is queued on the "expired" array. When no other tasks exist with
quota, the expired array is activated and fresh quotas are handed out. This
is all done in O(1).

Design details
==============

Each cpu has its own runqueue which micromanages its own epochs, and each
task keeps a record of its own entitlement of cpu time. Most of the rest
of these details apply to non-realtime tasks as rt task management is
straight forward.

Each runqueue keeps a record of what major epoch it is up to in the
rq->prio_rotation field which is incremented on each major epoch. It also
keeps a record of quota available to each priority value valid for that
major epoch in rq->prio_quota[].

Each task keeps a record of what major runqueue epoch it was last running
on in p->rotation. It also keeps a record of what priority levels it has
already been allocated quota from during this epoch in a bitmap p->bitmap.

The only tunable that determines all other details is the RR_INTERVAL. This
is set to 6ms (minimum on 1000HZ, higher at different HZ values).

All tasks are initially given a quota based on RR_INTERVAL. This is equal to
RR_INTERVAL between nice values of 0 and 19, and progressively larger for
nice values from -1 to -20. This is assigned to p->quota and only changes
with changes in nice level.

As a task is first queued, it checks in recalc_task_prio to see if it has
run at this runqueue's current priority rotation. If it has not, it will
have its p->prio level set to equal its p->static_prio (nice level) and will
be given a p->time_slice equal to the p->quota, and has its allocation
bitmap bit set in p->bitmap for its static priority (nice value). This
quota is then also added to the current runqueue's rq->prio_quota[p->prio].
It is then queued on the current active priority array.

If a task has already been running during this major epoch, if it has
p->time_slice left and the rq->prio_quota for the task's p->prio still
has quota, it will be placed back on the active array, but no more quota
will be added to either the task or the runqueue quota.

If a task has been running during this major epoch, but does not have
p->time_slice left or the runqueue's prio_quota for this task's p->prio
does not have quota, it will find the next lowest priority in its bitmap
that it has not been allocated quota from. It then gets the a full quota
in p->time_slice and adds that to the quota value for the relevant priority
rq->prio_quota. It is then queued on the current active priority array at
the newly determined lower priority.

If a task has been running during this major epoch, and does not have
any entitlement left in p->bitmap and no time_slice left, it will have its
bitmap cleared, and be queued at its p->static_prio again, but on the expired
priority array. No quota will be allocated until this task is scheduled.

When a task is queued, it has its static_prio bit set in the current
runqueue's rq->static_bitmap, and the relevant bit in the rq->dyn_bitmap.
In order to minimise the number of bitmap lookups, the bitmap of queued
tasks on the expired array is at the end of the same bitmap as the active
array. The number of tasks queued at the current static_prio is kept in
rq->prio_queued[].

During a scheduler_tick where a task is running, the p->time_slice is
decremented, and if it reaches zero then the recalc_task_prio is readjusted
and the task rescheduled.

During a task running tick, the runqueue prio_quota is also decremented. If
it empties then a priority rotation occurs (a major or minor epoch). If the
current runqueue's priority level is better than that of nice 19 tasks, a
minor rotation is performed, otherwise a major rotation will occur.

A minor rotation takes the remaining tasks at this priority level queue and
merges them with a list_splice_tail with the queue from the next lowest
priority level. At this time, any tasks that have been merged will now
have invalid values in p->prio so this must be considered when dequeueing
the task, and for testing for preemption.

A major rotation takes the remaining tasks at this priority level queue and
merges them with a list_splice_tail with the best priority task running on
the expired array, and swaps the priority arrays. The priority quotas are
reset at this time. Any tasks that have been merged will now have invalid
values in p->array and possibly p->prio so this must be considered. The
rq->prio_rotation is incremented at this time.

When a task is dequeued, the dyn_bitmap bit is unset only after testing
that the relevant queue is actually empty since p->prio may be inaccurate
and no hard accounting of the number of tasks at that level is possible.

When selecting a new task for scheduling, after the first dynamic bit is
found on the dyn_bitmap, it is checked to see that a task is really queued
at that priority or if it is a false positive due to the task being
dequeued at a time when its p->prio does not match which queue it is on
after some form of priority rotation. This is a rare occurrence as it tends
to only occur if a task that is already waiting on a runqueue gets dequeued.
If the bitmap value is in the expired array range, a major priority rotation
is performed. If the chosen task has not been running during this major or
minor rotation it has new quota allocated at this time, and added to the
runqueue's quota.

Modelling deadline behaviour
============================

As the accounting in this design is hard and not modified by sleep average
calculations or interactivity modifiers, it is possible to accurately
predict the maximum latency that a task may experience under different
conditions. This is a virtual deadline mechanism enforced by mandatory
runqueue epochs, and not by trying to keep complicated accounting of each
task.

The maximum duration a task can run during one major epoch is determined
by its nice value. Nice 0 tasks can run at 19 different priority levels
for RR_INTERVAL duration during each epoch (the equivalent of nice 0 to nice
19). Nice 10 tasks can run at 9 priority levels for each epoch, and so on.

Therefore the maximum duration a runqueue epoch can take is determined by
the number of tasks running, and their nice level. After that, the maximum
duration it can take before a task can wait before it get scheduled is
determined by the difference between its nice value and the nice value of
the highest priority task queued.

In the following examples, these are _worst case scenarios_ and would rarely
occur, but can be modelled nonetheless to determine the maximum possible
latency.

So for example, if two nice 0 tasks are running, and one has just expired as
another is activated for the first time receiving a full quota for this
runqueue rotation, the first task will wait:

nr_tasks * max_duration + nice_difference * rr_interval
1 * 19 * RR_INTERVAL + 0 = 114ms

In the presence of a nice 10 task, a nice 0 task would wait a maximum of
1 * 10 * RR_INTERVAL + 0 = 60ms

In the presence of a nice 0 task, a nice 10 task would wait a maximum of
1 * 19 * RR_INTERVAL + 9 * RR_INTERVAL = 168ms

Using a more complicated example, if there are 4 tasks running fully cpu
bound, one each at nice -20, nice 0, nice 10 and nice 19, we can calculate
the maximum latency possible for the nice 10 task. Note that -20 tasks are
heavily biased for so this will be a long time, but can be modelled.

The nice -20 task has quota = RR_INTERVAL + 20*RR_INTERVAL = 21*RR_INTERVAL.
It can run at 39 priority levels so its maximum duration =
39 * 21 * RR_INTERVAL.
The nice 0 task works out to
19 * RR_INTERVAL
The nice 19 task works out to
RR_INTERVAL.

So major epoch can take up a maximum of
39 * 21 * RR_INTERVAL + 19 * RR_INTERVAL + RR_INTERVAL = 1229 * RR_INTERVAL;

Then before the nice 10 task will run, the nice -20 and nice 0 task will
run for 28 * 21 * RR_INTERVAL and 9 * RR_INTERVAL respectively for a total
of 597 * RR_INTERVAL.

This means the maximum duration a nice 10 task can wait in the presence of
these other tasks is 1826*RR_INTERVAL. This is a long time of course and is
heavily penalised by the presence of nice -20 tasks which would not be part
of a normal environment.

While this section describes the maximum latency a task can have, this size
latencies will only be seen by fully cpu bound tasks.

Achieving interactivity
=======================

A requirement of this scheduler design was to achieve good interactivity
despite being a completely fair deadline based design. The disadvantage of
designs that try to achieve interactivity is that they usually do so at
the expense of maintaining fairness. As cpu speeds increase, the requirement
for some sort of metered unfairness towards interactive tasks becomes a less
desirable phenomenon, but low latency and fairness remains mandatory to
good interactive performance.

This design relies on the fact that interactive tasks, by their nature,
sleep often. Most fair scheduling designs end up penalising such tasks
indirectly giving them less than their fair possible share because of the
sleep, and have to use a mechanism of bonusing their priority to offset
this based on the duration they sleep. This becomes increasingly inaccurate
as the number of running tasks rises and more tasks spend time waiting on
runqueues rather than sleeping, and it is impossible to tell whether the
task that's waiting on a runqueue only intends to run for a short period and
then sleep again after than runqueue wait. Furthermore, all such designs rely
on a period of time to pass to accumulate some form of statistic on the task
before deciding on how much to give them preference. The shorter this period,
the more rapidly bursts of cpu ruin the interactive tasks behaviour. The
longer this period, the longer it takes for interactive tasks to get low
scheduling latencies and fair cpu.

This design does not measure sleep time at all. Interactive tasks that sleep
often will wake up having consumed very little if any of their quota for
the current major priority rotation. The longer they have slept, the less
likely they are to even be on the current major priority rotation. Once
woken up, though, they get to use up a their full quota for that epoch,
whether part of a quota remains or a full quota. Overall, however, they
can still only run as much cpu time for that epoch as any other task of the
same nice level. This means that two tasks behaving completely differently
from fully cpu bound to waking/sleeping extremely frequently will still
get the same quota of cpu, but the latter will be using its quota for that
epoch in bursts rather than continuously. This guarantees that interactive
tasks get the same amount of cpu as cpu bound ones.

The other requirement of interactive tasks is also to obtain low latencies
for when they are scheduled. Unlike fully cpu bound tasks and the maximum
latencies possible described in the modelling deadline behaviour section
above, tasks that sleep will wake up with quota available usually at the
current runqueue's priority_level or better. This means that the most latency
they are likely to see is one RR_INTERVAL, and often they will preempt the
current task if it is not of a sleeping nature. This then guarantees very
low latency for interactive tasks, and the lowest latencies for the least
cpu bound tasks.

Sunday, 4th March 2007
Con Kolivas

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ck] [ANNOUNCE] RSDL completely fair starvation free interactive  cpu scheduler
  2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
@ 2007-03-04  7:45 ` Con Kolivas
  2007-03-04 14:04   ` Con Kolivas
  2007-03-04 11:08 ` Gene Heskett
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2007-03-04  7:45 UTC (permalink / raw)
  To: ck; +Cc: linux kernel mailing list

On Sunday 04 March 2007 18:00, Con Kolivas wrote:
> This message is to announce the first general public release of the
> "Rotating Staircase DeadLine" cpu scheduler.
>
> Based on previous work from the staircase cpu scheduler I set out to
> design, from scratch, a new scheduling policy design which satisfies every
> requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task
> management.
>
> Available for download are:
>
>  A full rollup of the patch for 2.6.20:
> http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch

It's probably worth mentioning that this scheduler shows a not insignificant 
improvement in the mysql test case that recently received a lot of publicity. 
Why exactly that's the case I'm not sure but it may help track down what is 
actually responsible for the performance drop off as well. Note that the SMP 
balancing of this cpu scheduler is essentially unchanged from the mainline 
one.

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
  2007-03-04  7:45 ` [ck] " Con Kolivas
@ 2007-03-04 11:08 ` Gene Heskett
  2007-03-04 11:47   ` Con Kolivas
  2007-03-05 21:52 ` Con Kolivas
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Gene Heskett @ 2007-03-04 11:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Con Kolivas, ck list

On Sunday 04 March 2007, Con Kolivas wrote:
>This message is to announce the first general public release of the
> "Rotating Staircase DeadLine" cpu scheduler.

I assume to test this, we select the deadline scheduler?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu  scheduler
  2007-03-04 11:08 ` Gene Heskett
@ 2007-03-04 11:47   ` Con Kolivas
  2007-03-04 12:24     ` Gene Heskett
  0 siblings, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2007-03-04 11:47 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, ck list

On Sunday 04 March 2007 22:08, Gene Heskett wrote:
> On Sunday 04 March 2007, Con Kolivas wrote:
> >This message is to announce the first general public release of the
> > "Rotating Staircase DeadLine" cpu scheduler.
>
> I assume to test this, we select the deadline scheduler?

No, only the "deadline" in the name is shared. This is a cpu process scheduler 
whereas the deadline scheduler you're thinking of is an I/O scheduler. To 
test this you just patch it in and it replaces the current mainline cpu 
scheduler (the same way the staircase cpu scheduler in -ck replaces it).

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04 11:47   ` Con Kolivas
@ 2007-03-04 12:24     ` Gene Heskett
  2007-03-04 12:46       ` Con Kolivas
  0 siblings, 1 reply; 28+ messages in thread
From: Gene Heskett @ 2007-03-04 12:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: Con Kolivas, ck list

On Sunday 04 March 2007, Con Kolivas wrote:
>On Sunday 04 March 2007 22:08, Gene Heskett wrote:
>> On Sunday 04 March 2007, Con Kolivas wrote:
>> >This message is to announce the first general public release of the
>> > "Rotating Staircase DeadLine" cpu scheduler.
>>
>> I assume to test this, we select the deadline scheduler?
>
>No, only the "deadline" in the name is shared. This is a cpu process
> scheduler whereas the deadline scheduler you're thinking of is an I/O
> scheduler. To test this you just patch it in and it replaces the
> current mainline cpu scheduler (the same way the staircase cpu
> scheduler in -ck replaces it).

Oh, then I tried to put the -ck1 patch on top of that, and blew the tree 
up.  I'd built it the first time with the deadline scheduler as the 
default while I was waiting on your reply.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu  scheduler
  2007-03-04 12:24     ` Gene Heskett
@ 2007-03-04 12:46       ` Con Kolivas
  2007-03-04 13:25         ` Gene Heskett
  0 siblings, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2007-03-04 12:46 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, ck list

On Sunday 04 March 2007 23:24, Gene Heskett wrote:
> On Sunday 04 March 2007, Con Kolivas wrote:
> >On Sunday 04 March 2007 22:08, Gene Heskett wrote:
> >> On Sunday 04 March 2007, Con Kolivas wrote:
> >> >This message is to announce the first general public release of the
> >> > "Rotating Staircase DeadLine" cpu scheduler.
> >>
> >> I assume to test this, we select the deadline scheduler?
> >
> >No, only the "deadline" in the name is shared. This is a cpu process
> > scheduler whereas the deadline scheduler you're thinking of is an I/O
> > scheduler. To test this you just patch it in and it replaces the
> > current mainline cpu scheduler (the same way the staircase cpu
> > scheduler in -ck replaces it).
>
> Oh, then I tried to put the -ck1 patch on top of that, and blew the tree
> up.  I'd built it the first time with the deadline scheduler as the
> default while I was waiting on your reply.

Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a standalone 
piece of code.

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04 12:46       ` Con Kolivas
@ 2007-03-04 13:25         ` Gene Heskett
  2007-03-04 13:49           ` Con Kolivas
  0 siblings, 1 reply; 28+ messages in thread
From: Gene Heskett @ 2007-03-04 13:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Con Kolivas, ck list

On Sunday 04 March 2007, Con Kolivas wrote:
>On Sunday 04 March 2007 23:24, Gene Heskett wrote:
>> On Sunday 04 March 2007, Con Kolivas wrote:
>> >On Sunday 04 March 2007 22:08, Gene Heskett wrote:
>> >> On Sunday 04 March 2007, Con Kolivas wrote:
>> >> >This message is to announce the first general public release of
>> >> > the "Rotating Staircase DeadLine" cpu scheduler.
>> >>
>> >> I assume to test this, we select the deadline scheduler?
>> >
>> >No, only the "deadline" in the name is shared. This is a cpu process
>> > scheduler whereas the deadline scheduler you're thinking of is an
>> > I/O scheduler. To test this you just patch it in and it replaces the
>> > current mainline cpu scheduler (the same way the staircase cpu
>> > scheduler in -ck replaces it).
>>
>> Oh, then I tried to put the -ck1 patch on top of that, and blew the
>> tree up.  I'd built it the first time with the deadline scheduler as
>> the default while I was waiting on your reply.
>
>Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a
> standalone piece of code.

I just this instant got it booted, what with building a driver for nvidia 
and all.  I'll let you know what I think in a few hours after I've gotten 
a feel for it.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu  scheduler
  2007-03-04 13:25         ` Gene Heskett
@ 2007-03-04 13:49           ` Con Kolivas
  2007-03-04 14:11             ` Gene Heskett
  2007-03-04 14:36             ` Willy Tarreau
  0 siblings, 2 replies; 28+ messages in thread
From: Con Kolivas @ 2007-03-04 13:49 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, ck list

On Monday 05 March 2007 00:25, Gene Heskett wrote:
> On Sunday 04 March 2007, Con Kolivas wrote:
> >On Sunday 04 March 2007 23:24, Gene Heskett wrote:
> >> On Sunday 04 March 2007, Con Kolivas wrote:
> >> >On Sunday 04 March 2007 22:08, Gene Heskett wrote:
> >> >> On Sunday 04 March 2007, Con Kolivas wrote:
> >> >> >This message is to announce the first general public release of
> >> >> > the "Rotating Staircase DeadLine" cpu scheduler.
> >> >>
> >> >> I assume to test this, we select the deadline scheduler?
> >> >
> >> >No, only the "deadline" in the name is shared. This is a cpu process
> >> > scheduler whereas the deadline scheduler you're thinking of is an
> >> > I/O scheduler. To test this you just patch it in and it replaces the
> >> > current mainline cpu scheduler (the same way the staircase cpu
> >> > scheduler in -ck replaces it).
> >>
> >> Oh, then I tried to put the -ck1 patch on top of that, and blew the
> >> tree up.  I'd built it the first time with the deadline scheduler as
> >> the default while I was waiting on your reply.
> >
> >Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a
> > standalone piece of code.
>
> I just this instant got it booted, what with building a driver for nvidia
> and all.  I'll let you know what I think in a few hours after I've gotten
> a feel for it.

Great, thanks.

Just to make it clear. The purpose of this scheduler is at all costs to 
maintain absolute fairness no matter what type of load it is put under. This 
means that if you heavily load up your machine without the use of 'nice' then 
your interactive tasks _will_ slow down proportionately to the amount of cpu 
you use. So doing make -j4 for example will make any other task started in 
taht presence run precisely 1/5th speed, but they will still be responsive, 
have low latency (and audio shouldn't skip for example). 

There will be times when the mainline scheduler feels more interactive than 
this scheduler, and that is because it has significant unfairness granted 
towards interactive tasks. This degree of unfairness in an effort to maintain 
interactivity has been criticised and causes problems in certain environments 
with both loss of fairness, relative starvation and is not entirely 
predictable. 

This was designed to be robust for any application since linux demands a 
general purpose scheduler design, while preserving interactivity, instead of 
optimising for one particular end use.

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive  cpu scheduler
  2007-03-04  7:45 ` [ck] " Con Kolivas
@ 2007-03-04 14:04   ` Con Kolivas
  0 siblings, 0 replies; 28+ messages in thread
From: Con Kolivas @ 2007-03-04 14:04 UTC (permalink / raw)
  To: ck; +Cc: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

On Sunday 04 March 2007 18:45, Con Kolivas wrote:
> On Sunday 04 March 2007 18:00, Con Kolivas wrote:
> > This message is to announce the first general public release of the
> > "Rotating Staircase DeadLine" cpu scheduler.
> >
> > Based on previous work from the staircase cpu scheduler I set out to
> > design, from scratch, a new scheduling policy design which satisfies
> > every requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task
> > management.
> >
> > Available for download are:
> >
> >  A full rollup of the patch for 2.6.20:
> > http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch
>
> It's probably worth mentioning that this scheduler shows a not
> insignificant improvement in the mysql test case that recently received a
> lot of publicity. Why exactly that's the case I'm not sure but it may help
> track down what is actually responsible for the performance drop off as
> well. Note that the SMP balancing of this cpu scheduler is essentially
> unchanged from the mainline one.

Only preliminary other benchmarking has been done so far on RSDL, and so far 
the results are equal to or slightly better than mainline on scalability 
grounds.

These are the sysbench graphs just for completion that Nishanth Aravamudan 
performed. We were actually trying to track down the cause of the mysql 
performance problem and I suggested trying different cpu schedulers as well. 
Needless to say I'm not unhappy with the results although they haven't told 
us exactly where the problem lies but it certainly does add evidence that 
perhaps it is a scheduling issue.

-- 
-ck

[-- Attachment #2: transactions.png --]
[-- Type: image/png, Size: 7950 bytes --]

[-- Attachment #3: idle.png --]
[-- Type: image/png, Size: 6317 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04 13:49           ` Con Kolivas
@ 2007-03-04 14:11             ` Gene Heskett
  2007-03-05  2:31               ` Zwane Mwaikambo
  2007-03-04 14:36             ` Willy Tarreau
  1 sibling, 1 reply; 28+ messages in thread
From: Gene Heskett @ 2007-03-04 14:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Con Kolivas, ck list

On Sunday 04 March 2007, Con Kolivas wrote:
>On Monday 05 March 2007 00:25, Gene Heskett wrote:
>> On Sunday 04 March 2007, Con Kolivas wrote:
>> >On Sunday 04 March 2007 23:24, Gene Heskett wrote:
>> >> On Sunday 04 March 2007, Con Kolivas wrote:
>> >> >On Sunday 04 March 2007 22:08, Gene Heskett wrote:
>> >> >> On Sunday 04 March 2007, Con Kolivas wrote:
>> >> >> >This message is to announce the first general public release of
>> >> >> > the "Rotating Staircase DeadLine" cpu scheduler.
>> >> >>
>> >> >> I assume to test this, we select the deadline scheduler?
>> >> >
>> >> >No, only the "deadline" in the name is shared. This is a cpu
>> >> > process scheduler whereas the deadline scheduler you're thinking
>> >> > of is an I/O scheduler. To test this you just patch it in and it
>> >> > replaces the current mainline cpu scheduler (the same way the
>> >> > staircase cpu scheduler in -ck replaces it).
>> >>
>> >> Oh, then I tried to put the -ck1 patch on top of that, and blew the
>> >> tree up.  I'd built it the first time with the deadline scheduler
>> >> as the default while I was waiting on your reply.
>> >
>> >Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a
>> > standalone piece of code.
>>
>> I just this instant got it booted, what with building a driver for
>> nvidia and all.  I'll let you know what I think in a few hours after
>> I've gotten a feel for it.
>
>Great, thanks.
>
>Just to make it clear. The purpose of this scheduler is at all costs to
>maintain absolute fairness no matter what type of load it is put under.
> This means that if you heavily load up your machine without the use of
> 'nice' then your interactive tasks _will_ slow down proportionately to
> the amount of cpu you use. So doing make -j4 for example will make any
> other task started in taht presence run precisely 1/5th speed, but they
> will still be responsive, have low latency (and audio shouldn't skip
> for example).
>
>There will be times when the mainline scheduler feels more interactive
> than this scheduler, and that is because it has significant unfairness
> granted towards interactive tasks. This degree of unfairness in an
> effort to maintain interactivity has been criticised and causes
> problems in certain environments with both loss of fairness, relative
> starvation and is not entirely predictable.

Well, in 20 minutes of playing, I am so far VERY impressed, the kmail 
composer is typing to the screen in unison to my keystrokes, where with 
the older version it often went away for 10 or more seconds at a time, 
then displayed the last sentence I had typed in one (or 2 sometimes) 
swell foop.  Now I can back up and correct a typo in real time whereas 
before, it was often faster if the typo was half a line back, and the key 
repeat so darned slow it was often over a second per character cell 
moved, to go grab the mouse, position the cursor on the typo, click, wait 
for the indicator to move, and then fix it.  Typing is now pleasurable 
again.  Key repeats seem to remain at the set in the bios key repeat 
speed.  Amazing.  I do believe you have given me back my machine.

I may find something to squawk about in due time, I think that Murphy guys 
laws may have a corollary on that subject, but it sure feels good right 
now.

>This was designed to be robust for any application since linux demands a
>general purpose scheduler design, while preserving interactivity,
> instead of optimising for one particular end use.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu  scheduler
  2007-03-04 13:49           ` Con Kolivas
  2007-03-04 14:11             ` Gene Heskett
@ 2007-03-04 14:36             ` Willy Tarreau
  2007-03-04 16:08               ` [ck] " jos poortvliet
  1 sibling, 1 reply; 28+ messages in thread
From: Willy Tarreau @ 2007-03-04 14:36 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Gene Heskett, linux-kernel, ck list

Hi Con !

On Mon, Mar 05, 2007 at 12:49:49AM +1100, Con Kolivas wrote:
> On Monday 05 March 2007 00:25, Gene Heskett wrote:
> > On Sunday 04 March 2007, Con Kolivas wrote:
> > >On Sunday 04 March 2007 23:24, Gene Heskett wrote:
> > >> On Sunday 04 March 2007, Con Kolivas wrote:
> > >> >On Sunday 04 March 2007 22:08, Gene Heskett wrote:
> > >> >> On Sunday 04 March 2007, Con Kolivas wrote:
> > >> >> >This message is to announce the first general public release of
> > >> >> > the "Rotating Staircase DeadLine" cpu scheduler.
> > >> >>
> > >> >> I assume to test this, we select the deadline scheduler?
> > >> >
> > >> >No, only the "deadline" in the name is shared. This is a cpu process
> > >> > scheduler whereas the deadline scheduler you're thinking of is an
> > >> > I/O scheduler. To test this you just patch it in and it replaces the
> > >> > current mainline cpu scheduler (the same way the staircase cpu
> > >> > scheduler in -ck replaces it).
> > >>
> > >> Oh, then I tried to put the -ck1 patch on top of that, and blew the
> > >> tree up.  I'd built it the first time with the deadline scheduler as
> > >> the default while I was waiting on your reply.
> > >
> > >Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a
> > > standalone piece of code.
> >
> > I just this instant got it booted, what with building a driver for nvidia
> > and all.  I'll let you know what I think in a few hours after I've gotten
> > a feel for it.
> 
> Great, thanks.
> 
> Just to make it clear. The purpose of this scheduler is at all costs to 
> maintain absolute fairness no matter what type of load it is put under. This 
> means that if you heavily load up your machine without the use of 'nice' then 
> your interactive tasks _will_ slow down proportionately to the amount of cpu 
> you use. So doing make -j4 for example will make any other task started in 
> taht presence run precisely 1/5th speed, but they will still be responsive, 
> have low latency (and audio shouldn't skip for example). 
> 
> There will be times when the mainline scheduler feels more interactive than 
> this scheduler, and that is because it has significant unfairness granted 
> towards interactive tasks. This degree of unfairness in an effort to maintain 
> interactivity has been criticised and causes problems in certain environments 
> with both loss of fairness, relative starvation and is not entirely 
> predictable. 
> 
> This was designed to be robust for any application since linux demands a 
> general purpose scheduler design, while preserving interactivity, instead of 
> optimising for one particular end use.

Well, I haven't tested it yet, but your design choices please me. As you
know, I've been one of those encountering big starvation problems with
the original scheduler, making 2.6 unusable for me in many situations. I
welcome your work and want to thank you for the time you spend trying to
fix it.

Keep up the good work,
Willy

PS: I've looked at your graphs, I hope you're on the way to something really
    better than the 21 first 2.6 releases !


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ck] Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04 14:36             ` Willy Tarreau
@ 2007-03-04 16:08               ` jos poortvliet
  2007-03-05 23:05                 ` Bill Davidsen
  0 siblings, 1 reply; 28+ messages in thread
From: jos poortvliet @ 2007-03-04 16:08 UTC (permalink / raw)
  To: ck; +Cc: Willy Tarreau, Con Kolivas, Gene Heskett, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1313 bytes --]

Op Sunday 04 March 2007, schreef Willy Tarreau:
> Hi Con !
> > This was designed to be robust for any application since linux demands a
> > general purpose scheduler design, while preserving interactivity, instead
> > of optimising for one particular end use.
>
> Well, I haven't tested it yet, but your design choices please me. As you
> know, I've been one of those encountering big starvation problems with
> the original scheduler, making 2.6 unusable for me in many situations. I
> welcome your work and want to thank you for the time you spend trying to
> fix it.
>
> Keep up the good work,
> Willy
>
> PS: I've looked at your graphs, I hope you're on the way to something
> really better than the 21 first 2.6 releases !
Well, imho his current staircase scheduler already does a better job compared 
to mainline, but it won't make it in (or at least, it's not likely). So we 
can hope this WILL make it into mainline, but I wouldn't count on it.

grtz

Jos



-- 
Disclaimer:

Alles wat ik doe denk en zeg is gebaseerd op het wereldbeeld wat ik nu heb. 
Ik ben niet verantwoordelijk voor wijzigingen van de wereld, of het beeld wat 
ik daarvan heb, noch voor de daaruit voortvloeiende gedragingen van mezelf. 
Alles wat ik zeg is aardig bedoeld, tenzij expliciet vermeld.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04 14:11             ` Gene Heskett
@ 2007-03-05  2:31               ` Zwane Mwaikambo
  2007-03-05  3:16                 ` Gene Heskett
  0 siblings, 1 reply; 28+ messages in thread
From: Zwane Mwaikambo @ 2007-03-05  2:31 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, Con Kolivas, ck list

On Sun, 4 Mar 2007, Gene Heskett wrote:

> >There will be times when the mainline scheduler feels more interactive
> > than this scheduler, and that is because it has significant unfairness
> > granted towards interactive tasks. This degree of unfairness in an
> > effort to maintain interactivity has been criticised and causes
> > problems in certain environments with both loss of fairness, relative
> > starvation and is not entirely predictable.
> 
> Well, in 20 minutes of playing, I am so far VERY impressed, the kmail 
> composer is typing to the screen in unison to my keystrokes, where with 
> the older version it often went away for 10 or more seconds at a time, 
> then displayed the last sentence I had typed in one (or 2 sometimes) 
> swell foop.  Now I can back up and correct a typo in real time whereas 
> before, it was often faster if the typo was half a line back, and the key 
> repeat so darned slow it was often over a second per character cell 
> moved, to go grab the mouse, position the cursor on the typo, click, wait 
> for the indicator to move, and then fix it.  Typing is now pleasurable 
> again.  Key repeats seem to remain at the set in the bios key repeat 
> speed.  Amazing.  I do believe you have given me back my machine.

What are the specs on your hardware?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-05  2:31               ` Zwane Mwaikambo
@ 2007-03-05  3:16                 ` Gene Heskett
  0 siblings, 0 replies; 28+ messages in thread
From: Gene Heskett @ 2007-03-05  3:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: Zwane Mwaikambo, Con Kolivas, ck list

On Sunday 04 March 2007, Zwane Mwaikambo wrote:
>On Sun, 4 Mar 2007, Gene Heskett wrote:
>> >There will be times when the mainline scheduler feels more
>> > interactive than this scheduler, and that is because it has
>> > significant unfairness granted towards interactive tasks. This
>> > degree of unfairness in an effort to maintain interactivity has been
>> > criticised and causes problems in certain environments with both
>> > loss of fairness, relative starvation and is not entirely
>> > predictable.
>>
>> Well, in 20 minutes of playing, I am so far VERY impressed, the kmail
>> composer is typing to the screen in unison to my keystrokes, where
>> with the older version it often went away for 10 or more seconds at a
>> time, then displayed the last sentence I had typed in one (or 2
>> sometimes) swell foop.  Now I can back up and correct a typo in real
>> time whereas before, it was often faster if the typo was half a line
>> back, and the key repeat so darned slow it was often over a second per
>> character cell moved, to go grab the mouse, position the cursor on the
>> typo, click, wait for the indicator to move, and then fix it.  Typing
>> is now pleasurable again.  Key repeats seem to remain at the set in
>> the bios key repeat speed.  Amazing.  I do believe you have given me
>> back my machine.
>
>What are the specs on your hardware?

XP-2800 Athlon on a biostar board whose model number I've forgotten. 1 GB 
of dual channel 400 mhz ram running at 333 because this athlon falls over 
at 400mhz fsb settings.  480GB of drives in 3 pieces.

Running a jaton 3dForce 6200-256 video card with nvidia driver, which does 
the opengl stuff better than the ati 9200se I took out a week ago, but 
this nvidia card is several times harder on the cpu when tvtime is 
running than the ati was.  There are other cards in this box, 1394, nic, 
usb expander, but that is the main list of power burners.

Its dmesg clock is just short of 2100. and:
Calibrating delay using timer specific routine.. 4177.82 BogoMIPS 
(lpj=2088910)

In comparison to an XP-1400 I had before, this is nowhere near 2x faster.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive  cpu scheduler
  2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
  2007-03-04  7:45 ` [ck] " Con Kolivas
  2007-03-04 11:08 ` Gene Heskett
@ 2007-03-05 21:52 ` Con Kolivas
  2007-03-08  8:53 ` Ingo Molnar
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Con Kolivas @ 2007-03-05 21:52 UTC (permalink / raw)
  To: ck; +Cc: linux kernel mailing list

On Sunday 04 March 2007 18:00, Con Kolivas wrote:
> This message is to announce the first general public release of the
> "Rotating Staircase DeadLine" cpu scheduler.

>  A full rollup of the patch for 2.6.20:
> http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch

This patch has been queued at test.kernel.org (thanks Andy Whitcroft).

Here is a summary of the first few benchmarks from those tests rsdl 0.26 vs 
mainline 2.6.20 (However much it is that you value these particular 
benchmarks...) on amd64 4x:


Kernbench (lower elapsed is better);
=========
2.6.20:
Elapsed: 103.058s User: 351.974s System: 36.474s CPU: 376.6%
2.6.20-rsdl:
Elapsed: 102.848s User: 359.186s System: 36.666s CPU: 384.6%


tbench (higher is better):
======
2.6.20:
Throughput 165.5 MB/sec 1 procs
2.6.20-rsdl:
Throughput 261.418 MB/sec 1 procs


reaim (higher is better):
=====
2.6.20:
Max Jobs per Minute 500727.27
2.6.20-rsdl:
Max Jobs per Minute 612000.00

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-04 16:08               ` [ck] " jos poortvliet
@ 2007-03-05 23:05                 ` Bill Davidsen
  2007-03-06  0:18                   ` Con Kolivas
  0 siblings, 1 reply; 28+ messages in thread
From: Bill Davidsen @ 2007-03-05 23:05 UTC (permalink / raw)
  To: jos poortvliet; +Cc: ck, Gene Heskett, Willy Tarreau, linux-kernel

jos poortvliet wrote:
> Op Sunday 04 March 2007, schreef Willy Tarreau:
>> Hi Con !
>>> This was designed to be robust for any application since linux demands a
>>> general purpose scheduler design, while preserving interactivity, instead
>>> of optimising for one particular end use.
>> Well, I haven't tested it yet, but your design choices please me. As you
>> know, I've been one of those encountering big starvation problems with
>> the original scheduler, making 2.6 unusable for me in many situations. I
>> welcome your work and want to thank you for the time you spend trying to
>> fix it.
>>
>> Keep up the good work,
>> Willy
>>
>> PS: I've looked at your graphs, I hope you're on the way to something
>> really better than the 21 first 2.6 releases !
> Well, imho his current staircase scheduler already does a better job compared 
> to mainline, but it won't make it in (or at least, it's not likely). So we 
> can hope this WILL make it into mainline, but I wouldn't count on it.
> 
Wrong problem, what is really needed is to get CPU scheduler choice into 
mainline, just as i/o scheduler finally did. Con has noted that for some 
loads this will present suboptimal performance, as will his -ck patches, 
as will the default scheduler. Instead of trying to make ANY one size 
fit all, we should have a means to select, at runtime, between any of 
the schedulers, and preferably to define an interface by which a user 
can insert a new scheduler in the kernel (compile in, I don't mean 
plugable) with clear and well defined rules for how that can be done.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-05 23:05                 ` Bill Davidsen
@ 2007-03-06  0:18                   ` Con Kolivas
  2007-03-06  4:41                     ` Willy Tarreau
  0 siblings, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2007-03-06  0:18 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: jos poortvliet, ck, Gene Heskett, Willy Tarreau, linux-kernel

On Tuesday 06 March 2007 10:05, Bill Davidsen wrote:
> jos poortvliet wrote:
> > Well, imho his current staircase scheduler already does a better job
> > compared to mainline, but it won't make it in (or at least, it's not
> > likely). So we can hope this WILL make it into mainline, but I wouldn't
> > count on it.
>
> Wrong problem, what is really needed is to get CPU scheduler choice into
> mainline, just as i/o scheduler finally did. Con has noted that for some
> loads this will present suboptimal performance, as will his -ck patches,
> as will the default scheduler. Instead of trying to make ANY one size
> fit all, we should have a means to select, at runtime, between any of
> the schedulers, and preferably to define an interface by which a user
> can insert a new scheduler in the kernel (compile in, I don't mean
> plugable) with clear and well defined rules for how that can be done.

Been there, done that. Wli wrote the infrastructure for plugsched; I took his 
code and got it booting and ported 3 or so different scheduler designs. It 
allowed you to build as few or as many different schedulers into the kernel 
and either boot the only one you built into your kernel, or choose a 
scheduler at boot time. That code got permavetoed by both Ingo and Linus. 
After that I gave up on that code and handed it over to Peter Williams who 
still maintains it. So please note that I pushed the plugsched barrow 
previously and still don't think it's a bad idea, but the maintainers think 
it's the wrong approach.

RSDL is my attempt to create a cpu scheduler to try to do everything well 
instead of some things perfectly, to be best fit when trying to shoehorn it 
into any environment. 

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-06  0:18                   ` Con Kolivas
@ 2007-03-06  4:41                     ` Willy Tarreau
  2007-03-06  5:39                       ` Nicholas Miell
                                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Willy Tarreau @ 2007-03-06  4:41 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Bill Davidsen, jos poortvliet, ck, Gene Heskett, linux-kernel

On Tue, Mar 06, 2007 at 11:18:44AM +1100, Con Kolivas wrote:
> On Tuesday 06 March 2007 10:05, Bill Davidsen wrote:
> > jos poortvliet wrote:
> > > Well, imho his current staircase scheduler already does a better job
> > > compared to mainline, but it won't make it in (or at least, it's not
> > > likely). So we can hope this WILL make it into mainline, but I wouldn't
> > > count on it.
> >
> > Wrong problem, what is really needed is to get CPU scheduler choice into
> > mainline, just as i/o scheduler finally did. Con has noted that for some
> > loads this will present suboptimal performance, as will his -ck patches,
> > as will the default scheduler. Instead of trying to make ANY one size
> > fit all, we should have a means to select, at runtime, between any of
> > the schedulers, and preferably to define an interface by which a user
> > can insert a new scheduler in the kernel (compile in, I don't mean
> > plugable) with clear and well defined rules for how that can be done.
> 
> Been there, done that. Wli wrote the infrastructure for plugsched; I took his 
> code and got it booting and ported 3 or so different scheduler designs. It 
> allowed you to build as few or as many different schedulers into the kernel 
> and either boot the only one you built into your kernel, or choose a 
> scheduler at boot time. That code got permavetoed by both Ingo and Linus. 
> After that I gave up on that code and handed it over to Peter Williams who 
> still maintains it. So please note that I pushed the plugsched barrow 
> previously and still don't think it's a bad idea, but the maintainers think 
> it's the wrong approach.

In a way, I think they are right. Let me explain. Pluggable schedulers are
useful when you want to switch away from the default one. This is very useful
during development of a new scheduler, as well as when you're not satisfied
with the default scheduler. Having this feature will incitate many people to
develop their own scheduler for their very specific workload, and nothing
generic. It's a bit what happened after all : you, Peter, Nick, and Mike
have worked a lot trying to provide alternative solutions.

But when you think about it, there are other OSes which have only one scheduler
and which behave very well with tens of thousands of tasks and scale very well
with lots of CPUs (eg: solaris). So there is a real challenge here to try to
provide something at least as good and universal because we know that it can
exist. And this is what you finally did : work on a scheduler which ought to be
good with any workload.

Then, when we have a generic, good enough scheduler for most situations, I
think that it could be good to get the plugsched for very specific usages.
People working in HPC may prefer to allocate ressource differently for
instance. There may also be people refusing to mix tasks from different users
on two different siblings of one CPU for security reasons, etc... All those
would justify a plugable scheduler. But it should not be an excuse to provide
a set of bad schedulers and no good one.

The CPU scheduler is often compared to the I/O schedulers while in fact this
is a completely different story. The I/O schedulers are needed because the
hardware and filesystems may lead to very different behaviours, and the
workload may vary a lot (eg: news server, ftp server, cache, desktop, real
time streaming, ...). But at least, the default I/O scheduler was good enough
for most usages, and alternative ones are here to provide optimal solutions
to specific needs.

Regards,
Willy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-06  4:41                     ` Willy Tarreau
@ 2007-03-06  5:39                       ` Nicholas Miell
  2007-03-06 19:04                       ` jos poortvliet
  2007-03-06 21:37                       ` Bill Davidsen
  2 siblings, 0 replies; 28+ messages in thread
From: Nicholas Miell @ 2007-03-06  5:39 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Con Kolivas, Bill Davidsen, jos poortvliet, ck, Gene Heskett,
	linux-kernel

On Tue, 2007-03-06 at 05:41 +0100, Willy Tarreau wrote:
> On Tue, Mar 06, 2007 at 11:18:44AM +1100, Con Kolivas wrote:
> > On Tuesday 06 March 2007 10:05, Bill Davidsen wrote:
> > > jos poortvliet wrote:
> > > > Well, imho his current staircase scheduler already does a better job
> > > > compared to mainline, but it won't make it in (or at least, it's not
> > > > likely). So we can hope this WILL make it into mainline, but I wouldn't
> > > > count on it.
> > >
> > > Wrong problem, what is really needed is to get CPU scheduler choice into
> > > mainline, just as i/o scheduler finally did. Con has noted that for some
> > > loads this will present suboptimal performance, as will his -ck patches,
> > > as will the default scheduler. Instead of trying to make ANY one size
> > > fit all, we should have a means to select, at runtime, between any of
> > > the schedulers, and preferably to define an interface by which a user
> > > can insert a new scheduler in the kernel (compile in, I don't mean
> > > plugable) with clear and well defined rules for how that can be done.
> > 
> > Been there, done that. Wli wrote the infrastructure for plugsched; I took his 
> > code and got it booting and ported 3 or so different scheduler designs. It 
> > allowed you to build as few or as many different schedulers into the kernel 
> > and either boot the only one you built into your kernel, or choose a 
> > scheduler at boot time. That code got permavetoed by both Ingo and Linus. 
> > After that I gave up on that code and handed it over to Peter Williams who 
> > still maintains it. So please note that I pushed the plugsched barrow 
> > previously and still don't think it's a bad idea, but the maintainers think 
> > it's the wrong approach.
> 
> In a way, I think they are right. Let me explain. Pluggable schedulers are
> useful when you want to switch away from the default one. This is very useful
> during development of a new scheduler, as well as when you're not satisfied
> with the default scheduler. Having this feature will incitate many people to
> develop their own scheduler for their very specific workload, and nothing
> generic. It's a bit what happened after all : you, Peter, Nick, and Mike
> have worked a lot trying to provide alternative solutions.
> 
> But when you think about it, there are other OSes which have only one scheduler
> and which behave very well with tens of thousands of tasks and scale very well
> with lots of CPUs (eg: solaris). So there is a real challenge here to try to
> provide something at least as good and universal because we know that it can
> exist. And this is what you finally did : work on a scheduler which ought to be
> good with any workload.

Solaris has a pluggable scheduler framework (each policy --
OTHER/FIFO/RR/etc. -- is it's own separate component).

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-06  4:41                     ` Willy Tarreau
  2007-03-06  5:39                       ` Nicholas Miell
@ 2007-03-06 19:04                       ` jos poortvliet
  2007-03-06 21:37                       ` Bill Davidsen
  2 siblings, 0 replies; 28+ messages in thread
From: jos poortvliet @ 2007-03-06 19:04 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Con Kolivas, Bill Davidsen, ck, Gene Heskett, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3615 bytes --]

Op Tuesday 06 March 2007, schreef Willy Tarreau:
> In a way, I think they are right. Let me explain. Pluggable schedulers are
> useful when you want to switch away from the default one. This is very
> useful during development of a new scheduler, as well as when you're not
> satisfied with the default scheduler. Having this feature will incitate
> many people to develop their own scheduler for their very specific
> workload, and nothing generic. It's a bit what happened after all : you,
> Peter, Nick, and Mike have worked a lot trying to provide alternative
> solutions.

Did that happen for I/O? There are a few schedulers, eg some for servers, 
other more for desktop or throughput. But not 10 or something...

> But when you think about it, there are other OSes which have only one
> scheduler and which behave very well with tens of thousands of tasks and
> scale very well with lots of CPUs (eg: solaris). So there is a real
> challenge here to try to provide something at least as good and universal
> because we know that it can exist. And this is what you finally did : work
> on a scheduler which ought to be good with any workload.

I can imagine a desktop can work optimally with another scheduler than a tiny 
embedded OS in a phone or an 8-core system serving a website, or a 
distributed 512 core system doing heavy scientific calculations?!?

Optimizing for all at the same time involves some compromises, and thus limits 
performance on certain scenario's, right?

> Then, when we have a generic, good enough scheduler for most situations, I
> think that it could be good to get the plugsched for very specific usages.
> People working in HPC may prefer to allocate ressource differently for
> instance. There may also be people refusing to mix tasks from different
> users on two different siblings of one CPU for security reasons, etc... All
> those would justify a plugable scheduler. But it should not be an excuse to
> provide a set of bad schedulers and no good one.

CFQ does pretty well at most workloads, that's why it's default, right? But 
there is choice, which is a good thing. And the current mainline CPU 
scheduler isn't bad at all, so having 'no good one' won't happen anyway.

> The CPU scheduler is often compared to the I/O schedulers while in fact
> this is a completely different story. The I/O schedulers are needed because
> the hardware and filesystems may lead to very different behaviours, and the
> workload may vary a lot (eg: news server, ftp server, cache, desktop, real
> time streaming, ...). But at least, the default I/O scheduler was good
> enough for most usages, and alternative ones are here to provide optimal
> solutions to specific needs.

Ok, for I/O, the diff could be pretty big. But still, there are workloads 
which could be improved by a certain scheduler, right?

And wouldn't it make sense then to have a choice in the default kernel at 
boottime? If that wouldn't hurt performance, it would be an improvement for 
desktop distributions like (K)ubuntu who can set staircase by default, and 
server distro's offering RSDL...

At least having a desktop/interactivity optimized scheduler like staircase and 
a fair, throughput-optimized scheduler like RSDL sounds sane. RSDL does 
better at the msql testcase, staircase is better on the desktop... We're not 
talking about huge amounts of code, or 10 schedulers, and the diff of a few 
percent and better scaling on many cpu's and processes versus better 
interactivity on the desktop sounds like it's worth it.

> Regards,
> Willy

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-06  4:41                     ` Willy Tarreau
  2007-03-06  5:39                       ` Nicholas Miell
  2007-03-06 19:04                       ` jos poortvliet
@ 2007-03-06 21:37                       ` Bill Davidsen
  2007-03-06 21:54                         ` Willy Tarreau
  2 siblings, 1 reply; 28+ messages in thread
From: Bill Davidsen @ 2007-03-06 21:37 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Con Kolivas, jos poortvliet, ck, Gene Heskett, linux-kernel

Willy Tarreau wrote:
> On Tue, Mar 06, 2007 at 11:18:44AM +1100, Con Kolivas wrote:
>   
>> On Tuesday 06 March 2007 10:05, Bill Davidsen wrote:
>>     
>>> jos poortvliet wrote:
>>>       
>>>> Well, imho his current staircase scheduler already does a better job
>>>> compared to mainline, but it won't make it in (or at least, it's not
>>>> likely). So we can hope this WILL make it into mainline, but I wouldn't
>>>> count on it.
>>>>         
>>> Wrong problem, what is really needed is to get CPU scheduler choice into
>>> mainline, just as i/o scheduler finally did. Con has noted that for some
>>> loads this will present suboptimal performance, as will his -ck patches,
>>> as will the default scheduler. Instead of trying to make ANY one size
>>> fit all, we should have a means to select, at runtime, between any of
>>> the schedulers, and preferably to define an interface by which a user
>>> can insert a new scheduler in the kernel (compile in, I don't mean
>>> plugable) with clear and well defined rules for how that can be done.
>>>       
>> Been there, done that. Wli wrote the infrastructure for plugsched; I took his 
>> code and got it booting and ported 3 or so different scheduler designs. It 
>> allowed you to build as few or as many different schedulers into the kernel 
>> and either boot the only one you built into your kernel, or choose a 
>> scheduler at boot time. That code got permavetoed by both Ingo and Linus. 
>> After that I gave up on that code and handed it over to Peter Williams who 
>> still maintains it. So please note that I pushed the plugsched barrow 
>> previously and still don't think it's a bad idea, but the maintainers think 
>> it's the wrong approach.
>>     
>
> In a way, I think they are right. Let me explain. Pluggable schedulers are
> useful when you want to switch away from the default one. This is very useful
> during development of a new scheduler, as well as when you're not satisfied
> with the default scheduler. Having this feature will incitate many people to
> develop their own scheduler for their very specific workload, and nothing
> generic. It's a bit what happened after all : you, Peter, Nick, and Mike
> have worked a lot trying to provide alternative solutions.
>
> But when you think about it, there are other OSes which have only one scheduler
> and which behave very well with tens of thousands of tasks and scale very well
> with lots of CPUs (eg: solaris). So there is a real challenge here to try to
> provide something at least as good and universal because we know that it can
> exist. And this is what you finally did : work on a scheduler which ought to be
> good with any workload.
>
>   
The problem is not with "any workload," because that's not the issue, 
the issue is the definition of "good" matching the administrator's 
policy. And that's where the problem comes in. We have the default 
scheduler, which favors interactive jobs. We have Con's staircase 
scheduler which is part of an interactivity package. We have the 
absolutely fair scheduler which is, well... fair, and keeps things 
smooth and under reasonable load crisp.

There are other schedulers in the pluggable package, I did a doorknob 
scheduler for 2.2 (everybody gets a turn, special case of round-robin). 
I'm sure people have quietly hacked many more, which have never been 
presented to public view.

The point is that no one CPU scheduler will satisfy the policy needs of 
all users, any more than one i/o scheduler does so. We have realtime 
scheduling, preempt both voluntary and involuntary, why should we not 
have multiple CPU schedulers. If Linus has an objection to plugable 
schedulers, then let's identify what the problem is and address it. If 
that means one scheduler or the other must be compiled in, or all 
compiled in and selected, so be it.

> Then, when we have a generic, good enough scheduler for most situations, I
> think that it could be good to get the plugsched for very specific usages.
> People working in HPC may prefer to allocate ressource differently for
> instance. There may also be people refusing to mix tasks from different users
> on two different siblings of one CPU for security reasons, etc... All those
> would justify a plugable scheduler. But it should not be an excuse to provide
> a set of bad schedulers and no good one.
>
>   
Unless you force the the definition of "good" to "what the default 
scheduler does," there can be no "one" good one. Choice is good, no one 
is calling for bizarre niche implementations, but we have at minimum 
three CPU schedulers which as "best" for a large number of users. 
(current default, and Con's fair and interactive flavors, before you ask).
> The CPU scheduler is often compared to the I/O schedulers while in fact this
> is a completely different story. The I/O schedulers are needed because the
> hardware and filesystems may lead to very different behaviours, and the
> workload may vary a lot (eg: news server, ftp server, cache, desktop, real
> time streaming, ...). But at least, the default I/O scheduler was good enough
> for most usages, and alternative ones are here to provide optimal solutions
> to specific needs.
And multiple schedulers are needed because the type of load, mix of 
loads, and admin preference all require decisions at the policy which 
can't be covered by a single solution. Or at least none of the existing 
solutions, and I think letting people tune the guts of scheduler policy 
is more dangerous than giving a selection of solutions. Linux has been 
about choice all along, I hope it's nearly time for a solution better 
than patches to be presented.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free   interactive cpu scheduler
  2007-03-06 21:37                       ` Bill Davidsen
@ 2007-03-06 21:54                         ` Willy Tarreau
  0 siblings, 0 replies; 28+ messages in thread
From: Willy Tarreau @ 2007-03-06 21:54 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Con Kolivas, jos poortvliet, ck, Gene Heskett, linux-kernel

Hi Bill,

On Tue, Mar 06, 2007 at 04:37:37PM -0500, Bill Davidsen wrote:
(...)
> The point is that no one CPU scheduler will satisfy the policy needs of 
> all users, any more than one i/o scheduler does so. We have realtime 
> scheduling, preempt both voluntary and involuntary, why should we not 
> have multiple CPU schedulers. If Linus has an objection to plugable 
> schedulers, then let's identify what the problem is and address it. If 
> that means one scheduler or the other must be compiled in, or all 
> compiled in and selected, so be it.

I'm not in Linus' head, but I think that he wanted the recurrent scheduler
problems to be addressed first for most users before going further. Too
much choice is often dangerous for quality. For instance, look at all the
netfilter modules. Many of them were completely bogus in their early stages,
and some of them even do mostly the same jobs, and many of them have never
left the "extra" stage. Choice is good to detect users' needs, it's good
for global evolution, but it's not as good when you want to have something
good enough for most people.

> >Then, when we have a generic, good enough scheduler for most situations, I
> >think that it could be good to get the plugsched for very specific usages.
> >People working in HPC may prefer to allocate ressource differently for
> >instance. There may also be people refusing to mix tasks from different 
> >users
> >on two different siblings of one CPU for security reasons, etc... All those
> >would justify a plugable scheduler. But it should not be an excuse to 
> >provide
> >a set of bad schedulers and no good one.
> >
> >  
> Unless you force the the definition of "good" to "what the default 
> scheduler does," there can be no "one" good one. Choice is good, no one 
> is calling for bizarre niche implementations, but we have at minimum 
> three CPU schedulers which as "best" for a large number of users. 
> (current default, and Con's fair and interactive flavors, before you ask).

By "good", I mean a scheduler that is not trivially DoSable, and which
does not cause unexpected long pauses to some processes without any reason
(processes which cannot get any time slice for tens of seconds, or ssh
daemons which freeze under system load, to the point of totally preventing
remote administration past 50% CPU usage on some systems).

> >The CPU scheduler is often compared to the I/O schedulers while in fact 
> >this
> >is a completely different story. The I/O schedulers are needed because the
> >hardware and filesystems may lead to very different behaviours, and the
> >workload may vary a lot (eg: news server, ftp server, cache, desktop, real
> >time streaming, ...). But at least, the default I/O scheduler was good 
> >enough
> >for most usages, and alternative ones are here to provide optimal solutions
> >to specific needs.
> And multiple schedulers are needed because the type of load, mix of 
> loads, and admin preference all require decisions at the policy which 
> can't be covered by a single solution. Or at least none of the existing 
> solutions, and I think letting people tune the guts of scheduler policy 
> is more dangerous than giving a selection of solutions. Linux has been 
> about choice all along, I hope it's nearly time for a solution better 
> than patches to be presented.

There's a difference between CPU and I/O scheduler though. With the CPU
scheduler, you've always had the choice to assign per-process priorities 
with "nice". Don't get me wrong, I'm all for pluggable schedulers, as I'm
an ever unsatisfied optimizer. It's just that I think it has been good to
encourage people to focus on real issues before dispersing efforts on
different needs. I hope that Con's work will eventually get merged and
that the door will be opened towards pluggable schedulers.

Best regards,
Willy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
                   ` (2 preceding siblings ...)
  2007-03-05 21:52 ` Con Kolivas
@ 2007-03-08  8:53 ` Ingo Molnar
  2007-03-08 10:07   ` Con Kolivas
  2007-03-08 20:25 ` Fabio Comolli
  2007-03-09 21:11 ` [ck] " Rodney Gordon II
  5 siblings, 1 reply; 28+ messages in thread
From: Ingo Molnar @ 2007-03-08  8:53 UTC (permalink / raw)
  To: Con Kolivas
  Cc: linux kernel mailing list, ck list, Andrew Morton, Linus Torvalds


* Con Kolivas <kernel@kolivas.org> wrote:

> This message is to announce the first general public release of the 
> "Rotating Staircase DeadLine" cpu scheduler.
> 
> Based on previous work from the staircase cpu scheduler I set out to 
> design, from scratch, a new scheduling policy design which satisfies 
> every requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) 
> task management.

cool! I like this even more than i liked your original staircase 
scheduler from 2 years ago :) Lets try what we did back then: put it 
into -mm and see what breaks (if anything). But in general, it is 
becoming increasingly clear that the interactivity estimator is a more 
fragile concept than the built-in quota mechanism of the staircase 
scheduler, so if it works in practice i'm quite in favor of it, even if 
it regresses /some/ workloads.

	Ingo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-08  8:53 ` Ingo Molnar
@ 2007-03-08 10:07   ` Con Kolivas
  0 siblings, 0 replies; 28+ messages in thread
From: Con Kolivas @ 2007-03-08 10:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux kernel mailing list, ck list, Andrew Morton, Linus Torvalds

On Thursday 08 March 2007 19:53, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > This message is to announce the first general public release of the
> > "Rotating Staircase DeadLine" cpu scheduler.
> >
> > Based on previous work from the staircase cpu scheduler I set out to
> > design, from scratch, a new scheduling policy design which satisfies
> > every requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER)
> > task management.
>
> cool! I like this even more than i liked your original staircase
> scheduler from 2 years ago :) Lets try what we did back then: put it
> into -mm and see what breaks (if anything). But in general, it is
> becoming increasingly clear that the interactivity estimator is a more
> fragile concept than the built-in quota mechanism of the staircase
> scheduler, so if it works in practice i'm quite in favor of it, even if
> it regresses /some/ workloads.

Great! Thanks for your support. 

After futzing around for all that time I've become sure that an approach 
without an interactive estimator is our only way forward. So far the 
throughput benchmarks are encouraging too so I suspect the estimator may be 
causing harm there too.

Ensuring the different arches and cpuidle work properly I likely will need 
help with, though, so I'd appreciate any help from people if they see 
something obvious and can get a grip of my code.

Thanks!

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
                   ` (3 preceding siblings ...)
  2007-03-08  8:53 ` Ingo Molnar
@ 2007-03-08 20:25 ` Fabio Comolli
  2007-03-08 20:57   ` Con Kolivas
  2007-03-09 21:11 ` [ck] " Rodney Gordon II
  5 siblings, 1 reply; 28+ messages in thread
From: Fabio Comolli @ 2007-03-08 20:25 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list, ck list

Hi Con
It would be nice if you could rebase this patch to latest git or at
least to 2.6.21-rc3.
Regards,
Fabio




On 3/4/07, Con Kolivas <kernel@kolivas.org> wrote:
> This message is to announce the first general public release of the "Rotating
> Staircase DeadLine" cpu scheduler.
>
> Based on previous work from the staircase cpu scheduler I set out to design,
> from scratch, a new scheduling policy design which satisfies every
> requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task management.
>
> Available for download are:
>
>  A full rollup of the patch for 2.6.20:
> http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch
>
>  Split patches for 2.6.20(which will follow this email):
> http://ck.kolivas.org/patches/staircase-deadline/split-out/
>
>  The readme (which will also constitute the rest of this email):
> http://ck.kolivas.org/patches/staircase-deadline/rsdl_scheduler.readme
>
>
> The following readme is also included as documentation in
> Documentation/sched-design.txt
>
>
> Rotating Staircase Deadline cpu scheduler policy
> ================================================
>
> Design summary
> ==============
>
> A novel design which incorporates a foreground-background descending priority
> system (the staircase) with runqueue managed minor and major epochs (rotation
> and deadline).
>
>
> Features
> ========
>
> A starvation free, strict fairness O(1) scalable design with interactivity
> as good as the above restrictions can provide. There is no interactivity
> estimator, no sleep/run measurements and only simple fixed accounting.
> The design has strict enough a design and accounting that task behaviour
> can be modelled and maximum scheduling latencies can be predicted by
> the virtual deadline mechanism that manages runqueues. The prime concern
> in this design is to maintain fairness at all costs determined by nice level,
> yet to maintain as good interactivity as can be allowed within the
> constraints of strict fairness.
>
>
> Design description
> ==================
>
> RSDL works off the principle of providing each task a quota of runtime that
> it is allowed to run at each priority level equal to its static priority
> (ie. its nice level) and every priority below that. When each task is queued,
> the cpu that it is queued onto also keeps a record of that quota. If the
> task uses up its quota it is decremented one priority level. Also, if the cpu
> notices a quota full has been used for that priority level, it pushes
> everything remaining at that priority level to the next lowest priority
> level. Once every runtime quota has been consumed of every priority level,
> a task is queued on the "expired" array. When no other tasks exist with
> quota, the expired array is activated and fresh quotas are handed out. This
> is all done in O(1).
>
>
> Design details
> ==============
>
> Each cpu has its own runqueue which micromanages its own epochs, and each
> task keeps a record of its own entitlement of cpu time. Most of the rest
> of these details apply to non-realtime tasks as rt task management is
> straight forward.
>
> Each runqueue keeps a record of what major epoch it is up to in the
> rq->prio_rotation field which is incremented on each major epoch. It also
> keeps a record of quota available to each priority value valid for that
> major epoch in rq->prio_quota[].
>
> Each task keeps a record of what major runqueue epoch it was last running
> on in p->rotation. It also keeps a record of what priority levels it has
> already been allocated quota from during this epoch in a bitmap p->bitmap.
>
> The only tunable that determines all other details is the RR_INTERVAL. This
> is set to 6ms (minimum on 1000HZ, higher at different HZ values).
>
> All tasks are initially given a quota based on RR_INTERVAL. This is equal to
> RR_INTERVAL between nice values of 0 and 19, and progressively larger for
> nice values from -1 to -20. This is assigned to p->quota and only changes
> with changes in nice level.
>
> As a task is first queued, it checks in recalc_task_prio to see if it has
> run at this runqueue's current priority rotation. If it has not, it will
> have its p->prio level set to equal its p->static_prio (nice level) and will
> be given a p->time_slice equal to the p->quota, and has its allocation
> bitmap bit set in p->bitmap for its static priority (nice value). This
> quota is then also added to the current runqueue's rq->prio_quota[p->prio].
> It is then queued on the current active priority array.
>
> If a task has already been running during this major epoch, if it has
> p->time_slice left and the rq->prio_quota for the task's p->prio still
> has quota, it will be placed back on the active array, but no more quota
> will be added to either the task or the runqueue quota.
>
> If a task has been running during this major epoch, but does not have
> p->time_slice left or the runqueue's prio_quota for this task's p->prio
> does not have quota, it will find the next lowest priority in its bitmap
> that it has not been allocated quota from. It then gets the a full quota
> in p->time_slice and adds that to the quota value for the relevant priority
> rq->prio_quota. It is then queued on the current active priority array at
> the newly determined lower priority.
>
> If a task has been running during this major epoch, and does not have
> any entitlement left in p->bitmap and no time_slice left, it will have its
> bitmap cleared, and be queued at its p->static_prio again, but on the expired
> priority array. No quota will be allocated until this task is scheduled.
>
> When a task is queued, it has its static_prio bit set in the current
> runqueue's rq->static_bitmap, and the relevant bit in the rq->dyn_bitmap.
> In order to minimise the number of bitmap lookups, the bitmap of queued
> tasks on the expired array is at the end of the same bitmap as the active
> array. The number of tasks queued at the current static_prio is kept in
> rq->prio_queued[].
>
> During a scheduler_tick where a task is running, the p->time_slice is
> decremented, and if it reaches zero then the recalc_task_prio is readjusted
> and the task rescheduled.
>
> During a task running tick, the runqueue prio_quota is also decremented. If
> it empties then a priority rotation occurs (a major or minor epoch). If the
> current runqueue's priority level is better than that of nice 19 tasks, a
> minor rotation is performed, otherwise a major rotation will occur.
>
> A minor rotation takes the remaining tasks at this priority level queue and
> merges them with a list_splice_tail with the queue from the next lowest
> priority level. At this time, any tasks that have been merged will now
> have invalid values in p->prio so this must be considered when dequeueing
> the task, and for testing for preemption.
>
> A major rotation takes the remaining tasks at this priority level queue and
> merges them with a list_splice_tail with the best priority task running on
> the expired array, and swaps the priority arrays. The priority quotas are
> reset at this time. Any tasks that have been merged will now have invalid
> values in p->array and possibly p->prio so this must be considered. The
> rq->prio_rotation is incremented at this time.
>
> When a task is dequeued, the dyn_bitmap bit is unset only after testing
> that the relevant queue is actually empty since p->prio may be inaccurate
> and no hard accounting of the number of tasks at that level is possible.
>
> When selecting a new task for scheduling, after the first dynamic bit is
> found on the dyn_bitmap, it is checked to see that a task is really queued
> at that priority or if it is a false positive due to the task being
> dequeued at a time when its p->prio does not match which queue it is on
> after some form of priority rotation. This is a rare occurrence as it tends
> to only occur if a task that is already waiting on a runqueue gets dequeued.
> If the bitmap value is in the expired array range, a major priority rotation
> is performed. If the chosen task has not been running during this major or
> minor rotation it has new quota allocated at this time, and added to the
> runqueue's quota.
>
>
> Modelling deadline behaviour
> ============================
>
> As the accounting in this design is hard and not modified by sleep average
> calculations or interactivity modifiers, it is possible to accurately
> predict the maximum latency that a task may experience under different
> conditions. This is a virtual deadline mechanism enforced by mandatory
> runqueue epochs, and not by trying to keep complicated accounting of each
> task.
>
> The maximum duration a task can run during one major epoch is determined
> by its nice value. Nice 0 tasks can run at 19 different priority levels
> for RR_INTERVAL duration during each epoch (the equivalent of nice 0 to nice
> 19). Nice 10 tasks can run at 9 priority levels for each epoch, and so on.
>
> Therefore the maximum duration a runqueue epoch can take is determined by
> the number of tasks running, and their nice level. After that, the maximum
> duration it can take before a task can wait before it get scheduled is
> determined by the difference between its nice value and the nice value of
> the highest priority task queued.
>
> In the following examples, these are _worst case scenarios_ and would rarely
> occur, but can be modelled nonetheless to determine the maximum possible
> latency.
>
> So for example, if two nice 0 tasks are running, and one has just expired as
> another is activated for the first time receiving a full quota for this
> runqueue rotation, the first task will wait:
>
> nr_tasks * max_duration + nice_difference * rr_interval
> 1 * 19 * RR_INTERVAL + 0 = 114ms
>
> In the presence of a nice 10 task, a nice 0 task would wait a maximum of
> 1 * 10 * RR_INTERVAL + 0 = 60ms
>
> In the presence of a nice 0 task, a nice 10 task would wait a maximum of
> 1 * 19 * RR_INTERVAL + 9 * RR_INTERVAL = 168ms
>
> Using a more complicated example, if there are 4 tasks running fully cpu
> bound, one each at nice -20, nice 0, nice 10 and nice 19, we can calculate
> the maximum latency possible for the nice 10 task. Note that -20 tasks are
> heavily biased for so this will be a long time, but can be modelled.
>
> The nice -20 task has quota = RR_INTERVAL + 20*RR_INTERVAL = 21*RR_INTERVAL.
> It can run at 39 priority levels so its maximum duration =
> 39 * 21 * RR_INTERVAL.
> The nice 0 task works out to
> 19 * RR_INTERVAL
> The nice 19 task works out to
> RR_INTERVAL.
>
> So major epoch can take up a maximum of
> 39 * 21 * RR_INTERVAL + 19 * RR_INTERVAL + RR_INTERVAL = 1229 * RR_INTERVAL;
>
> Then before the nice 10 task will run, the nice -20 and nice 0 task will
> run for 28 * 21 * RR_INTERVAL and 9 * RR_INTERVAL respectively for a total
> of 597 * RR_INTERVAL.
>
> This means the maximum duration a nice 10 task can wait in the presence of
> these other tasks is 1826*RR_INTERVAL. This is a long time of course and is
> heavily penalised by the presence of nice -20 tasks which would not be part
> of a normal environment.
>
> While this section describes the maximum latency a task can have, this size
> latencies will only be seen by fully cpu bound tasks.
>
>
> Achieving interactivity
> =======================
>
> A requirement of this scheduler design was to achieve good interactivity
> despite being a completely fair deadline based design. The disadvantage of
> designs that try to achieve interactivity is that they usually do so at
> the expense of maintaining fairness. As cpu speeds increase, the requirement
> for some sort of metered unfairness towards interactive tasks becomes a less
> desirable phenomenon, but low latency and fairness remains mandatory to
> good interactive performance.
>
> This design relies on the fact that interactive tasks, by their nature,
> sleep often. Most fair scheduling designs end up penalising such tasks
> indirectly giving them less than their fair possible share because of the
> sleep, and have to use a mechanism of bonusing their priority to offset
> this based on the duration they sleep. This becomes increasingly inaccurate
> as the number of running tasks rises and more tasks spend time waiting on
> runqueues rather than sleeping, and it is impossible to tell whether the
> task that's waiting on a runqueue only intends to run for a short period and
> then sleep again after than runqueue wait. Furthermore, all such designs rely
> on a period of time to pass to accumulate some form of statistic on the task
> before deciding on how much to give them preference. The shorter this period,
> the more rapidly bursts of cpu ruin the interactive tasks behaviour. The
> longer this period, the longer it takes for interactive tasks to get low
> scheduling latencies and fair cpu.
>
> This design does not measure sleep time at all. Interactive tasks that sleep
> often will wake up having consumed very little if any of their quota for
> the current major priority rotation. The longer they have slept, the less
> likely they are to even be on the current major priority rotation. Once
> woken up, though, they get to use up a their full quota for that epoch,
> whether part of a quota remains or a full quota. Overall, however, they
> can still only run as much cpu time for that epoch as any other task of the
> same nice level. This means that two tasks behaving completely differently
> from fully cpu bound to waking/sleeping extremely frequently will still
> get the same quota of cpu, but the latter will be using its quota for that
> epoch in bursts rather than continuously. This guarantees that interactive
> tasks get the same amount of cpu as cpu bound ones.
>
> The other requirement of interactive tasks is also to obtain low latencies
> for when they are scheduled. Unlike fully cpu bound tasks and the maximum
> latencies possible described in the modelling deadline behaviour section
> above, tasks that sleep will wake up with quota available usually at the
> current runqueue's priority_level or better. This means that the most latency
> they are likely to see is one RR_INTERVAL, and often they will preempt the
> current task if it is not of a sleeping nature. This then guarantees very
> low latency for interactive tasks, and the lowest latencies for the least
> cpu bound tasks.
>
> Sunday, 4th March 2007
> Con Kolivas
>
> --
> -ck
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-08 20:25 ` Fabio Comolli
@ 2007-03-08 20:57   ` Con Kolivas
  2007-03-08 21:31     ` Fabio Comolli
  0 siblings, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2007-03-08 20:57 UTC (permalink / raw)
  To: Fabio Comolli; +Cc: linux kernel mailing list, ck list

On Friday 09 March 2007 07:25, Fabio Comolli wrote:
> Hi Con
> It would be nice if you could rebase this patch to latest git or at
> least to 2.6.21-rc3.
> Regards,

Check in http://ck.kolivas.org/patches/staircase-deadline/
There's an -rc3 patch there.

-- 
-ck

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-08 20:57   ` Con Kolivas
@ 2007-03-08 21:31     ` Fabio Comolli
  0 siblings, 0 replies; 28+ messages in thread
From: Fabio Comolli @ 2007-03-08 21:31 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux kernel mailing list, ck list

Well, downloaded - compiled - booted: initng measures 17.369 seconds
to complete the boot process; without the patch the same kernel booted
in 21.553 seconds.

Very impressive.
Many thanks for your work.

Fabio






On 3/8/07, Con Kolivas <kernel@kolivas.org> wrote:
> On Friday 09 March 2007 07:25, Fabio Comolli wrote:
> > Hi Con
> > It would be nice if you could rebase this patch to latest git or at
> > least to 2.6.21-rc3.
> > Regards,
>
> Check in http://ck.kolivas.org/patches/staircase-deadline/
> There's an -rc3 patch there.
>
> --
> -ck
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ck] [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler
  2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
                   ` (4 preceding siblings ...)
  2007-03-08 20:25 ` Fabio Comolli
@ 2007-03-09 21:11 ` Rodney Gordon II
  5 siblings, 0 replies; 28+ messages in thread
From: Rodney Gordon II @ 2007-03-09 21:11 UTC (permalink / raw)
  To: ck; +Cc: Con Kolivas, linux kernel mailing list

On Sunday 04 March 2007 01:00, Con Kolivas wrote:
> This message is to announce the first general public release of the
> "Rotating Staircase DeadLine" cpu scheduler.
>
> Based on previous work from the staircase cpu scheduler I set out to
> design, from scratch, a new scheduling policy design which satisfies every
> requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task
> management.
>

Con, you've really outdone yourself this time ! :D

As a long time user of the -ck patchset, RSDL is a welcome change, and a great 
piece of code to play around with, and USE!

Booted up on my system perfectly, Pentium-D 830 3GHz, 1.5GB RAM.

No problems whatsoever so far, using 0.26. I can launch up a bunch of encode 
jobs, in SCHED_NORMAL even, and still have low latency on my desktop (I know 
it's not low latency _specific_ code, but it works very well).

I guess all I can say is.. wow. This code isn't "prime time" ready, yet.. But 
it can be, and would be a great addition to mainline.

Hell, a little tuning and merging this with a few current ck patches could 
make a damn fine kernel, and probably beat out the original staircase in 
desktops. :)

Keep up the good work !
-r

-- 
Rodney "meff" Gordon II -*- meff@pobox.com
Systems Administrator / Coder Geek -*- Open yourself to OpenSource

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2007-03-09 21:11 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-03-04  7:00 [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler Con Kolivas
2007-03-04  7:45 ` [ck] " Con Kolivas
2007-03-04 14:04   ` Con Kolivas
2007-03-04 11:08 ` Gene Heskett
2007-03-04 11:47   ` Con Kolivas
2007-03-04 12:24     ` Gene Heskett
2007-03-04 12:46       ` Con Kolivas
2007-03-04 13:25         ` Gene Heskett
2007-03-04 13:49           ` Con Kolivas
2007-03-04 14:11             ` Gene Heskett
2007-03-05  2:31               ` Zwane Mwaikambo
2007-03-05  3:16                 ` Gene Heskett
2007-03-04 14:36             ` Willy Tarreau
2007-03-04 16:08               ` [ck] " jos poortvliet
2007-03-05 23:05                 ` Bill Davidsen
2007-03-06  0:18                   ` Con Kolivas
2007-03-06  4:41                     ` Willy Tarreau
2007-03-06  5:39                       ` Nicholas Miell
2007-03-06 19:04                       ` jos poortvliet
2007-03-06 21:37                       ` Bill Davidsen
2007-03-06 21:54                         ` Willy Tarreau
2007-03-05 21:52 ` Con Kolivas
2007-03-08  8:53 ` Ingo Molnar
2007-03-08 10:07   ` Con Kolivas
2007-03-08 20:25 ` Fabio Comolli
2007-03-08 20:57   ` Con Kolivas
2007-03-08 21:31     ` Fabio Comolli
2007-03-09 21:11 ` [ck] " Rodney Gordon II

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.