public inbox for linux-kernel@vger.kernel.org
* [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
@ 2004-08-02  6:31 Peter Williams
  2004-08-02 13:42 ` William Lee Irwin III
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-02  6:31 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Michal Kaczmarski, Shane Shrybman

Version 3 of the various single priority array scheduler patches for 
2.6.7, 2.6.8-rc2 and 2.6.8-rc2-mm1 kernels are now available for 
download and evaluation:

1. Standard O(1) scheduler with active/expired arrays replaced by a 
single array and an O(1) promotion mechanism plus scheduling statistics:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_trad_FULL-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_trad_FULL-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-spa_trad_FULL-v3.0?download>

2. Basic O(1) scheduler with active/expired arrays replaced by a single 
array and an O(1) promotion mechanism plus scheduling statistics with 
all interactive bonus etc. removed:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_base_FULL-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_base_FULL-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-spa_base_FULL-v3.0?download>

3. Priority based O(1) scheduler with active/expired arrays replaced by 
a single array and an O(1) promotion mechanism plus scheduling 
statistics with new interactive bonus mechanism and throughput bonus 
mechanism:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_pb_FULL-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_pb_FULL-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-spa_pb_FULL-v3.0?download>

4. Runtime selectable choice between a priority based or entitlement 
based O(1) scheduler with active/expired arrays replaced by a single 
array and an O(1) promotion mechanism plus scheduling statistics with 
new interactive bonus mechanism and throughput bonus mechanism:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_zaphod_FULL-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_zaphod_FULL-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-spa_zaphod_FULL-v3.0?download>

5. Slightly modified version of Con Kolivas's staircase O(1) scheduler 
with active/expired arrays replaced by a single array and an O(1) 
promotion mechanism:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_sc_FULL-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_sc_FULL-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-spa_sc_FULL-v3.0?download>

6. Runtime selection between staircase, priority based and entitlement 
based O(1) schedulers:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_hydra_FULL-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_hydra_FULL-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-spa_hydra_FULL-v3.0?download>

7. Standard O(1) scheduler with scheduling statistics added:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-stats-v3.0?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-stats-v3.0.1?download>
2.6.8-rc2-mm1 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2_mm1-stats-v3.0?download>

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
       [not found] <2oEEn-197-9@gated-at.bofh.it>
@ 2004-08-02 13:27 ` Andi Kleen
  2004-08-03  0:27   ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2004-08-02 13:27 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams <pwil3058@bigpond.net.au> writes:

> Version 3 of the various single priority array scheduler patches for
> 2.6.7, 2.6.8-rc2 and 2.6.8-rc2-mm1 kernels are now available for
> download and evaluation:

[...] So many schedulers. It is hard to choose one for testing.

Do you have one you prefer and would like to be tested especially? 

Perhaps a few standard benchmark numbers for the different ones 
(e.g. volanomark or hackbench or the OSDL SDL tests) would make it 
easier to preselect.

Have you considered submitting one to -mm* for wider testing?

-Andi 



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-02  6:31 [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation Peter Williams
@ 2004-08-02 13:42 ` William Lee Irwin III
  2004-08-03  0:33   ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-02 13:42 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

On Mon, Aug 02, 2004 at 04:31:46PM +1000, Peter Williams wrote:
> 3. Priority based O(1) scheduler with active/expired arrays replaced by 
> a single array and an O(1) promotion mechanism plus scheduling 
> statistics with new interactive bonus mechanism and throughput bonus 
> mechanism:

Hmm. Given do_promotions() I'd expect fenceposts, not iteration over
the priority levels of the runqueue.


-- wli


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-02 13:27 ` Andi Kleen
@ 2004-08-03  0:27   ` Peter Williams
  2004-08-03  3:53     ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-03  0:27 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

Andi Kleen wrote:
> Peter Williams <pwil3058@bigpond.net.au> writes:
> 
> 
>>Version 3 of the various single priority array scheduler patches for
>>2.6.7, 2.6.8-rc2 and 2.6.8-rc2-mm1 kernels are now available for
>>download and evaluation:
> 
> 
> [...] So many schedulers. It is hard to choose one for testing.
> 
> Do you have one you prefer and would like to be tested especially? 

I think that ZAPHOD in "pb" mode is probably the best but I'm hoping for 
input from a wider audience.  That's why I built HYDRA: so that various 
options can be tried without having to rebuild and reboot.

The questions I'm trying to answer are:

1.  Are "priority" based ("pb") semantics for "nice" better than 
"entitlement" based ("eb") semantics?  Does it depend on what the 
system's being used for?

2. What are the appropriate values for controlling interactive bonuses 
for "pb" and "eb"?

3. Are interactive bonuses even necessary for "eb"?

4. What's the best time slice size for interactive use?

5. What's the best time slice size for a server?

6. Does the throughput bonus help?  In particular, does it reduce the 
amount of queuing when the system is not fully loaded?

7. Should any (or all) of the scheduling knobs in 
/proc/sys/kernel/cpusched/ be retained for a production scheduler?

> 
> Perhaps a few standard benchmark numbers for the different ones 
> (e.g. volanomark or hackbench or the OSDL SDL tests) would make it 
> easier to preselect.

OK, I'll do that. But some of the questions for which I am seeking 
answers are more subjective.  In particular, interactive responsiveness 
is hard to test automatically.

Also, these tests are no good for evaluating the throughput bonus 
because when the system is heavily loaded there's going to be queuing 
anyway.  Where this bonus is expected to be most helpful is at 
intermediate system loads where there is still significant queuing (due 
to lack of serendipity, i.e. tasks all sleeping at the same time and 
trying to run at the same time instead of slotting in nicely between 
each other) even though there are (theoretically) enough CPU resources 
to handle the load without any queuing.  Tests with an earlier scheduler 
showed that (where such queuing was occurring) the techniques used in 
the throughput bonus could significantly reduce it.  These tests were 
conducted with artificial loads and confirmation from the real world 
would be helpful.  (Of import here is whether there are significant 
queuing problems at intermediate loads in the real world because if 
there aren't then there is nothing for the throughput bonus to fix.  This 
is possible because the higher the randomness of a system's load, the 
better the serendipity will be.)

The scheduling statistics that are included in my patches measure time 
spent on the run queue so that measurements of improvements due to the 
throughput bonus are possible.  Another way that such improvements would 
show up is increased responsiveness from servers.

> 
> Have you considered submitting one to -mm* for wider testing?

I've made patches available for 2.6.8-rc2-mm1 and I'll provide them for 
mm2 as soon as possible.  Is there something else I should be doing?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-02 13:42 ` William Lee Irwin III
@ 2004-08-03  0:33   ` Peter Williams
  2004-08-03  2:03     ` William Lee Irwin III
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-03  0:33 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> On Mon, Aug 02, 2004 at 04:31:46PM +1000, Peter Williams wrote:
> 
>>3. Priority based O(1) scheduler with active/expired arrays replaced by 
>>a single array and an O(1) promotion mechanism plus scheduling 
>>statistics with new interactive bonus mechanism and throughput bonus 
>>mechanism:
> 
> 
> Hmm. Given do_promotions() I'd expect fenceposts, not iteration over
> the priority levels of the runqueue.

I don't understand what you mean.  Do you mean something like the more 
complex promotion mechanism in the (earlier) EBS scheduler where tasks 
only get promoted if they've been on a queue without being serviced 
within a given time?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03  0:33   ` Peter Williams
@ 2004-08-03  2:03     ` William Lee Irwin III
  2004-08-03  3:39       ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-03  2:03 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
>> Hmm. Given do_promotions() I'd expect fenceposts, not iteration over
>> the priority levels of the runqueue.

On Tue, Aug 03, 2004 at 10:33:36AM +1000, Peter Williams wrote:
> I don't understand what you mean.  Do you mean something like the more 
> complex promotion mechanism in the (earlier) EBS scheduler where tasks 
> only get promoted if they've been on a queue without being serviced 
> within a given time?

An array of size N can be rotated in O(1) time if an integer is kept
along with it to represent an offset that has to be added to externally-
visible indices mod N to recover the true index.
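For what it's worth, the rotating-offset idea can be sketched in C; the
names (`circ_array`, `promote_all`) and the slot count are illustrative
only, not taken from any posted patch:

```c
#include <assert.h>

#define NQUEUES 8                 /* illustrative number of priority slots */

/*
 * Circular priority array: externally visible priority p is stored in
 * internal slot (p + base) % NQUEUES.  Incrementing base lowers every
 * task's external priority number by one -- i.e. promotes the whole
 * array -- in O(1), with no iteration over priority levels.
 */
struct circ_array {
	int base;                 /* rotation offset */
	int slot[NQUEUES];        /* stand-in for per-priority task lists */
};

int real_index(const struct circ_array *a, int prio)
{
	return (prio + a->base) % NQUEUES;
}

void promote_all(struct circ_array *a)
{
	a->base = (a->base + 1) % NQUEUES;
}
```

With this layout a task enqueued at priority 3 before a rotation is found
at priority 2 afterwards, without ever being moved between lists.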


-- wli


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03  2:03     ` William Lee Irwin III
@ 2004-08-03  3:39       ` Peter Williams
  2004-08-03 10:49         ` William Lee Irwin III
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-03  3:39 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>Hmm. Given do_promotions() I'd expect fenceposts, not iteration over
>>>the priority levels of the runqueue.
> 
> 
> On Tue, Aug 03, 2004 at 10:33:36AM +1000, Peter Williams wrote:
> 
>>I don't understand what you mean.  Do you mean something like the more 
>>complex promotion mechanism in the (earlier) EBS scheduler where tasks 
>>only get promoted if they've been on a queue without being serviced 
>>within a given time?
> 
> 
> An array of size N can be rotated in O(1) time

And with a smaller constant than my do_promotions(). :-)

> if an integer is kept
> along with it to represent an offset that has to be added to externally-
> visible indices mod N to recover the true index.

OK.  Now I understand.

The main reason that I didn't do something like that is that 
(considering that real time tasks don't get promoted) it would complicate:

1. the selection (in schedule()) of the next task to be run as it would 
no longer be a case of just finding the first bit in the bitmap,
2. determining the appropriate list to put the task on in 
enqueue_task(), etc., and
3. determining the right bit to turn off in the bit map when dequeuing 
the last task in a slot.

As these are frequent operations compared to promotion I thought it 
would be better to leave the complexity in do_promotions().  Now that 
you've caused me to think about it again I realize that the changes in 
the above areas may not be as complicated as I first thought.  So I'll 
give it some more thought.
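(Complication 1 above is tractable with two find-first-bit passes.  A
sketch, assuming a hypothetical 32-level timesharing bitmap and a helper
name of my own choosing, not code from the posted patches:

```c
#include <assert.h>
#include <strings.h>              /* ffs() */

#define NPRIO 32                  /* illustrative timesharing levels */

/*
 * With a rotated array, "find the first set bit" must start at the
 * rotation offset and wrap: first scan slots base..NPRIO-1, then
 * slots 0..base-1.  Returns the logical (external) priority of the
 * highest-priority occupied slot, or -1 if the bitmap is empty.
 */
int first_occupied(unsigned int bitmap, int base)
{
	unsigned int lo_mask = base ? (1u << base) - 1 : 0;
	unsigned int hi = bitmap & ~lo_mask;   /* slots base..NPRIO-1 */
	unsigned int lo = bitmap & lo_mask;    /* wrapped slots 0..base-1 */
	int slot;

	if (hi)
		slot = ffs(hi) - 1;
	else if (lo)
		slot = ffs(lo) - 1;
	else
		return -1;
	return (slot - base + NPRIO) % NPRIO;  /* back to logical priority */
}
```

So the per-schedule() cost is two ffs calls instead of one, plus the
mod-NPRIO mapping on enqueue/dequeue.)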

Thanks for the suggestion
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03  0:27   ` Peter Williams
@ 2004-08-03  3:53     ` Andrew Morton
  2004-08-03  4:38       ` Peter Williams
  2004-08-03  6:51       ` Andi Kleen
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2004-08-03  3:53 UTC (permalink / raw)
  To: Peter Williams; +Cc: ak, linux-kernel

Peter Williams <pwil3058@bigpond.net.au> wrote:
>
> > Have you considered submitting one to -mm* for wider testing?
> 
>  I've made patches available for 2.6.8-rc2-mm1 and I'll provide them for 
>  mm2 as soon as possible.  Is there something else I should be doing?

I'll probably drop staircase soon, give nicksched a whizz for a couple of
cycles.  You're welcome to join the queue ;)

But let me re-repeat again that CPU scheduler problems tend to take a
_long_ time to turn up - you make some change and two months later some
person with a weird workload on expensive hardware hits a nasty corner
case.  So I do think that we'd have to hit a nasty problem with the current
scheduler to go making deep changes.

Although most of the fragility has been in CPU/node/HT balancing rather
than in the timeslice allocation area.  I assume you're not touching the
former.  It's the desktop users who seem to be more affected by the
timeslice allocation algorithms, and the testing turnaround is much faster
there.



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03  3:53     ` Andrew Morton
@ 2004-08-03  4:38       ` Peter Williams
  2004-08-03  6:51       ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Peter Williams @ 2004-08-03  4:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ak, linux-kernel

Andrew Morton wrote:
> Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>>>Have you considered submitting one to -mm* for wider testing?
>>
>> I've made patches available for 2.6.8-rc2-mm1 and I'll provide them for 
>> mm2 as soon as possible.  Is there something else I should be doing?
> 
> 
> I'll probably drop staircase soon, give nicksched a whizz for a couple of
> cycles.  You're welcome to join the queue ;)

OK, thanks.

> 
> But let me re-repeat again that CPU scheduler problems tend to take a
> _long_ time to turn up - you make some change and two months later some
> person with a weird workload on expensive hardware hits a nasty corner
> case.  So I do think that we'd have to hit a nasty problem with the current
> scheduler to go making deep changes.
> 
> Although most of the fragility has been in CPU/node/HT balancing rather
> than in the timeslice allocation area.  I assume you're not touching the
> former.

Correct.  No (algorithmic) changes have been made to load balancing type 
code.  There have been some modifications so that my statistics 
gathering copes with a task moving to a different CPU and some 
modifications due to changes in data structures, but these should not 
change the way that load balancing etc. work.

>  It's the desktop users who seem to be more affected by the
> timeslice allocation algorithms, and the testing turnaround is much faster
> there.

OK.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03  3:53     ` Andrew Morton
  2004-08-03  4:38       ` Peter Williams
@ 2004-08-03  6:51       ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2004-08-03  6:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Peter Williams, linux-kernel

> But let me re-repeat again that CPU scheduler problems tend to take a
> _long_ time to turn up - you make some change and two months later some
> person with a weird workload on expensive hardware hits a nasty corner
> case.  So I do think that we'd have to hit a nasty problem with the current
> scheduler to go making deep changes.

How about just simplifying the code? Both Con's and Peter's code 
looks a lot simpler than the stock scheduler and is easier 
to understand. If they don't work significantly worse, I think that 
would be a strong argument to move to one of them.

-Andi



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03  3:39       ` Peter Williams
@ 2004-08-03 10:49         ` William Lee Irwin III
  2004-08-04  0:37           ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-03 10:49 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

On Tue, Aug 03, 2004 at 01:39:02PM +1000, Peter Williams wrote:
> OK.  Now I understand.
> The main reason that I didn't do something like that is that 
> (considering that real time tasks don't get promoted) it would complicate:
> 1. the selection (in schedule()) of the next task to be run as it would 
> no longer be a case of just finding the first bit in the bitmap,
> 2. determining the appropriate list to put the task on in 
> enqueue_task(), etc., and
> 3. determining the right bit to turn off in the bit map when dequeuing 
> the last task in a slot.
> As these are frequent operations compared to promotion I thought it 
> would be better to leave the complexity in do_promotions().  Now that 
> you've caused me to think about it again I realize that the changes in 
> the above areas may not be as complicated as I thought would be 
> necessary.  So I'll give it some more thought.

In such schemes, realtime tasks are considered separately from
timesharing tasks. Finding a task to run or migrate proceeds with a
circular search of the portion of the bitmap used for timesharing tasks
after a linear search of that for RT tasks. The list to enqueue a
timesharing task in is just an offset from the fencepost determined by
priority. Dequeueing is supported with a tag for actual array position.
I did this for aperiodic queue rotations, which differs from your SPA.
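A sketch of the search order described here, with the RT portion left
unrotated; the constants mirror the stock scheduler's 100 RT + 40
timesharing levels, but the function name and array layout are
hypothetical:

```c
#include <assert.h>

#define MAX_RT_PRIO 100
#define NPRIO       140
#define NTS         (NPRIO - MAX_RT_PRIO)   /* timesharing levels */

/*
 * Pick the array slot of the next task to run: a linear search of the
 * (never rotated) RT portion, then a circular search of the
 * timesharing portion starting at the fencepost ts_base.
 * occupied[i] is non-zero when slot i has runnable tasks.
 */
int pick_next_slot(const unsigned char *occupied, int ts_base)
{
	int i;

	for (i = 0; i < MAX_RT_PRIO; i++)       /* RT: plain linear scan */
		if (occupied[i])
			return i;
	for (i = 0; i < NTS; i++) {             /* timesharing: circular */
		int slot = MAX_RT_PRIO + (ts_base + i) % NTS;
		if (occupied[slot])
			return slot;
	}
	return -1;                              /* runqueue empty */
}
```

Enqueueing a timesharing task at logical priority p then just uses slot
MAX_RT_PRIO + (ts_base + p) % NTS, with the actual slot tagged on the
task for dequeueing.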


-- wli


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-03 10:49         ` William Lee Irwin III
@ 2004-08-04  0:37           ` Peter Williams
  2004-08-04  0:50             ` William Lee Irwin III
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-04  0:37 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> On Tue, Aug 03, 2004 at 01:39:02PM +1000, Peter Williams wrote:
> 
>>OK.  Now I understand.
>>The main reason that I didn't do something like that is that 
>>(considering that real time tasks don't get promoted) it would complicate:
>>1. the selection (in schedule()) of the next task to be run as it would 
>>no longer be a case of just finding the first bit in the bitmap,
>>2. determining the appropriate list to put the task on in 
>>enqueue_task(), etc., and
>>3. determining the right bit to turn off in the bit map when dequeuing 
>>the last task in a slot.
>>As these are frequent operations compared to promotion I thought it 
>>would be better to leave the complexity in do_promotions().  Now that 
>>you've caused me to think about it again I realize that the changes in 
>>the above areas may not be as complicated as I thought would be 
>>necessary.  So I'll give it some more thought.
> 
> 
> In such schemes, realtime tasks are considered separately from
> timesharing tasks. Finding a task to run or migrate proceeds with a
> circular search of the portion of the bitmap used for timesharing tasks
> after a linear search of that for RT tasks. The list to enqueue a
> timesharing task in is just an offset from the fencepost determined by
> priority. Dequeueing is supported with a tag for actual array position.
> I did this for aperiodic queue rotations, which differs from your SPA.

While pondering this I have stumbled on a problem that rules out using a 
rotating list for implementing promotion.  The problem is that one of 
the requirements is that once a SCHED_NORMAL task is promoted to the 
MAX_RT_PRIO slot it stays there (as far as promotion is concerned). 
With the rotating list this isn't guaranteed and, in fact, any tasks 
that are in the MAX_RT_PRIO slot when promotion occurs will actually be 
demoted to IDLE_PRIO - 1.

Promotion should be a rare event as it is unnecessary if there are fewer 
than two tasks on the runqueue, and when there is more than one task on 
the runqueue the interval between promotions increases linearly with the 
number of runnable tasks.  It is also an O(1) operation, albeit with a 
constant factor determined by the number of occupied SCHED_NORMAL 
priority slots.

I will modify the code to take better advantage of the fact that 
promotion is not required when the number of runnable tasks is less than 
2 e.g. by resetting next_prom_due so that the first promotion after the 
number of runnable tasks exceeds 1 will only occur after a full 
promotion interval has expired.  At normal loads (and with sensible 
promotion interval settings i.e. greater than the time slice size) this 
should result in promotion never (or hardly ever) occurring and the 
overhead of do_promotions() will only have to be endured when it's 
absolutely necessary.
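The bookkeeping described above might look roughly like this; the
structure, the base interval, and the exact scaling rule are my guesses
at what resetting next_prom_due would involve, not code from the patches:

```c
#include <assert.h>

/* Illustrative promotion bookkeeping for one runqueue. */
struct rq_prom {
	unsigned int  nr_running;
	unsigned long next_prom_due;  /* hypothetical clock, in ms */
};

static const unsigned long base_prom_interval = 60;  /* ms, > time slice */

void prom_enqueue(struct rq_prom *rq, unsigned long now)
{
	rq->nr_running++;
	/* Promotion only matters once two tasks compete; restart the
	 * clock with a full interval so an idle-then-busy transition
	 * never triggers an immediate promotion. */
	if (rq->nr_running == 2)
		rq->next_prom_due = now + 2 * base_prom_interval;
}

void prom_dequeue(struct rq_prom *rq)
{
	rq->nr_running--;   /* below 2 runnable, prom_tick() goes quiet */
}

/* Returns 1 when do_promotions() should run, scaling the next
 * interval linearly with the number of runnable tasks. */
int prom_tick(struct rq_prom *rq, unsigned long now)
{
	if (rq->nr_running < 2 || now < rq->next_prom_due)
		return 0;
	rq->next_prom_due = now + rq->nr_running * base_prom_interval;
	return 1;
}
```

Under light load prom_tick() never fires, so the do_promotions() sweep
is only paid for when there is real competition for the CPU.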

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  0:37           ` Peter Williams
@ 2004-08-04  0:50             ` William Lee Irwin III
  2004-08-04  1:36               ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-04  0:50 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
>> In such schemes, realtime tasks are considered separately from
>> timesharing tasks. Finding a task to run or migrate proceeds with a
>> circular search of the portion of the bitmap used for timesharing tasks
>> after a linear search of that for RT tasks. The list to enqueue a
>> timesharing task in is just an offset from the fencepost determined by
>> priority. Dequeueing is supported with a tag for actual array position.
>> I did this for aperiodic queue rotations, which differs from your SPA.

On Wed, Aug 04, 2004 at 10:37:57AM +1000, Peter Williams wrote:
> While pondering this I have stumbled on a problem that rules out using a 
> rotating list for implementing promotion.  The problem is that one of 
> the requirements is that once a SCHED_NORMAL task is promoted to the 
> MAX_RT_PRIO slot it stays there (as far as promotion is concerned). 
> With the rotating list this isn't guaranteed and, in fact, any tasks 
> that are in the MAX_RT_PRIO slot when promotion occurs will actually be 
> demoted to IDLE_PRIO - 1.

Aperiodic rotations defer movement until MAX_RT_PRIO's slot is evacuated.


On Wed, Aug 04, 2004 at 10:37:57AM +1000, Peter Williams wrote:
> Promotion should be a rare event as it is unnecessary if there are fewer 
> than two tasks on the runqueue, and when there is more than one task on 
> the runqueue the interval between promotions increases linearly with the 
> number of runnable tasks.  It is also an O(1) operation, albeit with a 
> constant factor determined by the number of occupied SCHED_NORMAL 
> priority slots.

The asymptotics were in terms of SCHED_NORMAL priorities.


On Wed, Aug 04, 2004 at 10:37:57AM +1000, Peter Williams wrote:
> I will modify the code to take better advantage of the fact that 
> promotion is not required when the number of runnable tasks is less than 
> 2 e.g. by resetting next_prom_due so that the first promotion after the 
> number of runnable tasks exceeds 1 will only occur after a full 
> promotion interval has expired.  At normal loads (and with sensible 
> promotion interval settings i.e. greater than the time slice size) this 
> should result in promotion never (or hardly ever) occurring and the 
> overhead of do_promotions() will only have to be endured when it's 
> absolutely necessary.

The primary concern was that ticklessness etc. may require it to occur
during context switches.


-- wli


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  0:50             ` William Lee Irwin III
@ 2004-08-04  1:36               ` Peter Williams
  2004-08-04  1:51                 ` William Lee Irwin III
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-04  1:36 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>In such schemes, realtime tasks are considered separately from
>>>timesharing tasks. Finding a task to run or migrate proceeds with a
>>>circular search of the portion of the bitmap used for timesharing tasks
>>>after a linear search of that for RT tasks. The list to enqueue a
>>>timesharing task in is just an offset from the fencepost determined by
>>>priority. Dequeueing is supported with a tag for actual array position.
>>>I did this for aperiodic queue rotations, which differs from your SPA.
> 
> 
> On Wed, Aug 04, 2004 at 10:37:57AM +1000, Peter Williams wrote:
> 
>>While pondering this I have stumbled on a problem that rules out using a 
>>rotating list for implementing promotion.  The problem is that one of 
>>the requirements is that once a SCHED_NORMAL task is promoted to the 
>>MAX_RT_PRIO slot it stays there (as far as promotion is concerned). 
>>With the rotating list this isn't guaranteed and, in fact, any tasks 
>>that are in the MAX_RT_PRIO slot when promotion occurs will actually be 
>>demoted to IDLE_PRIO - 1.
> 
> 
> Aperiodic rotations defer movement until MAX_RT_PRIO's slot is evacuated.

Unfortunately, to ensure no starvation, promotion has to continue even 
when there are tasks in MAX_RT_PRIO's slot.

> 
> 
> On Wed, Aug 04, 2004 at 10:37:57AM +1000, Peter Williams wrote:
> 
>>Promotion should be a rare event as it is unnecessary if there are fewer 
>>than two tasks on the runqueue, and when there is more than one task on 
>>the runqueue the interval between promotions increases linearly with the 
>>number of runnable tasks.  It is also an O(1) operation, albeit with a 
>>constant factor determined by the number of occupied SCHED_NORMAL 
>>priority slots.
> 
> 
> The asymptotics were in terms of SCHED_NORMAL priorities.
> 
> 
> On Wed, Aug 04, 2004 at 10:37:57AM +1000, Peter Williams wrote:
> 
>>I will modify the code to take better advantage of the fact that 
>>promotion is not required when the number of runnable tasks is less than 
>>2 e.g. by resetting next_prom_due so that the first promotion after the 
>>number of runnable tasks exceeds 1 will only occur after a full 
>>promotion interval has expired.  At normal loads (and with sensible 
>>promotion interval settings i.e. greater than the time slice size) this 
>>should result in promotion never (or hardly ever) occurring and the 
>>overhead of do_promotions() will only have to be endured when it's 
>>absolutely necessary.
> 
> 
> The primary concern was that ticklessness etc. may require it to occur
> during context switches.

On a tickless system, I'd consider using a timer to control when 
do_promotions() gets called.  I imagine something similar will be 
necessary to manage time slices?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce



* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  1:36               ` Peter Williams
@ 2004-08-04  1:51                 ` William Lee Irwin III
  2004-08-04  2:40                   ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-04  1:51 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
>> Aperiodic rotations defer movement until MAX_RT_PRIO's slot is evacuated.

On Wed, Aug 04, 2004 at 11:36:59AM +1000, Peter Williams wrote:
> Unfortunately, to ensure no starvation, promotion has to continue even 
> when there are tasks in MAX_RT_PRIO's slot.

One may either demote to evict MAX_RT_PRIO immediately prior to
rotation or rely on timeslice expiry to evict MAX_RT_PRIO. Forcibly
evicting MAX_RT_PRIO undesirably accumulates tasks at the fencepost.


William Lee Irwin III wrote:
>> The primary concern was that ticklessness etc. may require it to occur
>> during context switches.

On Wed, Aug 04, 2004 at 11:36:59AM +1000, Peter Williams wrote:
> On a tickless system, I'd consider using a timer to control when 
> do_promotions() gets called.  I imagine something similar will be 
> necessary to manage time slices?

This is an alternative to scheduler accounting in context switches.
Periodicity often loses power conservation benefits.


-- wli


* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  1:51                 ` William Lee Irwin III
@ 2004-08-04  2:40                   ` Peter Williams
  2004-08-04  7:05                     ` Ingo Molnar
  2004-08-04  7:44                     ` William Lee Irwin III
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Williams @ 2004-08-04  2:40 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>Aperiodic rotations defer movement until MAX_RT_PRIO's slot is evacuated.
> 
> 
> On Wed, Aug 04, 2004 at 11:36:59AM +1000, Peter Williams wrote:
> 
>>Unfortunately, to ensure no starvation, promotion has to continue even 
>>when there are tasks in MAX_RT_PRIO's slot.
> 
> 
> One may either demote to evict MAX_RT_PRIO immediately prior to
> rotation or rely on timeslice expiry to evict MAX_RT_PRIO. Forcibly
> evicting MAX_RT_PRIO undesirably accumulates tasks at the fencepost.

It's starting to get almost as complex as the current scheme :-)

> 
> 
> William Lee Irwin III wrote:
> 
>>>The primary concern was that ticklessness etc. may require it to occur
>>>during context switches.
> 
> 
> On Wed, Aug 04, 2004 at 11:36:59AM +1000, Peter Williams wrote:
> 
>>On a tickless system, I'd consider using a timer to control when 
>>do_promotions() gets called.  I imagine something similar will be 
>>necessary to manage time slices?
> 
> 
> This is an alternative to scheduler accounting in context switches.
> Periodicity often loses power conservation benefits.

The timer would be deactivated whenever the number of runnable tasks on 
the runqueue drops below 2.  The whole thing could be managed from the 
enqueue and dequeue functions, i.e.

dequeue - if the number running is now less than two, cancel the timer; 
otherwise, decrease the expiry time to maintain the linear relationship 
between the interval and the number of runnable tasks

enqueue - if the number of runnable tasks is now 2, start the timer 
with a single interval setting; if the number is greater than two, 
increase the timer interval to maintain the linear relationship.

I'm assuming here that add_timer(), del_timer() and (especially) 
mod_timer() are relatively cheap.  If mod_timer() is too expensive some 
alternative method could be devised to maintain the linear relationship.
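The enqueue/dequeue bookkeeping described above can be sketched as 
follows, with toy stand-ins for add_timer()/mod_timer()/del_timer() 
(PROMOTION_INTERVAL and all names here are hypothetical):

```c
#include <stdbool.h>

#define PROMOTION_INTERVAL 10  /* hypothetical per-task interval, in ticks */

/* Toy stand-ins for the kernel timer API: just record whether the
 * timer is armed and when it would expire. */
struct toy_timer { bool armed; long expires; };
static long jiffies;  /* simulated clock */

static void toy_mod_timer(struct toy_timer *t, long expires)
{
	t->armed = true;
	t->expires = expires;
}

static void toy_del_timer(struct toy_timer *t)
{
	t->armed = false;
}

struct toy_rq { int nr_running; struct toy_timer promo; };

/* enqueue: promotion only matters once a second task is runnable, so
 * start the timer at 2; beyond 2, stretch the interval so that it
 * stays linear in the number of runnable tasks. */
static void toy_enqueue(struct toy_rq *rq)
{
	rq->nr_running++;
	if (rq->nr_running == 2)
		toy_mod_timer(&rq->promo, jiffies + PROMOTION_INTERVAL);
	else if (rq->nr_running > 2)
		toy_mod_timer(&rq->promo, rq->promo.expires + PROMOTION_INTERVAL);
}

/* dequeue: below two runnable tasks promotion is pointless, so cancel
 * the timer; otherwise shrink the interval to keep the linear
 * relationship. */
static void toy_dequeue(struct toy_rq *rq)
{
	rq->nr_running--;
	if (rq->nr_running < 2)
		toy_del_timer(&rq->promo);
	else
		toy_mod_timer(&rq->promo, rq->promo.expires - PROMOTION_INTERVAL);
}
```

This only illustrates the bookkeeping; whether mod_timer() is cheap 
enough to call on every enqueue/dequeue is exactly the open question.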

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  2:40                   ` Peter Williams
@ 2004-08-04  7:05                     ` Ingo Molnar
  2004-08-04  7:44                     ` William Lee Irwin III
  1 sibling, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2004-08-04  7:05 UTC (permalink / raw)
  To: Peter Williams
  Cc: William Lee Irwin III, Linux Kernel Mailing List,
	Michal Kaczmarski, Shane Shrybman


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> >>Unfortunately, to ensure no starvation, promotion has to continue even 
> >>when there are tasks in MAX_RT_PRIO's slot.
> >
> >One may either demote to evict MAX_RT_PRIO immediately prior to
> >rotation or rely on timeslice expiry to evict MAX_RT_PRIO. Forcibly
> >evicting MAX_RT_PRIO undesirably accumulates tasks at the fencepost.
> 
> It's starting to get almost as complex as the current scheme :-)

hey, it's 'complex' for a reason ;)

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  2:40                   ` Peter Williams
  2004-08-04  7:05                     ` Ingo Molnar
@ 2004-08-04  7:44                     ` William Lee Irwin III
  2004-08-05  1:06                       ` Peter Williams
  1 sibling, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-04  7:44 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
>> One may either demote to evict MAX_RT_PRIO immediately prior to
>> rotation or rely on timeslice expiry to evict MAX_RT_PRIO. Forcibly
>> evicting MAX_RT_PRIO undesirably accumulates tasks at the fencepost.

On Wed, Aug 04, 2004 at 12:40:15PM +1000, Peter Williams wrote:
> It's starting to get almost as complex as the current scheme :-)

Compare it to the background scan of the queue if several (potentially
numerous) events whose handling has been deferred are to be processed
when timer device interrupts are delivered at irregular intervals, or
not at all.


William Lee Irwin III wrote:
>> This is an alternative to scheduler accounting in context switches.
>> Periodicity often loses power conservation benefits.

On Wed, Aug 04, 2004 at 12:40:15PM +1000, Peter Williams wrote:
> The timer would be deactivated whenever the number of runnable tasks for 
> the runqueue goes below 2.  The whole thing could be managed from the 
> enqueue and dequeue functions i.e.
> dequeue - if the number running is now less than two cancel the timer 
> and otherwise decrease the expiry time to maintain the linear 
> relationship of the interval with the number of runnable tasks
> enqueue - if the number of runnable tasks is now 2 then start the timer 
> with a single interval setting and if the number is greater than two 
> then increase the timer interval to maintain the linear relationship.
> I'm assuming here that add_timer(), del_timer() and (especially) 
> mod_timer() are relatively cheap.  If mod_timer() is too expensive some 
> alternative method could be devised to maintain the linear relationship.

Naive schemes reprogram the timer device too frequently. Software
constructs are less of a concern. This also presumes that taking timer
interrupts when cpu-intensive workloads voluntarily yield often enough
is necessary or desirable. This is not so in virtualized environments,
and unnecessary interruption of userspace also degrades performance.


-- wli

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-04  7:44                     ` William Lee Irwin III
@ 2004-08-05  1:06                       ` Peter Williams
  2004-08-05  2:00                         ` William Lee Irwin III
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Williams @ 2004-08-05  1:06 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> On Wed, Aug 04, 2004 at 12:40:15PM +1000, Peter Williams wrote:
> 
>>The timer would be deactivated whenever the number of runnable tasks for 
>>the runqueue goes below 2.  The whole thing could be managed from the 
>>enqueue and dequeue functions i.e.
>>dequeue - if the number running is now less than two cancel the timer 
>>and otherwise decrease the expiry time to maintain the linear 
>>relationship of the interval with the number of runnable tasks
>>enqueue - if the number of runnable tasks is now 2 then start the timer 
>>with a single interval setting and if the number is greater than two 
>>then increase the timer interval to maintain the linear relationship.
>>I'm assuming here that add_timer(), del_timer() and (especially) 
>>mod_timer() are relatively cheap.  If mod_timer() is too expensive some 
>>alternative method could be devised to maintain the linear relationship.
> 
> 
> Naive schemes reprogram the timer device too frequently.

I had a look at mod_timer() and I agree that it's too expensive to call 
every time a task gets queued or dequeued.

> Software
> constructs are less of a concern. This also presumes that taking timer
> interrupts when cpu-intensive workloads voluntarily yield often enough
> is necessary or desirable.

Voluntary yielding can't be relied upon.  Writing a program that never 
gives up the CPU voluntarily is trivial.  Some have been known to do it 
without even trying :-)

> This is not so in virtualized environments,
> and unnecessary interruption of userspace also degrades performance.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-05  1:06                       ` Peter Williams
@ 2004-08-05  2:00                         ` William Lee Irwin III
  2004-08-05  2:12                           ` Peter Williams
  0 siblings, 1 reply; 22+ messages in thread
From: William Lee Irwin III @ 2004-08-05  2:00 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
>> Software constructs are less of a concern. This also presumes that
>> taking timer interrupts when cpu-intensive workloads voluntarily
>> yield often enough is necessary or desirable.

On Thu, Aug 05, 2004 at 11:06:51AM +1000, Peter Williams wrote:
> Voluntary yielding can't be relied upon.  Writing a program that never 
> gives up the CPU voluntarily is trivial.  Some have been known to do it 
> without even trying :-)

No reliance is implied. In such a scenario, the timers for timeslice
expiry are always cancelled because userspace voluntarily yields first,
so no timer interrupts are delivered. Should userspace fail to do so,
timer interrupts programmed for timeslice expiry would not be cancelled.
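A minimal sketch of this scheme (all names hypothetical): the expiry 
timer is armed per-dispatch and cancelled on a voluntary yield, so only 
tasks that fail to yield ever take the interrupt:

```c
#include <stdbool.h>

#define TIMESLICE 100  /* hypothetical slice length, in ticks */

struct slice_timer { bool armed; long expires; };

/* Arm a one-shot timer for the end of the incoming task's slice. */
static void slice_start(struct slice_timer *t, long now)
{
	t->armed = true;
	t->expires = now + TIMESLICE;
}

/* A voluntary yield (sleep, block, sched_yield) cancels the timer, so
 * a well-behaved workload never takes the expiry interrupt ... */
static void slice_voluntary_yield(struct slice_timer *t)
{
	t->armed = false;
}

/* ... while a CPU hog that never yields still hits expiry. */
static bool slice_expired(const struct slice_timer *t, long now)
{
	return t->armed && now >= t->expires;
}
```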


-- wli

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
  2004-08-05  2:00                         ` William Lee Irwin III
@ 2004-08-05  2:12                           ` Peter Williams
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Williams @ 2004-08-05  2:12 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Linux Kernel Mailing List, Michal Kaczmarski, Shane Shrybman

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>Software constructs are less of a concern. This also presumes that
>>>taking timer interrupts when cpu-intensive workloads voluntarily
>>>yield often enough is necessary or desirable.
> 
> 
> On Thu, Aug 05, 2004 at 11:06:51AM +1000, Peter Williams wrote:
> 
>>Voluntary yielding can't be relied upon.  Writing a program that never 
>>gives up the CPU voluntarily is trivial.  Some have been known to do it 
>>without even trying :-)
> 
> 
> No reliance is implied. In such a scenario, the timers for timeslice
> expiry are always cancelled because userspace voluntarily yields first,
> so no timer interrupts are delivered. Should userspace fail to do so,
> timer interrupts programmed for timeslice expiry would not be cancelled.

OK.  My misunderstanding.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation
@ 2004-08-07  1:44 Peter Williams
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Williams @ 2004-08-07  1:44 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Michal Kaczmarski

Version 3.1 of the various single priority array scheduler patches for 
2.6.7, 2.6.8-rc2 and 2.6.8-rc3 kernels are now available for evaluation.

This is a bug-fix release addressing a problem caused by sched_clock() 
not being monotonic.  This meant that, when measuring small time 
intervals by taking the difference between successive calls to 
sched_clock(), there was a possibility of obtaining a negative value 
(or, since unsigned long longs are being used, a very big one).  One 
unfortunate consequence of this is that adding two non-zero intervals 
together could lead to a zero result and possible divide by zero 
problems.
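One defensive pattern for this kind of problem (a sketch only, not 
necessarily the actual fix applied in v3.1) is to clamp apparently 
negative deltas to zero and to guard any divisor that wrap-around 
might have driven to zero:

```c
typedef unsigned long long u64;

/* Treat an apparently negative interval between two sched_clock()
 * readings as zero rather than letting unsigned arithmetic wrap it
 * to a very large value. */
static u64 safe_delta(u64 earlier, u64 later)
{
	return later >= earlier ? later - earlier : 0;
}

/* Never divide by a total that has collapsed to zero. */
static u64 safe_avg(u64 sum, u64 total)
{
	return total ? sum / total : 0;
}
```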


1. [ZAPHOD] Runtime selectable choice between a priority based or 
entitlement based O(1) scheduler with active/expired arrays replaced by 
a single array and an O(1) promotion mechanism plus scheduling 
statistics with new interactive bonus mechanism and throughput bonus 
mechanism:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_zaphod_FULL-v3.1?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_zaphod_FULL-v3.1?download>
2.6.8-rc3 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc3-spa_zaphod_FULL-v3.1?download>

2. Slightly modified version of Con Kolivas's staircase O(1) scheduler 
with active/expired arrays replaced by a single array and an O(1) 
promotion mechanism:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_sc_FULL-v3.1?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_sc_FULL-v3.1?download>
2.6.8-rc3 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc3-spa_sc_FULL-v3.1?download>

3. [HYDRA] Runtime selection between staircase, priority based and 
entitlement based O(1) schedulers:

2.6.7 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.7-spa_hydra_FULL-v3.1?download>
2.6.8-rc2 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc2-spa_hydra_FULL-v3.1?download>
2.6.8-rc3 
<http://prdownloads.sourceforge.net/cpuse/patch-2.6.8-rc3-spa_hydra_FULL-v3.1?download>

Other schedulers are also available from 
<https://sourceforge.net/projects/cpuse/>

So as not to interfere with the staircase scheduler's evaluation I do 
not propose to release patches for rc3-mm kernels unless requested.

I'm in the process of simplifying my schedulers (I got carried away with 
the theory) in order to reduce their overhead and hope to release a new 
slimmer version early next week.  Also on the "to do" list are CPU rate 
caps.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2004-08-07  1:45 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-02  6:31 [PATCH] V-3.0 Single Priority Array O(1) CPU Scheduler Evaluation Peter Williams
2004-08-02 13:42 ` William Lee Irwin III
2004-08-03  0:33   ` Peter Williams
2004-08-03  2:03     ` William Lee Irwin III
2004-08-03  3:39       ` Peter Williams
2004-08-03 10:49         ` William Lee Irwin III
2004-08-04  0:37           ` Peter Williams
2004-08-04  0:50             ` William Lee Irwin III
2004-08-04  1:36               ` Peter Williams
2004-08-04  1:51                 ` William Lee Irwin III
2004-08-04  2:40                   ` Peter Williams
2004-08-04  7:05                     ` Ingo Molnar
2004-08-04  7:44                     ` William Lee Irwin III
2004-08-05  1:06                       ` Peter Williams
2004-08-05  2:00                         ` William Lee Irwin III
2004-08-05  2:12                           ` Peter Williams
     [not found] <2oEEn-197-9@gated-at.bofh.it>
2004-08-02 13:27 ` Andi Kleen
2004-08-03  0:27   ` Peter Williams
2004-08-03  3:53     ` Andrew Morton
2004-08-03  4:38       ` Peter Williams
2004-08-03  6:51       ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2004-08-07  1:44 Peter Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox