* [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
@ 2004-05-28 4:52 Peter Williams
2004-05-28 9:05 ` Ingo Molnar
0 siblings, 1 reply; 15+ messages in thread
From: Peter Williams @ 2004-05-28 4:52 UTC (permalink / raw)
To: Linux Kernel Mailing List
This is a modification of the O(1) scheduler to replace the active and
expired priority arrays with a single priority array:
<http://users.bigpond.net.au/Peter-Williams/patch-2.6.6-spa>
This patch simplifies the scheduler code by:
- removing the need to manage two priority arrays per CPU
- removing the need to use variable time slices to implement "nice"
which means no more sharing of time slices between parents and children
at fork and exit.
Key features are:
-- uses the existing "interactive task" bonus scheme so there is no
change to interactive responsiveness
-- the number of priority slots has been increased by MAX_BONUS + 1 so
that there is no need to truncate the allocated priority after the
bonus is applied, and to allow the "idle" tasks to be placed on the
queues
-- at the end of each time slice (or when waking up) each task is given
a complete new time slice and, if class SCHED_NORMAL, is put in a
priority slot given by (static_prio + MAX_BONUS - interactive_bonus)
-- starvation is prevented by promoting all runnable tasks by one
priority slot at intervals determined by a base promotion interval
multiplied by the number of runnable tasks on the CPU in question
-- when there are fewer than 2 runnable processes on the CPU the
promotion is not performed
-- in order for the anti-starvation promotion mechanism to be O(1) the
"prio" field has been removed from the task structure
-- to enable the priority of the current task to be available for
various uses that previously used current->prio a record of the priority
slot for the current task is now kept in the runqueue struct
-- having the "idle" tasks on the queues allows the code for selecting
the "next" task in schedule() to be simplified
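In code terms, the slot and promotion rules above amount to something like the following. This is an illustrative sketch, not an excerpt from the patch; MAX_BONUS = 10 is assumed to match the mainline 2.6 value, and the function names are invented:

```c
#define MAX_BONUS 10	/* assumed to match the mainline 2.6 value */

/* Slot a SCHED_NORMAL task is queued in when it receives a new time
 * slice: a full interactive bonus lifts it back to its static priority;
 * no bonus leaves it MAX_BONUS slots lower (a higher slot number means
 * a lower priority). */
static int spa_slot(int static_prio, int interactive_bonus)
{
	return static_prio + MAX_BONUS - interactive_bonus;
}

/* Anti-starvation promotion period: a base interval scaled by the
 * number of runnable tasks on the CPU; promotion is skipped entirely
 * when fewer than 2 tasks are runnable. */
static long promotion_period(long base, unsigned long nr_running)
{
	return nr_running < 2 ? 0 : base * (long)nr_running;
}
```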
Affected files are:
init_task.h -- cope with removal of "prio" and "array" from task struct
sched.h -- remove "prio" and "array" from task struct
exit.c -- remove call to sched_exit()
sched.c -- the bulk of the change
Performance:
- no discernible change in interactive responsiveness
- slight improvements in results from tests such as kernbench e.g.
Average Half Load -j 3 Run:
Elapsed Time 573.582 (vs 576.196)
User Time 1576.36 (vs 1582.19)
System Time 100.866 (vs 102.266)
Percent CPU 291.8 (vs 291.8)
Context Switches 9970.2 (vs 10653.6)
Sleeps 23996 (vs 23736.2)
Average Optimal -j 16 Load Run:
Elapsed Time 439.88 (vs 443.21)
User Time 1605.09 (vs 1613.75)
System Time 104.618 (vs 106.99)
Percent CPU 388 (vs 387.6)
Context Switches 29816.4 (vs 35444.6)
Sleeps 25589.8 (vs 27509.4)
Average Maximal -j Load Run:
Elapsed Time 460.252 (vs 460.72)
User Time 1628.57 (vs 1633.04)
System Time 147.666 (vs 147.222)
Percent CPU 385.4 (vs 386.2)
Context Switches 78475.6 (vs 78242.4)
Sleeps 126011 (vs 113026)
In summary, this patch provides code simplification without cost.
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-28 4:52 [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array Peter Williams
@ 2004-05-28 9:05 ` Ingo Molnar
2004-05-28 9:24 ` Peter Williams
0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2004-05-28 9:05 UTC (permalink / raw)
To: Peter Williams; +Cc: Linux Kernel Mailing List
* Peter Williams <peterw@aurema.com> wrote:
> -- at the end of each time slice (or when waking up) each task is
> given a complete new time slice and, if class SCHED_NORMAL, is put in
> a priority slot given by (static_prio + MAX_BONUS - interactive_bonus)
this is the Achilles' heel of approaches that try to get rid of the
active/expired array and/or try to get rid of timeslice tracking. A
CPU-bound task which schedules away for small amounts of time will get a
disproportionately larger share of the CPU than a CPU-bound task that
doesn't schedule at all.
just try it - run a task that runs 95% of the time and sleeps 5% of the
time, and run a (same prio) task that runs 100% of the time. With the
current scheduler the slightly-sleeping task gets 45% of the CPU, the
looping one gets 55% of the CPU. With your patch the slightly-sleeping
process can easily monopolize 90% of the CPU!
Ingo
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-28 9:05 ` Ingo Molnar
@ 2004-05-28 9:24 ` Peter Williams
2004-05-28 9:29 ` Ingo Molnar
2004-05-28 9:57 ` Con Kolivas
0 siblings, 2 replies; 15+ messages in thread
From: Peter Williams @ 2004-05-28 9:24 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List
Ingo Molnar wrote:
> * Peter Williams <peterw@aurema.com> wrote:
>
>
>>-- at the end of each time slice (or when waking up) each task is
>>given a complete new time slice and, if class SCHED_NORMAL, is put in
>>a priority slot given by (static_prio + MAX_BONUS - interactive_bonus)
>
>
> this is the Achilles' heel of approaches that try to get rid of the
> active/expired array and/or try to get rid of timeslice tracking. A
> CPU-bound task which schedules away for small amounts of time will get a
> disproportionately larger share of the CPU than a CPU-bound task that
> doesn't schedule at all.
>
> just try it - run a task that runs 95% of the time and sleeps 5% of the
> time, and run a (same prio) task that runs 100% of the time. With the
> current scheduler the slightly-sleeping task gets 45% of the CPU, the
> looping one gets 55% of the CPU. With your patch the slightly-sleeping
> process can easily monopolize 90% of the CPU!
If these two tasks have the same nice value they should round robin
with each other in the same priority slot, which means that the one
taking smaller bites of CPU each time will in fact get less CPU than
the other, i.e. the outcome will be the opposite of what you claim.
This does not, of course, take into account the interactive bonus. If
the task doing the shorter CPU bursts manages to earn a larger
interactivity bonus than the other then it will get more CPU, but isn't
that the intention of the interactivity bonus?
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-28 9:24 ` Peter Williams
@ 2004-05-28 9:29 ` Ingo Molnar
2004-05-28 9:57 ` Con Kolivas
1 sibling, 0 replies; 15+ messages in thread
From: Ingo Molnar @ 2004-05-28 9:29 UTC (permalink / raw)
To: Peter Williams; +Cc: Linux Kernel Mailing List
* Peter Williams <peterw@aurema.com> wrote:
> >just try it - run a task that runs 95% of the time and sleeps 5% of the
> >time, and run a (same prio) task that runs 100% of the time. With the
> >current scheduler the slightly-sleeping task gets 45% of the CPU, the
> >looping one gets 55% of the CPU. With your patch the slightly-sleeping
> >process can easily monopolize 90% of the CPU!
>
> If these two tasks have the same nice value they should round robin
> with each other in the same priority slot and this means that the one
> doing the smaller bites of CPU each time will in fact get less CPU
> than the other i.e. the outcome will be the opposite of what you
> claim.
just try what i described with and without your patch and look at the
'top' output. You can do a simple loop plus short 10-20msec sleeps (via
nanosleep) to simulate a 95% busy task.
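For example, something like this (a sketch only: the spin count is an assumption and needs tuning per machine; only the loop-plus-nanosleep structure comes from the description above):

```c
#include <time.h>

/* Spin for a while, then nanosleep ~15 ms, and repeat: an approximation
 * of a task that is busy ~95% of the time.  Returns the number of
 * sleeps performed so the cycle count can be sanity-checked.  Run it
 * alongside a pure busy loop and compare their CPU shares in 'top'. */
static unsigned long mostly_busy(unsigned long cycles, unsigned long spin)
{
	struct timespec nap = { .tv_sec = 0, .tv_nsec = 15 * 1000 * 1000 };
	unsigned long done = 0;

	while (cycles--) {
		for (volatile unsigned long i = 0; i < spin; i++)
			;		/* burn CPU; tune spin for ~95% busy */
		nanosleep(&nap, NULL);	/* the short sleep suggested above */
		done++;
	}
	return done;
}
```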
Ingo
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-28 9:24 ` Peter Williams
2004-05-28 9:29 ` Ingo Molnar
@ 2004-05-28 9:57 ` Con Kolivas
1 sibling, 0 replies; 15+ messages in thread
From: Con Kolivas @ 2004-05-28 9:57 UTC (permalink / raw)
To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel Mailing List
On Fri, 28 May 2004 19:24, Peter Williams wrote:
> Ingo Molnar wrote:
> > just try it - run a task that runs 95% of the time and sleeps 5% of the
> > time, and run a (same prio) task that runs 100% of the time. With the
> > current scheduler the slightly-sleeping task gets 45% of the CPU, the
> > looping one gets 55% of the CPU. With your patch the slightly-sleeping
> > process can easily monopolize 90% of the CPU!
> This does, of course, not take into account the interactive bonus. If
> the task doing the shorter CPU bursts manages to earn a larger
> interactivity bonus than the other then it will get more CPU but isn't
> that the intention of the interactivity bonus?
No. Ideally the interactivity bonus should decide what goes first every time
to decrease the latency of interactive tasks, but the cpu percentage should
remain close to the same for equal "nice" tasks. Interactive tasks need low
scheduling latency and short bursts of high cpu usage, not more cpu usage
overall. When the cpu percentage differs significantly from this, the logic
has failed.
Con
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
@ 2004-05-29 1:39 Peter Williams
0 siblings, 0 replies; 15+ messages in thread
From: Peter Williams @ 2004-05-29 1:39 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Linux Kernel Mailing List
> Ingo Molnar wrote:
>> Peter Williams <peterw@xxxxxxxxxx> wrote:
>> >just try it - run a task that runs 95% of the time and sleeps 5% of the
>> >time, and run a (same prio) task that runs 100% of the time. With the
>> >current scheduler the slightly-sleeping task gets 45% of the CPU, the
>> >looping one gets 55% of the CPU. With your patch the slightly-sleeping
>> >process can easily monopolize 90% of the CPU!
>>
>> If these two tasks have the same nice value they should round robin
>> with each other in the same priority slot and this means that the one
>> doing the smaller bites of CPU each time will in fact get less CPU
>> than the other i.e. the outcome will be the opposite of what you
>> claim.
>
> just try what i described with and without your patch and look at the
> 'top' output. You can do a simple loop plus short 10-20msec sleeps
> (via nanosleep) to simulate a 95% busy task.
OK. I've tried this experiment and an effect similar to what you
described (but not as severe) was observed. However, as I opined, this
was due to the interactive bonus mechanism being too generous to the
sleeper - sometimes giving it as many as 9 bonus points out of a
possible maximum of 10. The severity of the effect was variable and
proportional to the size of the bonus awarded at any given time. So the
problem is due to the bonus point scheme and not the absence of the
active and expired arrays.
Further investigation of this problem has revealed that the reason that
the interactive bonus scheme is so generous to this type of task is due
to the code in schedule() that allows tasks with "activated" greater than 1
to count time spent on the run queue as sleep time. If this code is
removed the effect disappears.
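The effect of that code can be sketched as follows. This is a simplified illustration of the mechanism just described, not the actual 2.6.6 schedule() code, and the names are invented:

```c
/* When a woken task finally runs, credit the time it spent waiting on
 * the run queue as if it had been asleep, but only for tasks whose
 * "activated" value marks them as interactively woken.  This extra
 * credit is what lets a 95%-busy sleeper look almost fully interactive
 * to the bonus estimator. */
static unsigned long long credited_sleep(int activated,
					 unsigned long long real_sleep,
					 unsigned long long runqueue_wait)
{
	if (activated > 1)
		return real_sleep + runqueue_wait;
	return real_sleep;
}
```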
Interestingly enough, removing that code has no effect (that I could
discern) on the interactive response even under heavy loads. This had
been observed previously and consideration was given to removing this
code as part of this patch but it was decided to make no significant
changes to the interactive bonus scheme for this patch. I.e.
modifying/improving the interactive bonus mechanism was considered to be
another issue.
Peter
PS Sorry about the change in e-mail address, but my ISP changed my IP
address, which has caused me to lose my SSH connection to my aurema.com
account, and as it's now Saturday here in Sydney I'm unlikely to get it
back until Monday.
--
Dr Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
@ 2004-05-29 5:27 Peter Williams
2004-05-29 11:17 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Peter Williams @ 2004-05-29 5:27 UTC (permalink / raw)
To: Con Kolivas; +Cc: Ingo Molnar, Linux Kernel Mailing List
Con Kolivas wrote:
> On Fri, 28 May 2004 19:24, Peter Williams wrote:
> > Ingo Molnar wrote:
> > > just try it - run a task that runs 95% of the time and sleeps 5%
> > > of the time, and run a (same prio) task that runs 100% of the
> > > time. With the current scheduler the slightly-sleeping task gets
> > > 45% of the CPU, the looping one gets 55% of the CPU. With your
> > > patch the slightly-sleeping process can easily monopolize 90% of
> > > the CPU!
>
> > This does, of course, not take into account the interactive bonus.
> > If the task doing the shorter CPU bursts manages to earn a larger
> > interactivity bonus than the other then it will get more CPU but
> > isn't that the intention of the interactivity bonus?
>
> No. Ideally the interactivity bonus should decide what goes first
> every time to decrease the latency of interactive tasks, but the cpu
> percentage should remain close to the same for equal "nice" tasks.
There are at least two possible ways of viewing "nice": one of these is
that it is an indicator of the task's entitlement to CPU resource (which
is more or less the view you describe) and another that it is an
indicator of the task's priority with respect to access to CPU resources.
If you wish the system to take the first of these views then the
appropriate solution to the scheduling problem is to use an entitlement
based scheduler such as EBS (see
<http://sourceforge.net/projects/ebs-linux/>) which is also much simpler
than the current O(1) scheduler and has the advantage that it gives
pretty good interactive responsiveness without treating interactive
tasks specially (although some modification in this regard may be
desirable if very high loads are going to be encountered).
If you want the second of these then this proposed modification is a
simple way of getting it (with the added proviso that starvation be
avoided).
Of course, there can be other scheduling aims, such as maximising
throughput, where different scheduler paradigms need to be used. As a
matter of interest, these tend not to have very good interactive response.
If the system is an interactive system then all of these models (or at
least two of them) need to be modified to "break the rules" as far as
interactive tasks are concerned and give them higher priority in order
not to try human patience.
> Interactive tasks need low scheduling latency and short bursts of high
> cpu usage; not more cpu usage overall. When the cpu percentage differs
> significantly from this the logic has failed.
The only way this will happen is if the interactive bonus mechanism
misidentifies a CPU bound task as an interactive task and gives it a
large bonus. This seems to be the case as tasks with a 95% CPU demand
rate are being given a bonus of 9 (out of 10 possible) points.
Peter
--
Dr Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-29 5:27 Peter Williams
@ 2004-05-29 11:17 ` Con Kolivas
2004-05-30 0:19 ` Peter Williams
0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2004-05-29 11:17 UTC (permalink / raw)
To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel Mailing List
On Sat, 29 May 2004 15:27, Peter Williams wrote:
> Con Kolivas wrote:
> > On Fri, 28 May 2004 19:24, Peter Williams wrote:
> > > Ingo Molnar wrote:
> > > > just try it - run a task that runs 95% of the time and sleeps 5%
> > > > of the time, and run a (same prio) task that runs 100% of the
> > > > time. With the current scheduler the slightly-sleeping task gets
> > > > 45% of the CPU, the looping one gets 55% of the CPU. With your
> > > > patch the slightly-sleeping process can easily monopolize 90% of
> > > > the CPU!
> > >
> > > This does, of course, not take into account the interactive bonus.
> > > If the task doing the shorter CPU bursts manages to earn a larger
> > > interactivity bonus than the other then it will get more CPU but
> > > isn't that the intention of the interactivity bonus?
> >
> > No. Ideally the interactivity bonus should decide what goes first
> > every time to decrease the latency of interactive tasks, but the cpu
> > percentage should remain close to the same for equal "nice" tasks.
>
> There are at least two possible ways of viewing "nice": one of these is
> that it is an indicator of the tasks entitlement to CPU resource (which
> is more or less the view you describe) and another that it is an
> indicator of the task's priority with respect to access to CPU resources.
>
> If you wish the system to take the first of these views then the
> appropriate solution to the scheduling problem is to use an entitlement
> based scheduler such as EBS (see
> <http://sourceforge.net/projects/ebs-linux/>) which is also much simpler
> than the current O(1) scheduler and has the advantage that it gives
> pretty good interactive responsiveness without treating interactive
> tasks specially (although some modification in this regard may be
> desirable if very high loads are going to be encountered).
>
> If you want the second of these then this proposed modification is a
> simple way of getting it (with the added proviso that starvation be
> avoided).
>
> Of course, there can be other scheduling aims such as maximising
> throughput where different scheduler paradigms need to be used. As a
> matter of interest these tend to have not very good interactive response.
>
> If the system is an interactive system then all of these models (or at
> least two of them) need to be modified to "break the rules" as far as
> interactive tasks are concerned and give them higher priority in order
> not to try human patience.
>
> > Interactive tasks need low scheduling latency and short bursts of high
> > cpu usage; not more cpu usage overall. When the cpu percentage
> > differs significantly from this the logic has failed.
>
> The only way this will happen is if the interactive bonus mechanism
> misidentifies a CPU bound task as an interactive task and gives it a
> large bonus. This seems to be the case as tasks with a 95% CPU demand
> rate are being given a bonus of 9 (out of 10 possible) points.
This is all a matter of semantics and I have no argument with it.
I think your aims of simplifying the scheduler are admirable but I hope you
don't suffer the quagmire that is manipulating the interactivity stuff.
Changing one value and saying it has no apparent effect is almost certainly
wrong; surely it was put there for a reason - or rather I put it there for a
reason.
Con
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-29 11:17 ` Con Kolivas
@ 2004-05-30 0:19 ` Peter Williams
2004-05-30 12:56 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Peter Williams @ 2004-05-30 0:19 UTC (permalink / raw)
To: Con Kolivas; +Cc: Ingo Molnar, Linux Kernel Mailing List
Con Kolivas wrote:
> On Sat, 29 May 2004 15:27, Peter Williams wrote:
>
>>Con Kolivas wrote:
>> > On Fri, 28 May 2004 19:24, Peter Williams wrote:
>> > > Ingo Molnar wrote:
>> > > > just try it - run a task that runs 95% of the time and sleeps 5%
>> > > > of the time, and run a (same prio) task that runs 100% of the
>> > > > time. With the current scheduler the slightly-sleeping task gets
>> > > > 45% of the CPU, the looping one gets 55% of the CPU. With your
>> > > > patch the slightly-sleeping process can easily monopolize 90% of
>> > > > the CPU!
>> > >
>> > > This does, of course, not take into account the interactive bonus.
>> > > If the task doing the shorter CPU bursts manages to earn a larger
>> > > interactivity bonus than the other then it will get more CPU but
>> > > isn't that the intention of the interactivity bonus?
>> >
>> > No. Ideally the interactivity bonus should decide what goes first
>> > every time to decrease the latency of interactive tasks, but the cpu
>> > percentage should remain close to the same for equal "nice" tasks.
>>
>>There are at least two possible ways of viewing "nice": one of these is
>>that it is an indicator of the tasks entitlement to CPU resource (which
>>is more or less the view you describe) and another that it is an
>>indicator of the task's priority with respect to access to CPU resources.
>>
>>If you wish the system to take the first of these views then the
>>appropriate solution to the scheduling problem is to use an entitlement
>>based scheduler such as EBS (see
>><http://sourceforge.net/projects/ebs-linux/>) which is also much simpler
>>than the current O(1) scheduler and has the advantage that it gives
>>pretty good interactive responsiveness without treating interactive
>>tasks specially (although some modification in this regard may be
>>desirable if very high loads are going to be encountered).
>>
>>If you want the second of these then this proposed modification is a
>>simple way of getting it (with the added proviso that starvation be
>>avoided).
>>
>>Of course, there can be other scheduling aims such as maximising
>>throughput where different scheduler paradigms need to be used. As a
>>matter of interest these tend to have not very good interactive response.
>>
>>If the system is an interactive system then all of these models (or at
>>least two of them) need to be modified to "break the rules" as far as
>>interactive tasks are concerned and give them higher priority in order
>>not to try human patience.
>>
>> > Interactive tasks need low scheduling latency and short bursts of high
>> > cpu usage; not more cpu usage overall. When the cpu percentage
>> > differs significantly from this the logic has failed.
>>
>>The only way this will happen is if the interactive bonus mechanism
>>misidentifies a CPU bound task as an interactive task and gives it a
>>large bonus. This seems to be the case as tasks with a 95% CPU demand
>>rate are being given a bonus of 9 (out of 10 possible) points.
>
>
> This is all a matter of semantics and I have no argument with it.
>
> I think your aims of simplifying the scheduler are admirable but I hope you
> don't suffer the quagmire that is manipulating the interactivity stuff.
As you surmise, this patch is just a starting point and there are some
parts of it that may need to be fine-tuned.
For instance, the current time slice used is set at the average that the
current mechanism would have dispensed. Making this smaller would
lessen the severity of the anomaly under discussion but making it too
small would increase the context switch rate. There is evidence from
our kernbench results that we have room to decrease this value and still
keep the context switch rate below that of the current scheduler (at
least, for normal to moderately heavy loads). If possible I'd like to
get some statistics on the sleep/wake cycles of tasks on a typical
system to help make a judgment about what is the best value here.
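As a back-of-envelope check on that trade-off (a rough model only: it counts expiry-driven switches for CPU-bound tasks round-robining on one CPU, ignoring switches caused by sleeps and preemption):

```c
/* With CPU-bound tasks round-robining on a CPU, each slice expiry is
 * one context switch, so the expiry-driven switch rate is simply the
 * inverse of the slice length: halving the slice doubles the rate. */
static unsigned long expiry_switches_per_sec(unsigned long slice_ms)
{
	return slice_ms ? 1000 / slice_ms : 0;
}
```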
Another area that needs more consideration is the determination of the
promotion interval. At the moment, there's no promotion if there are
fewer than 2 runnable tasks on a CPU; otherwise the interval is a
constant multiplied by the number of runnable tasks.
Another area of investigation is (yet another) bonus intended to
increase system throughput by minimizing (or at least attempting to
minimize) the time tasks spend on the run queues. The principal
difficulty here is making sure that this doesn't adversely affect
interactive responsiveness, as it's an unfortunate fact of life that
what's good for interactive response isn't necessarily (and usually
isn't) good for maximizing throughput and vice versa.
Then, the interactive bonus mechanism might be examined but this is of
low priority as the current one seems to do a reasonable job.
Lastly, with the simplification of the scheduler I believe that it would
be possible to make both the interactive response and throughput bonuses
optional. An example of why this MIGHT BE desirable is that the
interactive response bonus adversely affects throughput, and turning it
off on servers where there are no interactive users may be worthwhile.
> Changing one value and saying it has no apparent effect is almost certainly
> wrong; surely it was put there for a reason - or rather I put it there for a
> reason.
Out of interest, what was the reason? What problem were you addressing?
Peter
--
Dr Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-30 0:19 ` Peter Williams
@ 2004-05-30 12:56 ` Con Kolivas
2004-05-31 0:04 ` Peter Williams
0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2004-05-30 12:56 UTC (permalink / raw)
To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel Mailing List
On Sun, 30 May 2004 10:19, Peter Williams wrote:
> Out of interest, what was the reason? What problem were you addressing?
The interactive credit?
There was a problem with difficulty elevating back to interactive state if an
interactive task had used too long a burst of cpu (i.e. XFree86), which was
addressed by making the bonus pseudo-exponentially curved for rapid recovery
and slow decay - in fact this is probably the most important part of
addressing the interactive tasks and had the best effect. The problem was
that giving this to all tasks meant that cpu bound tasks that had, as a
property of their behaviour, long waits on say pipes or i/o would also get
this rapid recovery to interactive state, and as soon as they became fully
bound to cpu again they would cause noticeable stalls. The standard example is
the increasing number of jobs in a make, where each job waits longer for i/o
as the job numbers increase. However there were much worse examples at even
normal to low loads, such as mpeg or divx encoding, where the encoder would
buffer say 250 frames sleeping and then do them in a burst (this is the
maximum spacing between key [I] frames, or intervals). The interactive credit
prevented those tasks that would have long but only infrequent sleeps from
getting the curved bonus/penalty.
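The shape described (rapid, curved recovery on sleep; slow decay under CPU use) can be illustrated by a toy estimator like the one below. This is purely illustrative and is not the actual formula; the names and constants are invented:

```c
#define TOY_MAX_BONUS 10

/* Halve the remaining distance to the full bonus on each qualifying
 * sleep (fast, exponential-style recovery), but shed only one point per
 * tick of CPU use (slow decay).  The interactive credit described above
 * gates which tasks are allowed this curved recovery at all. */
static int toy_bonus(int bonus, int slept)
{
	if (slept)
		return bonus + (TOY_MAX_BONUS - bonus) / 2;
	return bonus > 0 ? bonus - 1 : 0;
}
```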
Hmm... if this is black magic I guess I'm breaking the magician's cardinal
rules and revealing my tricks ;-)
Con
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-30 12:56 ` Con Kolivas
@ 2004-05-31 0:04 ` Peter Williams
2004-05-30 23:13 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Peter Williams @ 2004-05-31 0:04 UTC (permalink / raw)
To: Con Kolivas; +Cc: Ingo Molnar, Linux Kernel Mailing List
Con Kolivas wrote:
> On Sun, 30 May 2004 10:19, Peter Williams wrote:
>
>>Out of interest, what was the reason? What problem were you addressing?
>
>
> The interactive credit?
No. I was asking about the mechanism in schedule() that, based on the
value of "activated", allows some tasks to count their time on the run
queue as sleep time.
>
> There was a problem with difficulty elevating back to interactive state if an
> interactive task had used too long a burst of cpu (ie Xfree) which was
> addressed by making the bonus pseudo-exponentially curved for rapid recovery
> and slow decay - in fact this is probably the most important part of
> addressing the interactive tasks and had the best effect.
But this probably answers my question anyway.
> The problem was
> that giving this to all tasks meant that cpu bound tasks that had, as a
> property of their behaviour, long waits on say pipes or i/o would also get
> this rapid recovery to interactive state and as soon as they became fully
> bound to cpu again they would cause noticable stalls. The standard example is
> the increasing number of jobs in a make, where each job waits longer for i/o
> as the job numbers increase. However there were much worse examples at even
> normal - low loads, such as mpeg or divx encoding where the encoder would
> buffer say 250 frames sleeping and then do them in a burst. (this is the
> maximum space between key [I] frame or intervals). The interactive credit
> prevented those tasks that would have long but only infrequent sleeps from
> getting the curved bonus/penalty.
In the near future, I'll be proposing another patch that goes on top of
the single priority array (SPA) patch that has the express purpose of
reducing the time that all tasks spend on run queues. I think that one
of the side effects of this mechanism is that it will alleviate the
problem you've described above.
I also think that there is a category of task that may be automatically
detected and that needs special attention: tasks that need very regular
access to small bursts of CPU. I suspect that the tasks doing the mpeg
and divx encoding/decoding that you refer to above fall into this
category. The key to identifying these tasks would be that the variance
(or standard deviation) of their sleeps would be close to zero, and as
they probably do much the same amount of work each CPU burst the same
would be true of the variance of the length of the CPU bursts.
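A sketch of how such detection might work, tracking a running mean and variance of sleep (or burst) lengths per task with Welford's online algorithm; all of the names here are hypothetical, not from the patch:

```c
struct burst_stats {
	unsigned long n;
	double mean;
	double m2;	/* running sum of squared deviations */
};

/* Fold one observed sleep/burst length into the running statistics. */
static void burst_record(struct burst_stats *s, double len)
{
	double delta = len - s->mean;

	s->n++;
	s->mean += delta / s->n;
	s->m2 += delta * (len - s->mean);
}

/* Population variance so far; a task whose sleep and burst variances
 * both stay near zero is a candidate "regular small-burst" task. */
static double burst_variance(const struct burst_stats *s)
{
	return s->n > 1 ? s->m2 / s->n : 0.0;
}
```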
Currently, the scheduler relies on the interactive bonus to make sure
that these tasks get CPU when they need it but I suspect that to make
this happen the interactive bonus has to be larger than it might
otherwise need to be. So if these tasks can be identified and treated
specially (e.g. reserve the MAX_RT_PRIO slot for them) the interactive
bonus could be reduced and this would improve overall system throughput.
>
> Hmm... if this is black magic I guess I'm breaking the magician's cardinal
> rules and revealing my tricks ;-)
>
Peter
--
Dr Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array
2004-05-31 0:04 ` Peter Williams
@ 2004-05-30 23:13 ` Con Kolivas
0 siblings, 0 replies; 15+ messages in thread
From: Con Kolivas @ 2004-05-30 23:13 UTC (permalink / raw)
To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel Mailing List
Quoting Peter Williams <pwil3058@bigpond.net.au>:
> Con Kolivas wrote:
> > The interactive credit?
>
> No. I was asking about the mechanism in schedule() that, based on the
> value of "activated", allows some tasks to count their time on the run
> queue as sleep time.
No, this is different again. When there is significant load and an interactive
task (like X) is using bursts of cpu it will not get a chance to sleep. It will
either run out of timeslice or be preempted and thus will be constantly on the
runqueue. In that time if only sleep time is counted it is progressively seen
as a cpu bound task - in fact this was a major problem before where X would
basically die under load because you'd move a window and it would be constantly
waiting for more cpu time rather than ever sleeping.
> I also think that their is a category of task that may be automatically
> detected and that needs special attention and that is tasks that need
> very regular access to small bursts of CPU. I suspect that the tasks
> doing the mpeg and divx encoding/decoding that you refer to above fall
> into this category. The key to identifying these tasks would be that
> the variance (or standard deviation) of their sleeps would be close to
> zero and as they probably do much the same amount of work each CPU burst
> the same would be true of the variance of the length of the CPU bursts.
I've already tried that experiment and I personally failed to make the pattern
of sleep/burst correspond with interactivity. There is a huge variation in the
duration of sleep/run for all different things, and it changes with load. Of
course your modelling may be better than mine.
> Dr Peter Williams pwil3058@bigpond.net.au
Oooh I've got one of those too but I normally don't use it on lkml.
So just for this once...
Dr Con Kolivas
Thread overview: 15+ messages
2004-05-28 4:52 [RFC][PATCH][2.6.6] Replacing CPU scheduler active and expired with a single array Peter Williams
2004-05-28 9:05 ` Ingo Molnar
2004-05-28 9:24 ` Peter Williams
2004-05-28 9:29 ` Ingo Molnar
2004-05-28 9:57 ` Con Kolivas
-- strict thread matches above, loose matches on Subject: below --
2004-05-29 1:39 Peter Williams
2004-05-29 5:27 Peter Williams
2004-05-29 11:17 ` Con Kolivas
2004-05-30 0:19 ` Peter Williams
2004-05-30 12:56 ` Con Kolivas
2004-05-31 0:04 ` Peter Williams
2004-05-30 23:13 ` Con Kolivas
[not found] <214A1-6NK-7@gated-at.bofh.it>
[not found] ` <21acm-2GN-1@gated-at.bofh.it>
2004-05-29 12:24 ` Andi Kleen
2004-05-29 12:38 ` Con Kolivas
2004-06-04 7:40 ` Peter Williams