public inbox for linux-kernel@vger.kernel.org
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
@ 2004-02-26  3:30 Albert Cahalan
  2004-02-26  6:19 ` Peter Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Albert Cahalan @ 2004-02-26  3:30 UTC (permalink / raw)
  To: linux-kernel mailing list; +Cc: johnl

John Lee writes:

> The usage rates for each task are estimated using Kalman
> filter techniques, the  estimates being similar to those
> obtained by taking a running average over twice the filter
> _response half life_ (see below). However, Kalman filter
> values are cheaper to compute and don't require the
> maintenance of historical usage data.

Linux dearly needs this. Please separate out this part
of the patch and send it in.

Right now, Linux does not report the recent CPU usage
of a process. The UNIX standard requires that "ps"
report this; instead, ps substitutes CPU usage over
the whole lifetime of the process.

Both per-task and per-process (tid and tgid) numbers
are needed. Both percent and permill (1/1000) units
get reported, so don't convert to integer percent.



* [RFC][PATCH] O(1) Entitlement Based Scheduler
@ 2004-02-25 14:35 John Lee
  2004-02-25 17:09 ` Timothy Miller
                   ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: John Lee @ 2004-02-25 14:35 UTC (permalink / raw)
  To: linux-kernel

Hi everyone,

This patch is a modification of the O(1) scheduler that introduces 
O(1) entitlement based scheduling for SCHED_NORMAL tasks. This patch aims
to keep the scalability and efficiency of the current scheduler while also
providing:

- Specific allocation of CPU resources amongst tasks
- Scheduling fairness and good interactive response without the need for 
  heuristics, and
- Reduced scheduler complexity.

The fundamental concept of entitlement based sharing is that each task has an 
_entitlement_ to CPU resources that is determined by the number of _shares_ 
that it holds, and the scheduler allocates CPU to tasks so that the _rate_ at
which they receive CPU time is consistent with their entitlement. 

The usage rates for each task are estimated using Kalman filter techniques, the 
estimates being similar to those obtained by taking a running average over 
twice the filter _response half life_ (see below). However, Kalman filter
values are cheaper to compute and don't require the maintenance of historical 
usage data.

The use of CPU usage rates also makes it possible to impose _per task CPU 
usage rate caps_. This patch provides both soft and hard CPU usage rate caps per
task. The difference is that a hard cap is _always_ enforced, whereas a soft cap
is allowed to be exceeded when unused CPU cycles are available.


Features of the EBS scheduler
=============================

CPU shares
----------

Each task has a number of CPU shares that determine its entitlement. Shares can
be read/set directly via the files 

/proc/<pid>/cpu_shares
/proc/<tgid>/task/<pid>/cpu_shares

or indirectly by setting the task's nice value using nice or renice. A task 
may be allocated between 1 and 420 shares, with 20 shares being the default
allocation. A nice value >= 0 is mapped to (20 - nice) shares and a value 
< 0 is mapped to (20 + nice * nice) shares. If shares are set directly via 
/proc/<pid>/cpu_shares then the task's nice value will be adjusted accordingly.


CPU usage rate caps
-------------------

A task's CPU usage rate cap imposes a soft (or hard) upper limit on the rate at
which it can use CPU resources and can be set/read via the files 

/proc/<pid>/cpu_rate_cap 
/proc/<tgid>/task/<pid>/cpu_rate_cap  

Usage rate caps are expressed as rational numbers (e.g. "1 / 2") and hard caps 
are signified by a "!" suffix.  The rational number indicates the proportion 
of a single CPU's capacity that the task may use. The value of the number must 
be in the range 0.0 to 1.0 inclusive for soft caps. For hard caps there is an 
additional restriction that a value of 0.0 is not permitted.  Tasks with a 
soft cap of 0.0 become true background tasks and only get to run when no other 
tasks are active.

When hard capped tasks exceed their cap they are removed from the run queues 
and placed in a "sinbin" for a short while until their usage rate decays to
within limits.


Scheduler Tuning Parameters
---------------------------

The characteristics of the Kalman filter used for usage rate estimates are 
determined by the _response half life_. The default value for the half life is 
5000 msecs, but this can be set to any value between 1000 and 100000 msecs 
during a make config. Also, if the SCHED_DYNAMIC_HALF_LIFE config option is set
to Y, the half life can be modified dynamically on a running system within the 
above range by writing to /proc/cpu_half_life.

Currently EBS gives all tasks a fixed default timeslice of 100msec. As with the
half life, this can be chosen at build time (between 1 and 500msec) and building
with the SCHED_DYNAMIC_TIME_SLICE option enables on-the-fly changes via
/proc/timeslice.

Performance of the EBS scheduler depends on these parameters, as well as the
load characteristics of the system. The current default settings, however, have
worked well with most loads so far.


Scheduler statistics
--------------------

Global and per task scheduling statistics are available via /proc. To reduce
the size of this post, the details are not listed here but can be seen at
<http://ebs.aurema.com>.


Implementation
==============

Those interested mainly in EBS performance can skip this section...

Effective entitlement per share
-------------------------------

Ideally, the CPU USAGE PER SHARE of tasks demanding as much CPU as they are 
entitled to should be equal. By keeping track of the HIGHEST CPU usage per 
share that has been observed and comparing it to the CPU usage per share for 
each task that runs, tasks that are receiving less usage per share than the 
one getting the most can be given a better priority, so they can "catch up".

This highest value of CPU usage per share is maintained on each runqueue as
that CPU's _effective entitlement per share_, and is used as a basis for all
priority computations on that CPU.

In a nutshell, the task that is receiving the most CPU usage for each of its 
shares serves as the yardstick against which the treatment of other tasks on 
that CPU is measured.


Task priorities
---------------

The array switch and interactivity estimator have been removed - a task's 
eligibility to run is determined purely by its priority. A task's priority is
recalculated when it is forked, wakes up, uses up its timeslice or is
preempted. 

The following ratio:

			task's usage_per_share
-----------------------------------------------------------------------
min(task's CPU rate cap per share, rq->effective_entitlement_per_share)

[where CPU rate cap per share is simply (CPU rate cap / CPU shares)]
		
is then mapped onto the SCHED_NORMAL priority range. The mapping is such
that a ratio of 1.0 is equivalent to the mean SCHED_NORMAL priority, and 
ratios less than and greater than 1.0 are mapped to priorities less and
greater than the mean respectively. This serves to boost tasks 
using less than their entitlement and penalise those using more than their 
entitlement or CPU rate cap. It also provides interactive and I/O bound tasks 
with favourable priorities since such tasks have inherently low CPU usage.

Lastly, the ->prio field in the task structure has been eliminated. The 
runqueue structure stores the priority of the currently running task, and
enqueue/dequeue_task() have been adapted to work without ->prio. The reason for
getting rid of ->prio is to facilitate the O(1) priority promotion of runnable
tasks, explained below.


Only one heuristic
------------------

All the heuristics used by the stock scheduler have been removed. The EBS
scheduler uses only one heuristic: newly forked child tasks are given the 
same usage rate as their parent, rather than a zero usage. This is done to
mitigate the "ramp up" effect to some extent. "Ramp up" is the delay between
a change in a task's usage rate and the Kalman filter estimating the new rate,
which in this case could cause a parent to be swamped by its children. Total
elimination of ramp up is undesirable, as it is also responsible for the good
responsiveness of interactive processes.


O(1) task promotion
-------------------

When the system is busy, tasks waiting on run queues will have decaying usages,
which means that eventually tasks which have been waiting long enough will be 
entitled to a priority boost. Not giving them a boost will result in unfair
starvation, but on the other hand periodically visiting every runnable task is
an O(n) operation.

By inverting the priority and Kalman filter functions, it is possible to
determine at priority calculation time after how long a task will be entitled to
be promoted to the next best priority. These "promotion times" for each 
SCHED_NORMAL priority are then divided by the smallest of these times (call
this the promotion_interval) to obtain the "number of promotion intervals" 
for each priority. Naturally these are stored in a table indexed by priority.

enqueue_task() now places SCHED_NORMAL tasks onto an appropriate "promotion 
list" as well as the run queue. The list is determined by the enqueued task's 
current priority and the number of promotion intervals that must pass before 
it is eligible to be bumped up to the next best priority. And of course
dequeue_task() takes tasks off their promotion list. Therefore, tasks that
get to run before they are due for a promotion (which is usually the case) 
don't get one. 

Every promotion_interval jiffies, scheduler_tick() looks at only the promotion
lists that are now due for a priority bump, and anything on the lists is given
the required boost. (The highest SCHED_NORMAL and background priorities 
are ignored, as these tasks don't need promotion). Regardless of the 
number that need promoting, this is done in O(1) time. The promotion code 
exploits the fact that tasks of the same priority that are due for promotion 
at the same time, ie. are contiguous on a promotion list, ARE ALSO CONTIGUOUS 
ON THE RUN QUEUE. Therefore, promoting N tasks at priority X is simply a matter
of "lifting" these tasks out of their current place on the runqueue in one 
chunk, and appending this chunk to the (X - 1) priority list. These tasks are 
then placed onto a new promotion list according to their new (X - 1) priority. 

This simple list operation is made possible by not having to update each task's
->prio field (now that it has been removed) when moving them to their new 
position on the runqueue.


Benchmarks
==========

Benchmarking was done using contest. The following are the results of running on
a dual PIII 866MHz with 256MB RAM.

no_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        78      166.7   0       26.9    1.00
2.6.2-EBS      3        74      175.7   0       16.2    1.00

cacherun:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        75      173.3   0       24.0    0.96
2.6.2-EBS      3        71      183.1   0       12.7    0.96

process_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        93      138.7   35      54.8    1.19
2.6.2-EBS      3        91      141.8   33      53.8    1.23

ctar_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        99      141.4   1       5.1     1.27
2.6.2-EBS      3        95      145.3   1       4.2     1.28

xtar_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        93      147.3   1       9.7     1.19
2.6.2-EBS      3        89      152.8   1       6.7     1.20

io_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        99      144.4   4       14.1    1.27
2.6.2-EBS      3        91      153.8   3       10.9    1.23

read_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        193     73.6    13      7.8     2.47
2.6.2-EBS      3        136     101.5   7       6.6     1.84

list_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        83      160.2   0       4.8     1.06
2.6.2-EBS      3        80      165.0   0       2.5     1.08

mem_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        159     87.4    129     3.1     2.04
2.6.2-EBS      3        126     110.3   50      2.4     1.70

dbench_load:
Kernel    [runs]        Time    CPU%    Loads   LCPU%   Ratio
2.6.2          3        123     109.8   1       21.1    1.58
2.6.2-EBS      3        130     103.1   1       20.0    1.76

The big winners are read_load and mem_load, with all the other loads running
slightly faster than on the stock kernel. Only dbench was slightly worse.


X Windows Performance
=====================

The X server isn't strictly an interactive process, but it does have a major
influence on interactive response. The fact that it services a large number
of clients means that its CPU usage rate can be quite high, and this negates 
the above mentioned favourable treatment of interactive and I/O bound 
processes.

Therefore, for best interactive feel, it is recommended that the X server run 
with a nice value of at least -15. From my own testing, doing a window wiggle 
test with a make -j16 in the background and X reniced was slightly better than 
for the stock kernel. 

When running apps such as xmms, I recommend renicing them as well when the
background load is high. With the above setup and xmms reniced to -9,
there were no sound skips at all (without renicing, a few skips could be
detected).


Getting the patch
=================

The patch can be downloaded from

<http://sourceforge.net/projects/ebs-linux/>

Please note that there are two patches: the basic patch and the full patch. The
above description applies to the full patch. The basic patch only features
setting shares via nice, a fixed half life and timeslice, no statistics and 
soft caps only. This basic patch is for those who are mainly interested in 
looking at the core EBS changes to the stock scheduler.

The patches are against 2.6.2 (2.6.3 patches will be available shortly).

Comments, suggestions and testing are much appreciated. 

The mailing list for this project is at 
<http://lists.sourceforge.net/lists/listinfo/ebs-linux-devel>.

Cheers,

John




end of thread, other threads:[~2004-03-05  3:55 UTC | newest]

Thread overview: 66+ messages
     [not found] <1t8wp-qF-11@gated-at.bofh.it>
     [not found] ` <1th6J-az-13@gated-at.bofh.it>
     [not found]   ` <403E2929.2080705@tmr.com>
2004-02-27  3:44     ` [RFC][PATCH] O(1) Entitlement Based Scheduler Rik van Riel
2004-02-28 21:27       ` Bill Davidsen
2004-02-28 23:55         ` Peter Williams
2004-03-04 21:08           ` Timothy Miller
     [not found] <1vuMd-5jx-5@gated-at.bofh.it>
     [not found] ` <1vuMd-5jx-7@gated-at.bofh.it>
     [not found]   ` <1vuMd-5jx-9@gated-at.bofh.it>
     [not found]     ` <1vuMd-5jx-11@gated-at.bofh.it>
     [not found]       ` <1vuMd-5jx-3@gated-at.bofh.it>
     [not found]         ` <1vvyx-6jy-13@gated-at.bofh.it>
     [not found]           ` <1vBE2-48V-21@gated-at.bofh.it>
2004-03-03 21:38             ` Bill Davidsen
     [not found] <fa.fi4j08o.17nchps@ifi.uio.no.suse.lists.linux.kernel>
     [not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no.suse.lists.linux.kernel>
     [not found]   ` <yydjishqw10p.fsf@galizur.uio.no.suse.lists.linux.kernel>
     [not found]     ` <40426E1C.8010806@aurema.com.suse.lists.linux.kernel>
2004-03-03  2:48       ` Andi Kleen
2004-03-03  3:45         ` Peter Williams
2004-03-03 10:13           ` Andi Kleen
2004-03-03 23:46             ` Peter Williams
2004-03-03 15:57           ` Andi Kleen
2004-03-04  0:41             ` Peter Williams
2004-03-05  3:55               ` Andi Kleen
     [not found] <fa.ftul5bl.nlk3pr@ifi.uio.no>
     [not found] ` <fa.cvc8vnj.ahebjd@ifi.uio.no>
2004-03-01  9:18   ` Joachim B Haga
2004-03-01 10:18     ` Paul Wagland
2004-03-01 19:11       ` Mike Fedyk
     [not found] <fa.jgj0bdi.b3u6qk@ifi.uio.no>
2004-03-01  1:54 ` Andy Lutomirski
2004-03-01  2:54   ` Peter Williams
2004-03-01  3:46     ` Andy Lutomirski
2004-03-01  4:18       ` Peter Williams
2004-03-02 23:36   ` Peter Williams
     [not found] <894006121@toto.iv>
2004-03-01  0:00 ` Peter Chubb
2004-03-02  1:25   ` Peter Williams
     [not found] <fa.fi4j08o.17nchps@ifi.uio.no>
     [not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no>
2004-02-29 11:58   ` Joachim B Haga
2004-02-29 20:39     ` Paul Jackson
2004-02-29 22:56     ` Peter Williams
     [not found] <1tfy0-7ly-29@gated-at.bofh.it>
     [not found] ` <1thzJ-A5-13@gated-at.bofh.it>
     [not found]   ` <1tjrN-2m5-1@gated-at.bofh.it>
     [not found]     ` <1tjLa-2Ab-9@gated-at.bofh.it>
     [not found]       ` <1tlaf-3OY-11@gated-at.bofh.it>
     [not found]         ` <1tljX-3Wf-5@gated-at.bofh.it>
     [not found]           ` <1tznd-CP-35@gated-at.bofh.it>
     [not found]             ` <1tzQe-10s-25@gated-at.bofh.it>
2004-02-26 20:14               ` Bill Davidsen
2004-02-26  3:30 Albert Cahalan
2004-02-26  6:19 ` Peter Williams
2004-02-26 17:57   ` Albert Cahalan
2004-02-26 23:24     ` Peter Williams
2004-03-01  3:47       ` Peter Williams
     [not found] <fa.f12rt3d.c0s9rt@ifi.uio.no>
     [not found] ` <fa.ishajoq.q5g90m@ifi.uio.no>
2004-02-25 23:33   ` Junio C Hamano
2004-02-26  8:15     ` Catalin BOIE
  -- strict thread matches above, loose matches on Subject: below --
2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
2004-02-25 22:12   ` John Lee
2004-02-26  0:31     ` Timothy Miller
2004-02-26  2:04       ` John Lee
2004-02-26  2:18       ` Peter Williams
2004-02-26  2:42         ` Mike Fedyk
2004-02-26  4:10           ` Peter Williams
2004-02-26  4:19             ` Mike Fedyk
2004-02-26 19:23               ` Shailabh Nagar
2004-02-26 19:46                 ` Mike Fedyk
2004-02-26 20:42                   ` Shailabh Nagar
2004-02-26 16:10           ` Timothy Miller
2004-02-26 19:47             ` Mike Fedyk
2004-02-26 22:51             ` Peter Williams
2004-02-27 10:06               ` Helge Hafting
2004-02-27 11:04                 ` Peter Williams
2004-02-26 16:08         ` Timothy Miller
2004-02-26 16:51           ` Rik van Riel
2004-02-26 20:15           ` Peter Williams
2004-02-27 14:46             ` Timothy Miller
2004-02-28  5:00               ` Peter Williams
2004-03-04 21:18         ` Robert White
2004-03-04 23:15           ` Peter Williams
2004-02-26  2:48       ` Nuno Silva
2004-02-26  4:25         ` Peter Williams
2004-02-26 15:57           ` Rik van Riel
2004-02-26 19:28             ` Shailabh Nagar
2004-02-26 16:12         ` Timothy Miller
2004-02-25 22:51 ` Pavel Machek
2004-02-26  3:14   ` John Lee
2004-02-25 23:45 ` Rik van Riel
2004-02-26  7:18 ` John Lee
