* [RFC][PATCH] O(1) Entitlement Based Scheduler
@ 2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
` (3 more replies)
0 siblings, 4 replies; 66+ messages in thread
From: John Lee @ 2004-02-25 14:35 UTC (permalink / raw)
To: linux-kernel
Hi everyone,
This patch is a modification of the O(1) scheduler that introduces
O(1) entitlement based scheduling for SCHED_NORMAL tasks. It aims to keep the
scalability and efficiency of the current scheduler while also providing:
- Specific allocation of CPU resources amongst tasks
- Scheduling fairness and good interactive response without the need for
heuristics, and
- Reduced scheduler complexity.
The fundamental concept of entitlement based sharing is that each task has an
_entitlement_ to CPU resources that is determined by the number of _shares_
that it holds, and the scheduler allocates CPU to tasks so that the _rate_ at
which they receive CPU time is consistent with their entitlement.
The usage rates for each task are estimated using Kalman filter techniques, the
estimates being similar to those obtained by taking a running average over
twice the filter _response half life_ (see below). However, Kalman filter
values are cheaper to compute and don't require the maintenance of historical
usage data.
The use of CPU usage rates also makes it possible to impose _per task CPU
usage rate caps_. This patch provides both soft and hard CPU usage rate caps per
task. The difference between a hard and a soft cap is that when unused CPU
cycles are available a soft cap is allowed to be exceeded, whereas a hard cap
is _always_ enforced.
Features of the EBS scheduler
=============================
CPU shares
----------
Each task has a number of CPU shares that determine its entitlement. Shares can
be read/set directly via the files
/proc/<pid>/cpu_shares
/proc/<tgid>/task/<pid>/cpu_shares
or indirectly via setting the task's nice value using nice or renice. A task
may be allocated between 1 and 420 shares with 20 shares being the default
allocation. A nice value >= 0 is mapped to (20 - nice) shares and a value
< 0 is mapped to (20 + nice * nice) shares. If shares are set directly via
/proc/<pid>/cpu_shares then the task's nice value will be adjusted accordingly.
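The mapping above is simple enough to state directly in code (a sketch of the
rule as described, not the patch's actual function):

```c
/* nice >= 0 maps to (20 - nice) shares, nice < 0 to (20 + nice * nice),
 * giving 1 share at nice 19, the default 20 at nice 0, and the maximum
 * 420 at nice -20. */
int nice_to_shares(int nice)
{
    if (nice >= 0)
        return 20 - nice;
    return 20 + nice * nice;
}
```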
CPU usage rate caps
-------------------
A task's CPU usage rate cap imposes a soft (or hard) upper limit on the rate at
which it can use CPU resources and can be set/read via the files
/proc/<pid>/cpu_rate_cap
/proc/<tgid>/task/<pid>/cpu_rate_cap
Usage rate caps are expressed as rational numbers (e.g. "1 / 2") and hard caps
are signified by a "!" suffix. The rational number indicates the proportion
of a single CPU's capacity that the task may use. The value of the number must
be in the range 0.0 to 1.0 inclusive for soft caps. For hard caps there is an
additional restriction that a value of 0.0 is not permitted. Tasks with a
soft cap of 0.0 become true background tasks and only get to run when no other
tasks are active.
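For illustration, a user-space style sketch of parsing that format could look
like this (my own names and error handling, not the kernel code):

```c
#include <stdio.h>

/* Parse a cap of the form "1 / 2", with an optional "!" suffix for hard
 * caps.  The value num/den must lie in [0, 1]; a hard cap of 0 is rejected,
 * matching the rules described above.  Returns 0 on success, -1 on error. */
int parse_rate_cap(const char *buf, unsigned int *num,
                   unsigned int *den, int *hard)
{
    char bang[2] = "";
    int n = sscanf(buf, "%u / %u %1[!]", num, den, bang);

    if (n < 2 || *den == 0 || *num > *den)
        return -1;
    *hard = (n == 3);
    if (*hard && *num == 0)
        return -1;
    return 0;
}
```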
When hard capped tasks exceed their cap they are removed from the run queues
and placed in a "sinbin" for a short while until their usage rate decays to
within limits.
Scheduler Tuning Parameters
---------------------------
The characteristics of the Kalman filter used for usage rate estimates are
determined by the _response half life_. The default value for the half life is
5000 msecs, but this can be set to any value between 1000 and 100000 msecs
during a make config. Also, if the SCHED_DYNAMIC_HALF_LIFE config option is set
to Y, the half life can be modified dynamically on a running system within the
above range by writing to /proc/cpu_half_life.
Currently EBS gives all tasks a fixed default timeslice of 100msec. As with the
half life, this can be chosen at build time (between 1 and 500msec) and building
with the SCHED_DYNAMIC_TIME_SLICE option enables on-the-fly changes via
/proc/timeslice.
Performance of the EBS scheduler depends on these parameters, as well as the
load characteristics of the system. The current default settings, however, have
worked well with most loads so far.
Scheduler statistics
--------------------
Global and per task scheduling statistics are available via /proc. To reduce
the size of this post, the details are not listed here but can be seen at
<http://ebs.aurema.com>.
Implementation
==============
Those interested mainly in EBS performance can skip this section...
Effective entitlement per share
-------------------------------
Ideally, the CPU USAGE PER SHARE of tasks demanding as much CPU as they are
entitled to should be equal. By keeping track of the HIGHEST CPU usage per
share that has been observed and comparing it to the CPU usage per share for
each task that runs, tasks that are receiving less usage per share than the
one getting the most can be given a better priority, so they can "catch up".
This highest value of CPU usage per share is maintained on each runqueue as
that CPU's _effective entitlement per share_, and is used as a basis for all
priority computations on that CPU.
In a nutshell, the task that is receiving the most CPU usage for each of its
shares serves as the yardstick via which the treatment of other tasks on that
CPU is measured.
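In code, maintaining the yardstick amounts to one comparison per scheduling
event (a sketch with hypothetical names):

```c
/* Per-runqueue record of the highest CPU usage per share observed; all
 * priority calculations on this CPU are made relative to it. */
struct rq_ebs {
    double eff_entitlement_per_share;
};

void note_task_usage(struct rq_ebs *rq, double usage_rate, int shares)
{
    double ups = usage_rate / shares;

    if (ups > rq->eff_entitlement_per_share)
        rq->eff_entitlement_per_share = ups;
}
```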
Task priorities
---------------
The array switch and interactivity estimator have been removed - a task's
eligibility to run is determined purely by its priority. A task's priority is
recalculated when it is forked, wakes up, uses up its timeslice or is
preempted.
The following ratio:
task's usage_per_share
-----------------------------------------------------------------------
min(task's CPU rate cap per share, rq->effective_entitlement_per_share)
[where CPU rate cap per share is simply (CPU rate cap / CPU shares)]
is then mapped onto the SCHED_NORMAL priority range. The mapping is such
that a ratio of 1.0 is equivalent to the mean SCHED_NORMAL priority, and
ratios less than and greater than 1.0 are mapped to priorities less and
greater than the mean respectively. This serves to boost tasks
using less than their entitlement and penalise those using more than their
entitlement or CPU rate cap. It also provides interactive and I/O bound tasks
with favourable priorities since such tasks have inherently low CPU usage.
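A sketch of that mapping (the priority range constants and the linear scaling
are illustrative assumptions of mine; the patch's actual mapping may differ):

```c
#define EBS_MIN_PRIO 100   /* best SCHED_NORMAL priority (assumed) */
#define EBS_MAX_PRIO 139   /* worst (assumed) */
#define EBS_MID_PRIO ((EBS_MIN_PRIO + EBS_MAX_PRIO) / 2)

/* Map usage_per_share / min(cap per share, effective entitlement per share)
 * onto the priority range: a ratio of 1.0 lands on the mean priority, and
 * smaller ratios earn better (numerically lower) priorities. */
int ratio_to_prio(double usage_ps, double cap_ps, double eff_eps)
{
    double yardstick = cap_ps < eff_eps ? cap_ps : eff_eps;
    double ratio = usage_ps / yardstick;
    int prio = EBS_MID_PRIO +
               (int)((ratio - 1.0) * (EBS_MID_PRIO - EBS_MIN_PRIO));

    if (prio < EBS_MIN_PRIO)
        prio = EBS_MIN_PRIO;
    if (prio > EBS_MAX_PRIO)
        prio = EBS_MAX_PRIO;
    return prio;
}
```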
Lastly, the ->prio field in the task structure has been eliminated. The
runqueue structure stores the priority of the currently running task, and
enqueue/dequeue_task() have been adapted to work without ->prio. The reason for
getting rid of ->prio is to facilitate the O(1) priority promotion of runnable
tasks, explained below.
Only one heuristic
------------------
All the heuristics used by the stock scheduler have been removed. The EBS
scheduler uses only one heuristic: newly forked child tasks are given the
same usage rate as their parent, rather than a zero usage. This is done to
mitigate the "ramp up" effect to some extent. "Ramp up" is the delay between
a change in a task's usage rate and the Kalman filter estimating the new rate,
which in this case could cause a parent to be swamped by its children. Total
elimination of ramp up is undesirable as it is also responsible for good
responsiveness of interactive processes.
O(1) task promotion
-------------------
When the system is busy, tasks waiting on run queues will have decaying usages,
which means that eventually tasks which have been waiting long enough will be
entitled to a priority boost. Not giving them a boost will result in unfair
starvation, but on the other hand periodically visiting every runnable task is
an O(n) operation.
By inverting the priority and Kalman filter functions, it is possible to
determine at priority calculation time after how long a task will be entitled to
be promoted to the next best priority. These "promotion times" for each
SCHED_NORMAL priority are then divided by the smallest of these times (call
this the promotion_interval) to obtain the "number of promotion intervals"
for each priority. Naturally these are stored in a table indexed by priority.
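Building that table amounts to one divide per priority (a sketch with made-up
names, not the patch's code):

```c
/* Given the promotion time (in jiffies) computed for each SCHED_NORMAL
 * priority, record the shortest one as the base promotion_interval and
 * express every other time as a whole number of such intervals. */
void compute_promo_intervals(const unsigned long *times, int n,
                             unsigned long *intervals, unsigned long *base)
{
    unsigned long min = times[0];
    int i;

    for (i = 1; i < n; i++)
        if (times[i] < min)
            min = times[i];
    *base = min;
    for (i = 0; i < n; i++)
        intervals[i] = times[i] / min;
}
```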
enqueue_task() now places SCHED_NORMAL tasks onto an appropriate "promotion
list" as well as the run queue. The list is determined by the enqueued task's
current priority and the number of promotion intervals that must pass before
it is eligible to be bumped up to the next best priority. And of course
dequeue_task() takes tasks off their promotion list. Therefore, tasks that
get to run before they are due for a promotion (which is usually the case)
don't get one.
Every promotion_interval jiffies, scheduler_tick() looks at only the promotion
lists that are now due for a priority bump, and anything on the lists is given
the required boost. (The highest SCHED_NORMAL and background priorities
are ignored, as these tasks don't need promotion). Regardless of the
number that need promoting, this is done in O(1) time. The promotion code
exploits the fact that tasks of the same priority that are due for promotion
at the same time, ie. are contiguous on a promotion list, ARE ALSO CONTIGUOUS
ON THE RUN QUEUE. Therefore, promoting N tasks at priority X is simply a matter
of "lifting" these tasks out of their current place on the runqueue in one
chunk, and appending this chunk to the (X - 1) priority list. These tasks are
then placed onto a new promotion list according to their new (X - 1) priority.
This simple list operation is made possible by not having to update each task's
->prio field (now that it has been removed) when moving them to their new
position on the runqueue.
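The constant-time "lift" is an ordinary doubly-linked-list splice. A minimal
stand-alone sketch (not the kernel's list.h, though the pointer surgery is the
same idea):

```c
struct list_head {
    struct list_head *prev, *next;
};

void list_init(struct list_head *h)
{
    h->prev = h->next = h;
}

/* Unlink the contiguous chunk [first..last] from whatever queue it is on
 * and append it to the tail of dest.  A fixed number of pointer updates,
 * regardless of how many tasks sit in the chunk - hence O(1). */
void splice_chunk(struct list_head *first, struct list_head *last,
                  struct list_head *dest)
{
    first->prev->next = last->next;
    last->next->prev = first->prev;

    last->next = dest;
    first->prev = dest->prev;
    dest->prev->next = first;
    dest->prev = last;
}
```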
Benchmarks
==========
Benchmarking was done using contest. The following are the results of running on
a dual PIII 866MHz with 256MB RAM.
no_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 78 166.7 0 26.9 1.00
2.6.2-EBS 3 74 175.7 0 16.2 1.00
cacherun:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 75 173.3 0 24.0 0.96
2.6.2-EBS 3 71 183.1 0 12.7 0.96
process_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 93 138.7 35 54.8 1.19
2.6.2-EBS 3 91 141.8 33 53.8 1.23
ctar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 99 141.4 1 5.1 1.27
2.6.2-EBS 3 95 145.3 1 4.2 1.28
xtar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 93 147.3 1 9.7 1.19
2.6.2-EBS 3 89 152.8 1 6.7 1.20
io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 99 144.4 4 14.1 1.27
2.6.2-EBS 3 91 153.8 3 10.9 1.23
read_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 193 73.6 13 7.8 2.47
2.6.2-EBS 3 136 101.5 7 6.6 1.84
list_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 83 160.2 0 4.8 1.06
2.6.2-EBS 3 80 165.0 0 2.5 1.08
mem_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 159 87.4 129 3.1 2.04
2.6.2-EBS 3 126 110.3 50 2.4 1.70
dbench_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 123 109.8 1 21.1 1.58
2.6.2-EBS 3 130 103.1 1 20.0 1.76
The big winners are read_load and mem_load, with all the others being
slightly faster than the stock kernel; only dbench was slightly worse.
X Windows Performance
=====================
The X server isn't strictly an interactive process, but it does have a major
influence on interactive response. The fact that it services a large number
of clients means that its CPU usage rate can be quite high, and this negates
the above mentioned favourable treatment of interactive and I/O bound
processes.
Therefore, for best interactive feel, it is recommended that the X server run
with a nice value of at least -15. From my own testing, doing a window wiggle
test with a make -j16 in the background and X reniced was slightly better than
for the stock kernel.
When running apps such as xmms, I recommend that they should be reniced as well
when the background load is high. With the above setup and xmms reniced to -9,
there were no sound skips at all (without renicing, a few skips could be
detected).
Getting the patch
=================
The patch can be downloaded from
<http://sourceforge.net/projects/ebs-linux/>
Please note that there are 2 patches: the basic patch and the full patch. The
above description applies to the full patch. The basic patch only features
setting shares via nice, a fixed half life and timeslice, no statistics and
soft caps only. This basic patch is for those who are mainly interested in
looking at the core EBS changes to the stock scheduler.
The patches are against 2.6.2 (2.6.3 patches will be available shortly).
Comments, suggestions and testing are much appreciated.
The mailing list for this project is at
<http://lists.sourceforge.net/lists/listinfo/ebs-linux-devel>.
Cheers,
John
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 14:35 John Lee
@ 2004-02-25 17:09 ` Timothy Miller
2004-02-25 22:12 ` John Lee
2004-02-25 22:51 ` Pavel Machek
` (2 subsequent siblings)
3 siblings, 1 reply; 66+ messages in thread
From: Timothy Miller @ 2004-02-25 17:09 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
John Lee wrote:
> X Windows Performance
> =====================
>
> The X server isn't strictly an interactive process, but it does have a major
> influence on interactive response. The fact that it services a large number
> of clients means that its CPU usage rate can be quite high, and this negates
> the above mentioned favourable treatment of interactive and I/O bound
> processes.
>
> Therefore, for best interactive feel, it is recommended that the X server run
> with a nice value of at least -15. From my own testing, doing a window wiggle
> test with a make -j16 in the background and X reniced was slightly better than
> for the stock kernel.
>
> When running apps such as xmms, I recommend that they should be reniced as well
> when the background load is high. With the above setup and xmms reniced to -9,
> there were no sound skips at all (without renicing, a few skips could be
> detected).
Well, considering that X is suid root, it's okay to require that it be
run at nice -15, but how is the user without root access going to renice
xmms? Even for those who do, they're not going to want to have to
renice xmms every time they run it. Furthermore, it seems like a bad
idea to keep marking more and more programs as suid root just so that
they can boost their priority.
Not to say that your idea is bad... in fact, it may be a pipe dream to
get "flawless" interactivity without explicitly marking which programs
have to be boosted in priority. Still, Nick and Con have done a
wonderful job at getting close.
This is a tough problem.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 17:09 ` Timothy Miller
@ 2004-02-25 22:12 ` John Lee
2004-02-26 0:31 ` Timothy Miller
0 siblings, 1 reply; 66+ messages in thread
From: John Lee @ 2004-02-25 22:12 UTC (permalink / raw)
To: Timothy Miller; +Cc: linux-kernel
On Wed, 25 Feb 2004, Timothy Miller wrote:
> Well, considering that X is suid root, it's okay to require that it be
> run at nice -15, but how is the user without root access going to renice
> xmms?
Hm, I would have thought the vast majority of xmms users would be running
it on their own machines, to which they have root access. Hope I'm not
missing something here... :-)
> Even for those who do, they're not going to want to have to
> renice xmms every time they run it. Furthermore, it seems like a bad
> idea to keep marking more and more programs as suid root just so that
> they can boost their priority.
Assuming that all/most xmms users do have root permissions, I would think
that this is a very minor inconvenience... isn't xmms something which you
tend to start up once and leave running until you log out?
I don't think xmms needs to be an suid program, it can just be given a
renice once (ie. more shares, -9 ==> 101 shares, which is 5 times
the default, just my choice) and then left alone. Furthermore, the
controls that my patch features are intended to be exercised as root;
normal users can do less (as with nice, you can give your own processes
less shares but not more, and can apply _more_ restrictive CPU caps on
your tasks).
From my testing so far, X and xmms have been the only candidates for a
shares increase, as these two have been the most talked about :-).
And after all, one purpose of the patch is to allow users to allocate CPU
to their tasks in any way they deem fit.
> Not to say that your idea is bad... in fact, it may be a pipe dream to
> get "flawless" interactivity without explicitly marking which programs
> have to be boosted in priority. Still, Nick and Con have done a
> wonderful job at getting close.
They have indeed, there haven't been any "poor interactivity" emails for a
while now :-).
Good interactivity was just one of my goals, I was also aiming for better
CPU resource allocation and simplification of the main code paths in the
scheduler by doing away with heuristics, and therefore better throughput.
Cheers,
John
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
@ 2004-02-25 22:51 ` Pavel Machek
2004-02-26 3:14 ` John Lee
2004-02-25 23:45 ` Rik van Riel
2004-02-26 7:18 ` John Lee
3 siblings, 1 reply; 66+ messages in thread
From: Pavel Machek @ 2004-02-25 22:51 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
Hi!
> CPU usage rate caps
> -------------------
>
> A task's CPU usage rate cap imposes a soft (or hard) upper limit on the rate at
> which it can use CPU resources and can be set/read via the files
>
> /proc/<pid>/cpu_rate_cap
> /proc/<tgid>/task/<pid>/cpu_rate_cap
>
> Usage rate caps are expressed as rational numbers (e.g. "1 / 2") and hard caps
> are signified by a "!" suffix. The rational number indicates the proportion
> of a single CPU's capacity that the task may use. The value of the number must
> be in the range 0.0 to 1.0 inclusive for soft caps. For hard caps there is an
> additional restriction that a value of 0.0 is not permitted. Tasks with a
> soft cap of 0.0 become true background tasks and only get to run when no other
> tasks are active.
Why not use something like percent, parts per million or whatever?
> When hard capped tasks exceed their cap they are removed from the run queues
> and placed in a "sinbin" for a short while until their usage rate decays to
> within limits.
How do you solve this one?
I want to kill your system.
I launch task A, "semaphore grabber", that does filesystem
operations. Those need semaphores. I run it as "true background".
I wait for A to grab some lock, then I run B, which is while(1);
A holds lock that can not be unlocked, and your system is dead.
This may happen randomly, even without me on your system.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <fa.ishajoq.q5g90m@ifi.uio.no>
@ 2004-02-25 23:33 ` Junio C Hamano
2004-02-26 8:15 ` Catalin BOIE
0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2004-02-25 23:33 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
>>>>> "JL" == John Lee <johnl@aurema.com> writes:
JL> On Wed, 25 Feb 2004, Timothy Miller wrote:
>> Even for those who do, they're not going to want to have to
>> renice xmms every time they run it. Furthermore, it seems like a bad
>> idea to keep marking more and more programs as suid root just so that
>> they can boost their priority.
I seem to recall reading here that xmms attempts to grab
SCHED_RR if possible. If that is indeed the case, I suspect
that the above suggestion to run xmms as root does not let the
user exploit the strength of EBS. Since, as I understand it,
EBS affects SCHED_NORMAL processes, not SCHED_RR processes.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
2004-02-25 22:51 ` Pavel Machek
@ 2004-02-25 23:45 ` Rik van Riel
2004-02-26 7:18 ` John Lee
3 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2004-02-25 23:45 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
On Thu, 26 Feb 2004, John Lee wrote:
> Only one heuristic
I really like the fact that this patch gets rid of most
of the "magic tricks" that the current O(1) scheduler
needs in order to work well for most interactive users.
> O(1) task promotion
... while still being O(1)
The share based stuff should also tie in nicely with
the various resource management projects out there.
cheers,
Rik
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 22:12 ` John Lee
@ 2004-02-26 0:31 ` Timothy Miller
2004-02-26 2:04 ` John Lee
` (2 more replies)
0 siblings, 3 replies; 66+ messages in thread
From: Timothy Miller @ 2004-02-26 0:31 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
John Lee wrote:
>
> On Wed, 25 Feb 2004, Timothy Miller wrote:
>
>
>>Well, considering that X is suid root, it's okay to require that it be
>>run at nice -15, but how is the user without root access going to renice
>>xmms?
>
>
> Hm, I would have thought the vast majority of xmms users would be running
> it on their own machines, to which they have root access. Hope I'm not
> missing something here... :-)
It's a security concern to have to login as root unnecessarily. It's
bad enough we have to do that to change X11 configuration, but we
shouldn't have to do that every time we want to start xmms. And just
suid root is also a security concern.
>
>
>>Even for those who do, they're not going to want to have to
>>renice xmms every time they run it. Furthermore, it seems like a bad
>>idea to keep marking more and more programs as suid root just so that
>>they can boost their priority.
>
>
> Assuming that all/most xmms users do have root permissions, I would think
> that this is a very minor inconvenience... isn't xmms something which you
> tend to start up once and leave running until you log out?
This is a bad assumption. You should never require users to login as
root to do basic user-oriented tasks. Indeed, it's often nice to have
an icon or menu option to start it without having to pull up a terminal,
and if the program ASKS for the root password, it's annoying to have to
type that in just to get it to start.
Under Solaris, a number of device nodes (sound, serial ports, etc.) have
their ownership changed to that of the user who logs into the console.
This is so that they can access those devices without logging in as root.
What about computer labs of Linux boxes where users do not own the
computers and are therefore not allowed to log in as root? Should they
be prohibited from running xmms properly?
For a while, the sysadmin here at work tried to deploy Windows boxes
with restricted user privileges, and the users were not given the admin
password. For the engineers, that changed, fortunately. But consider
what would happen if Linux boxes were deployed that way. It would suck
if Windows users could still listen to MP3's during heavy CPU usage but
Linux users could not. There are many good reasons to lock down
workstations and not provide root access.
>
> I don't think xmms needs to be an suid program, it can just be given a
> renice once (ie. more shares, -9 ==> 101 shares, which is 5 times
> the default, just my choice) and then left alone. Furthermore, the
> controls that my patch features are intended to be exercised as root,
> normal users can do less (as for nice, you can give your own processes
> less shares but not more, and can apply _more_ restrictive CPU caps on
> your tasks).
If someone does not own the box they're using, but they want to, say,
contribute to xmms development, they're going to be starting and
stopping the program quite frequently. They're not going to have any
way to set the nice level.
Consider what happens if some other user logs in remotely to that
workstation and starts a large compile.
> From my testing so far, X and xmms have been the only candidates for a
> shares increase, as these two have been the most talked about :-).
> And after all, one purpose of the patch is to allow users to allocate CPU
> to their tasks in any way they deem fit.
They are the most talked about, so you tested them. Fine. But we all
know that they are not representative samples. There are bound to be
numerous other programs that have similar problems.
The way your scheduler works, USERS cannot "allocate CPU to their tasks
in any way they deem fit". Only system administrators can.
>
>>Not to say that your idea is bad... in fact, it may be a pipe dream to
>>get "flawless" interactivity without explicitly marking which programs
>>have to be boosted in priority. Still, Nick and Con have done a
>>wonderful job at getting close.
>
>
> They have indeed, there haven't been any "poor interactivity" emails for a
> while now :-).
>
> Good interactivity was just one of my goals, I was also aiming for better
> CPU resource allocation and simplification of the main code paths in the
> scheduler by doing away with heuristics, and therefore better throughput.
I read your paper, and I think you have some wonderful ideas. Don't get
me wrong. I think that your ideas, coupled with an interactivity
estimator, have an excellent chance of producing a better scheduler.
In fact, that may be the only "flaw" in your design. It sounds like
your scheduler does an excellent job at fairness with very low overhead.
The only problem with it is that it doesn't determine priority
dynamically.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 0:31 ` Timothy Miller
@ 2004-02-26 2:04 ` John Lee
2004-02-26 2:18 ` Peter Williams
2004-02-26 2:48 ` Nuno Silva
2 siblings, 0 replies; 66+ messages in thread
From: John Lee @ 2004-02-26 2:04 UTC (permalink / raw)
To: Timothy Miller; +Cc: linux-kernel
On Wed, 25 Feb 2004, Timothy Miller wrote:
> > Hm, I would have thought the vast majority of xmms users would be running
> > it on their own machines, to which they have root access. Hope I'm not
> > missing something here... :-)
>
> It's a security concern to have to login as root unnecessarily. It's
> bad enough we have to do that to change X11 configuration, but we
> shouldn't have to do that every time we want to start xmms. And just
> suid root is also a security concern.
Ah, OK. Security point taken.
> > Assuming that all/most xmms users do have root permissions, I would think
> > that this is a very minor inconvenience... isn't xmms something which you
> > tend to start up once and leave running until you log out?
<snip>
> What about computer labs of Linux boxes where users do not own the
> computers and are therefore not allowed to login as root. Should they
> be prohibited from running xmms properly?
>
> If someone does not own the box they're using, but they want to, say,
> contribute to xmms development, they're going to be starting and
> stopping the program quite frequently. They're not going to have any
> way to set the nice level.
>
> Consider what happens if some other user logs in remotely to that
> workstation and starts a large compile.
Valid points. I guess I've been too accustomed to playing MP3s on my own
box :-(.
> > From my testing so far, X and xmms have been the only candidates for a
>
> They are the most talked about, so you tested them. Fine. But we all
> know that they are not representative samples. There are bound to be
> numerous other programs that have similar problems.
No others that I've noticed yet, but yes you're probably right. Which is
why I'm really looking forward to getting feedback in this area.
> The way your scheduler works, USERS cannot "allocate CPU to their tasks
> in any way they deem fit". Only system administrators can.
Correct, I used the wrong word. Another symptom of too much play on my own
boxes...
> I read your paper, and I think you have some wonderful ideas. Don't get
> me wrong. I think that your ideas, coupled with an interactivity
> estimator, have an excellent chance of producing a better scheduler.
Thanks :-).
And thanks for your feedback.
Cheers,
John
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 0:31 ` Timothy Miller
2004-02-26 2:04 ` John Lee
@ 2004-02-26 2:18 ` Peter Williams
2004-02-26 2:42 ` Mike Fedyk
` (2 more replies)
2004-02-26 2:48 ` Nuno Silva
2 siblings, 3 replies; 66+ messages in thread
From: Peter Williams @ 2004-02-26 2:18 UTC (permalink / raw)
To: Timothy Miller; +Cc: linux-kernel
Timothy Miller wrote:
> <snip>
> In fact, that may be the only "flaw" in your design. It sounds like
> your scheduler does an excellent job at fairness with very low overhead.
> The only problem with it is that it doesn't determine priority
> dynamically.
This (i.e. automatic renicing of specified programs) is a good idea but
is not really a function that should be undertaken by the scheduler
itself. Two possible solutions spring to mind:
1. modify the do_execve() in fs/exec.c to renice tasks when they execute
specified binaries
2. have a user space daemon poll running tasks periodically and renice
them if they are running specified binaries
Both of these solutions have their advantages and disadvantages, are
(obviously) more complicated than I've made them sound and would require a
great deal of care to be taken during their implementation. However, I
think that they are both doable. My personal preference would be for
the in kernel solution on the grounds of efficiency.
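For what it's worth, a rough user-space sketch of option 2 might look like the
following (it uses the /proc/<pid>/comm interface and names of my own choosing;
a real daemon would loop with a sleep, rate-limit itself and read its policy
from a config file):

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Scan /proc once and renice every task whose command name matches `name`.
 * Returns the number of tasks reniced, or -1 if /proc is unavailable. */
int renice_by_comm(const char *name, int nice_val)
{
    DIR *d = opendir("/proc");
    struct dirent *de;
    int hits = 0;

    if (!d)
        return -1;
    while ((de = readdir(d)) != NULL) {
        char path[64], comm[64] = "";
        pid_t pid = (pid_t)atoi(de->d_name);
        FILE *f;

        if (pid <= 0)
            continue;   /* skip non-numeric /proc entries */
        snprintf(path, sizeof(path), "/proc/%d/comm", (int)pid);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(comm, sizeof(comm), f)) {
            comm[strcspn(comm, "\n")] = '\0';
            if (strcmp(comm, name) == 0 &&
                setpriority(PRIO_PROCESS, pid, nice_val) == 0)
                hits++;
        }
        fclose(f);
    }
    closedir(d);
    return hits;
}
```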
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:18 ` Peter Williams
@ 2004-02-26 2:42 ` Mike Fedyk
2004-02-26 4:10 ` Peter Williams
2004-02-26 16:10 ` Timothy Miller
2004-02-26 16:08 ` Timothy Miller
2004-03-04 21:18 ` Robert White
2 siblings, 2 replies; 66+ messages in thread
From: Mike Fedyk @ 2004-02-26 2:42 UTC (permalink / raw)
To: Peter Williams; +Cc: Timothy Miller, linux-kernel
Peter Williams wrote:
> 2. have a user space daemon poll running tasks periodically and renice
> them if they are running specified binaries
>
> Both of these solutions have their advantages and disadvantages, are
> (obviously) more complicated than I've made them sound and would require a
> great deal of care to be taken during their implementation. However, I
> think that they are both doable. My personal preference would be for
> the in kernel solution on the grounds of efficiency.
Better would be to have the kernel tell the daemon whenever a process is
exec-ed; then you have simplicity in the kernel and policy in user space.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 0:31 ` Timothy Miller
2004-02-26 2:04 ` John Lee
2004-02-26 2:18 ` Peter Williams
@ 2004-02-26 2:48 ` Nuno Silva
2004-02-26 4:25 ` Peter Williams
2004-02-26 16:12 ` Timothy Miller
2 siblings, 2 replies; 66+ messages in thread
From: Nuno Silva @ 2004-02-26 2:48 UTC (permalink / raw)
To: Timothy Miller; +Cc: John Lee, linux-kernel
Timothy Miller wrote:
[..]
>
>
> It's a security concern to have to login as root unnecessarily. It's
> bad enough we have to do that to change X11 configuration, but we
> shouldn't have to do that every time we want to start xmms. And just
> suid root is also a security concern.
>
Maybe I'm missing something, but xmms runs OK with zero load, right? The
problem is that, when building the kernel and the entire kde tree, each
with make -j 16, xmms skips a few times? Well, tough luck...
And the user *can* do something about it: just nice -n 19 the builds and
leave xmms alone. (Or you can use another player... :-)
With this patch you can even say that each of the build processes can
only hog 5% (at the most!) of the CPU (maybe a build is not a good
example for mandatory CPU time caps, but it is useful).
Besides, this implements a true run-only-when-no-one-else-wants-to-run
nice mode which, combined with the absolute CPU time caps, hits some of my
wish list for a complete scheduler :-) so I can't wait to test it :-)
A final note to John Lee: you may want to check the Class-based Kernel
Resource Management (CKRM) at:
http://ckrm.sourceforge.net/
Thanks,
Nuno Silva
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 22:51 ` Pavel Machek
@ 2004-02-26 3:14 ` John Lee
0 siblings, 0 replies; 66+ messages in thread
From: John Lee @ 2004-02-26 3:14 UTC (permalink / raw)
To: Pavel Machek; +Cc: linux-kernel
On Wed, 25 Feb 2004, Pavel Machek wrote:
> > Usage rate caps are expressed as rational numbers (e.g. "1 / 2") and hard caps
> > are signified by a "!" suffix. The rational number indicates the proportion
> > of a single CPU's capacity that the task may use. The value of the number must
> > be in the range 0.0 to 1.0 inclusive for soft caps. For hard caps there is an
> > additional restriction that a value of 0.0 is not permitted. Tasks with a
> > soft cap of 0.0 become true background tasks and only get to run when no other
> > tasks are active.
>
> Why not use something like percent, parts per milion or whatever?
Fair comment. Fine granularity with percentages would require decimal
points and a function in the kernel to parse that value - maybe I could
get around to doing that. But I suppose ppm could certainly be used. We
just chose rational numbers to start with.
> > When hard capped tasks exceed their cap they are removed from the run queues
> > and placed in a "sinbin" for a short while until their usage rate decays to
> > within limits.
>
> How do you solve this one?
The task is removed from the runqueue and a timer is scheduled to put it
back onto the runqueue. The delay period is the required amount of time
for that task's usage to decay to below its cap.
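The required delay has a closed form if usage is modelled as decaying exponentially with the filter's response half life. A minimal sketch (the function name and the pure-exponential-decay model are illustrative assumptions, not code from the patch):

```python
import math

def sinbin_delay(usage, cap, half_life):
    """Time until an exponentially decaying usage estimate falls below cap.

    Assumes usage decays as usage * 2 ** (-t / half_life); solving
    usage * 2 ** (-t / half_life) <= cap for t gives
    t >= half_life * log2(usage / cap).
    """
    if usage <= cap:
        return 0.0  # already within limits; no sinbin time needed
    return half_life * math.log2(usage / cap)

# A task at 40% usage with a 10% hard cap and a 100 ms half life
# must sit out two half lives (40% -> 20% -> 10%), i.e. 200 ms.
delay = sinbin_delay(0.40, 0.10, 100.0)
```

The timer would then be armed for `delay` before the task is put back on the runqueue.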
> I want to kill your system.
>
> I launch task A, "semaphore grabber", that does filesystem
> operations. Those need semaphores. I run it as "true background".
>
> I wait for A to grab some lock, then I run B, which is while(1);
>
> A holds lock that can not be unlocked, and your system is dead.
>
> This may happen randomly, even without me on your system.
Good point. We'll have to rethink background priorities.
John
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
@ 2004-02-26 3:30 Albert Cahalan
2004-02-26 6:19 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Albert Cahalan @ 2004-02-26 3:30 UTC (permalink / raw)
To: linux-kernel mailing list; +Cc: johnl
John Lee writes:
> The usage rates for each task are estimated using Kalman
> filter techniques, the estimates being similar to those
> obtained by taking a running average over twice the filter
> _response half life_ (see below). However, Kalman filter
> values are cheaper to compute and don't require the
> maintenance of historical usage data.
Linux dearly needs this. Please separate out this part
of the patch and send it in.
Right now, Linux does not report the recent CPU usage
of a process. The UNIX standard requires that "ps"
report this; right now ps substitutes CPU usage over
the whole lifetime of a process.
Both per-task and per-process (tid and tgid) numbers
are needed. Both percent and permill (1/1000) units
get reported, so don't convert to integer percent.
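The constant-space recursion being praised here can be sketched as a simple exponential smoothing update (this is an illustrative model of the idea; the actual filter gains and fixed-point arithmetic in the patch may differ):

```python
class UsageEstimator:
    """Constant-space CPU usage-rate estimate, updated once per period.

    Each update decays the old estimate and mixes in the new sample, so
    no per-period history needs to be kept -- unlike a true running
    average, which would require storing past samples.
    """
    def __init__(self, half_life):
        # Per-update decay factor: the estimate halves after
        # `half_life` updates with a zero sample.
        self.decay = 0.5 ** (1.0 / half_life)
        self.estimate = 0.0

    def update(self, sample):
        self.estimate = self.decay * self.estimate + (1.0 - self.decay) * sample
        return self.estimate

est = UsageEstimator(half_life=8)
for _ in range(100):
    est.update(1.0)   # task runs flat out; estimate approaches 1.0
```

A per-task field updated this way at each tick would give "ps" the recent-usage figure being asked for, at the cost of one multiply-add per update.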
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:42 ` Mike Fedyk
@ 2004-02-26 4:10 ` Peter Williams
2004-02-26 4:19 ` Mike Fedyk
2004-02-26 16:10 ` Timothy Miller
1 sibling, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 4:10 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Timothy Miller, linux-kernel
Mike Fedyk wrote:
> Peter Williams wrote:
>
>> 2. have a user space daemon poll running tasks periodically and renice
>> them if they are running specified binaries
>>
>> Both of these solutions have their advantages and disadvantages, are
>> (obviously) more complicated than I've made them sound and would require a
>> great deal of care to be taken during their implementation. However,
>> I think that they are both doable. My personal preference would be
>> for the in kernel solution on the grounds of efficiency.
>
>
> Better would be to have the kernel tell the daemon whenever a process is
> exec-ed, and you have simplicity in the kernel, and policy in user space.
Yes. That would be a good solution. Does a mechanism that allows the
kernel to notify specific programs about specific events like this exist?
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 4:10 ` Peter Williams
@ 2004-02-26 4:19 ` Mike Fedyk
2004-02-26 19:23 ` Shailabh Nagar
0 siblings, 1 reply; 66+ messages in thread
From: Mike Fedyk @ 2004-02-26 4:19 UTC (permalink / raw)
To: Peter Williams; +Cc: Timothy Miller, linux-kernel
Peter Williams wrote:
> Mike Fedyk wrote:
>
>> Peter Williams wrote:
>>
>>> 2. have a user space daemon poll running tasks periodically and
>>> renice them if they are running specified binaries
>>>
>>> Both of these solutions have their advantages and disadvantages, are
>>> (obviously) more complicated than I've made them sound and would require a
>>> great deal of care to be taken during their implementation. However,
>>> I think that they are both doable. My personal preference would be
>>> for the in kernel solution on the grounds of efficiency.
>>
>>
>>
>> Better would be to have the kernel tell the daemon whenever a process
>> is exec-ed, and you have simplicity in the kernel, and policy in user
>> space.
>
>
> Yes. That would be a good solution. Does a mechanism that allows the
> kernel to notify specific programs about specific events like this exist?
I'm sure DaveM would suggest Netlink, but there are probably several
implementations for Linux.
I'll let other more knowledgeable people fill in the list.
Mike
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:48 ` Nuno Silva
@ 2004-02-26 4:25 ` Peter Williams
2004-02-26 15:57 ` Rik van Riel
2004-02-26 16:12 ` Timothy Miller
1 sibling, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 4:25 UTC (permalink / raw)
To: Nuno Silva; +Cc: Timothy Miller, John Lee, linux-kernel
Nuno Silva wrote:
>
>
> Timothy Miller wrote:
>
> [..]
>
>>
>>
>> It's a security concern to have to login as root unnecessarily. It's
>> bad enough we have to do that to change X11 configuration, but we
>> shouldn't have to do that every time we want to start xmms. And just
>> suid root is also a security concern.
>>
>
> Maybe I'm missing something, but xmms runs OK with zero load, right? The
> problem is that, when building the kernel and the entire kde tree, each
> with make -j 16, xmms skips a few times? Well, tough luck...
>
> And the user *can* do something about it: just nice -n 19 the builds and
> leave xmms alone. (Or you can use another player... :-)
>
> With this patch you can even say that each of the build processes can
> only hog 5% (at most!) of the CPU (maybe the build is not a good
> example for mandatory CPU time caps, but it is useful).
>
> Besides, this implements a true run-only-when-no-one-else-wants-to-run
> nice mode which, combined with the absolute cpu time caps, hits some of my
> wish list for a complete scheduler :-) so I can't wait to test it :-)
Another idea that we are playing with for handling programs like xmms
(i.e. programs that require guaranteed CPU bandwidth to perform well) is
the complement of caps, namely per-task CPU reservations. The
availability of CPU usage rate statistics for each task makes this
possible, but the question is "Is the functionality worth the extra
overhead?".
Of course, this won't solve the "need to be root" problem as this is
obviously the sort of control that should be reserved for root but it is
arguably better than having to guess how many shares a task needs to
ensure that it gets the required CPU bandwidth.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 3:30 [RFC][PATCH] O(1) Entitlement Based Scheduler Albert Cahalan
@ 2004-02-26 6:19 ` Peter Williams
2004-02-26 17:57 ` Albert Cahalan
0 siblings, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 6:19 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list, johnl
Albert Cahalan wrote:
> John Lee writes:
>
>
>>The usage rates for each task are estimated using Kalman
>>filter techniques, the estimates being similar to those
>>obtained by taking a running average over twice the filter
>>_response half life_ (see below). However, Kalman filter
>>values are cheaper to compute and don't require the
>>maintenance of historical usage data.
>
>
> Linux dearly needs this. Please separate out this part
> of the patch and send it in.
This information can be determined from the SleepAVG: field in the
/proc/<pid>/status and /proc/<tgid>/task/<pid>/status files by
subtracting the value there from 100. Without our patch this value is a
directly calculated estimate of the task's sleep rate, which is
available because it is used by the O(1) scheduler's heuristics. With our
patches, it is calculated from our estimate of the task's usage because
we dispensed with the sleep average calculations as they are no longer
needed. We decided to still report sleep average in the status file
because we were reluctant to alter the contents of such files in case we
broke user space programs.
>
> Right now, Linux does not report the recent CPU usage
> of a process. The UNIX standard requires that "ps"
> report this; right now ps substitutes CPU usage over
> the whole lifetime of a process.
>
> Both per-task and per-process (tid and tgid) numbers
> are needed. Both percent and permill (1/1000) units
> get reported, so don't convert to integer percent.
I think a modification to fs/proc/array.c to make this field a per
million rather than a percent value would satisfy your needs. It would
be a very small change but there would be concerns about breaking
programs that rely on it being a percentage.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 14:35 John Lee
` (2 preceding siblings ...)
2004-02-25 23:45 ` Rik van Riel
@ 2004-02-26 7:18 ` John Lee
3 siblings, 0 replies; 66+ messages in thread
From: John Lee @ 2004-02-26 7:18 UTC (permalink / raw)
To: linux-kernel
On Thu, 26 Feb 2004, John Lee wrote:
> The patch can be downloaded from
>
> <http://sourceforge.net/projects/ebs-linux/>
>
> Please note that there are 2 patches: the basic patch and the full patch. The
> above description applies to the full patch. The basic patch only features
> setting shares via nice, a fixed half life and timeslice, no statistics and
> soft caps only. This basic patch is for those who are mainly interested in
> looking at the core EBS changes to the stock scheduler.
>
> The patches are against 2.6.2 (2.6.3 patches will be available shortly).
2.6.3 patches are now available.
John
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-25 23:33 ` Junio C Hamano
@ 2004-02-26 8:15 ` Catalin BOIE
0 siblings, 0 replies; 66+ messages in thread
From: Catalin BOIE @ 2004-02-26 8:15 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
Hi!
I think Aurema is trying to push this more for server stuff than for
workstations.
I think it is a great project. I hope it will be included.
---
Catalin(ux) BOIE
catab@deuroconsult.ro
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 4:25 ` Peter Williams
@ 2004-02-26 15:57 ` Rik van Riel
2004-02-26 19:28 ` Shailabh Nagar
0 siblings, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2004-02-26 15:57 UTC (permalink / raw)
To: Peter Williams
Cc: Nuno Silva, Timothy Miller, John Lee, linux-kernel, ckrm-tech
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1049 bytes --]
On Thu, 26 Feb 2004, Peter Williams wrote:
> Another idea that we are playing with for handling programs like xmms
> (i.e. programs that require guaranteed CPU bandwidth to perform well) is
> the complement of caps namely per task CPU reservations.
> Of course, this won't solve the "need to be root" problem as this is
> obviously the sort of control that should be reserved for root
Not necessarily. We've just fixed this dilemma in the CKRM
project, using a resource class filesystem for this kind of
stuff.
A user could have a certain percentage of the CPU guaranteed
(especially the console user) and carve out part of his/her
guarantee for multimedia applications.
Please see the attached document, which is the 6th draft of
this particular CKRM design. If you have any improvements
for this spec, feel free to let us know ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
[-- Attachment #2: CKRM 4 spec v6 --]
[-- Type: TEXT/plain, Size: 36645 bytes --]
CKRM (Class-based Kernel Resource Management)
http://ckrm.sf.net
v4.0 draft 6
25 Feb 2004
Rik van Riel riel@redhat.com
Hubertus Franke, frankeh@watson.ibm.com
Shailabh Nagar nagar@watson.ibm.com
Vivek Kashyap kashyapv@us.ibm.com
Chandra Seetharaman sekharan@us.ibm.com
This is the fourth major revision of the CKRM API and framework. The
first was presented at OLS'03, the second is what's described at
http://ckrm.sf.net as of 18 Feb 2004 and the third was sent out on LKML and
ckrm-tech@lists.sf.net on 31 Jan 2004. The fourth version is
based on a merger of best of breed concepts from a filesystem-based
API proposed by Rik van Riel (with inputs from Stephen Tweedie) on the
ckrm-tech mailing list on 5 Feb 2004 and version three of this document.
The project is not committed to a particular API or architecture and
welcomes discussions/comments on the proposal below. Please send feedback
to any of the authors and cc: ckrm-tech@lists.sf.net.
The latest version of the document will be kept at
http://ckrm.sf.net
1.0 Overview
------------
Class-based Kernel Resource Management (CKRM) is a set of modifications to the
Linux kernel to enable improved systems management. The key idea in CKRM is to
control and monitor system resource usage through user-defined groups of tasks
called classes. Classes can be defined to distinguish between applications,
workloads and users in their usage of any system resource such as CPU ticks,
physical page frames, disk I/O bandwidth, accept queue connections, number of
open file handles etc.
    Sysadmin/resource management application        Users
                 syscalls / rcfs
                      |  ^
                      |  |
    Userspace         V  V
    ---------------------------------------------------------
    Kernel                |
                 Core ---- Classification Engine (CE)
                  |
                  |
                  +------+------+--------+--------+----....
                  |      |      |        |        |
                 RC1    RC2    RC3      RC4      RC5  ....    RC=Resource Controllers
                 CPU    MEM    I/O   AcceptQ   Open file descriptors, shmsegs etc.
The CKRM components are as shown above:
a) Core
Kernel patch that has three key roles. First, it defines the user API
consisting of system calls and a filesystem called rcfs (resource control file
system). Second, it defines the APIs between itself and all other kernel
components namely the resource controllers and the optional classification
engine. Thus the Core acts as the switchboard for users, resource controllers
and classification engines to interact. Finally, Core handles the creation and
management of classes, which are described below.
b) Resource controller (RC)
A kernel patch providing differentiated access to some resource. There can be
multiple RCs defined simultaneously. The CKRM project currently provides CPU
(ticks), Mem (physical page frames), I/O (disk bandwidth) and Inbound Network
(connections) controllers with extensions planned to manage virtual resources
like open file descriptors, shared memory segments etc.
c) Classification Engine (CE)
An optional kernel module that assists in automatic classification of tasks
into classes. A CE will provide a function that returns the class to which a
task "should" belong. The classification function is called by the Core at
significant kernel events (where a task's class might be assigned or expected
to change) and appropriate action taken.
CE's are completely optional. Tasks in the system can also be manually
associated with classes. If a CE exists, it must adhere to the Core-CE
kernel API.
CKRM will provide a rule-based classification engine (RBCE) that performs
classification by evaluating a set of rules entered by a privileged user.
2.0 Task Classes
--------------------
A task class, commonly abbreviated to just class, is a collection of tasks with
an associated set of shares and usage statistics for each managed resource in
the system. A managed resource (such as CPU, physical memory and disk I/O) is
one whose scheduler or controller is class-aware.
Each class is associated with a lower and upper bound of resource
usage, called guarantees and limits, for each managed resource in the
system. Guarantees and limits depend on the type of resource and are
described in detail in Section 2.5. The two are collectively called a
share where the distinction is unimportant.
Each task in the system always belongs to some task class. The main
principle behind CKRM is that a task's consumption of managed
resources is controlled (using shares) and monitored (available as a
class' statistics) primarily through the task's class.
The association of tasks to classes is dynamic. A task's classification can be
changed either manually, as described in Section 2.7, or automatically using a
classification engine as described in Section 4.0.
The task class is used to manage system wide resources. An alternative type of
class is wholly contained within a task or kernel abstraction (see section
6.0). A socket's accept queue is one such resource that can be used to control
the incoming connection requests. Unless otherwise specified, a class will
mean a task class.
2.1 Hierarchy of classes
------------------------
Classes are primarily used by sysadmins to monitor and control the
resources used by various workloads running on a system
e.g. mailserver, apache, dns etc. It can also be used to limit and track
resource usage of users.
In both these cases, it is useful to allow the application or user to manage
its own allocation without having systemwide privileges. To enable this, CKRM
classes can be hierarchically subdivided into subclasses.
Each subclass gets its own share of the parent class' allocation. The
user or workload associated with the parent class can change subclass
shares, monitor subclass usage, effectively managing their allocation
independent of each other and the system administrator.
The depth of the class hierarchy supported by CKRM is configurable. The default
depth is expected to be three though resource controllers will/should be
written to support a reasonably larger number. The maximum depth value allowed
will equal the minimum of the depths supported by any registered
resource controller i.e. if the registered CPU controller can only support a
depth of 2, the user can only configure the depth to be 1 or 2 even if all
other registered controllers can support a depth of 3.
A depth greater than the default may incur significant performance penalties
and/or relaxation of the time granularity at which shares get enforced. CKRM
will provide helper functions to help resource controllers traverse the class
hierarchy for control and statistics updates.
2.2 rcfs API overview
---------------------
CKRM's primary user API is the resource control file system (rcfs). A
filesystem is a natural interface for representing and managing the hierarchy
of classes offering several advantages:
- direct representation of class hierarchy in the filesystem's tree structure
- no new system calls needed; standard unix filesystem system calls are
used instead
- intuitive mapping of filesystem/file operations such as mkdir/rmdir,
read/write and chmod etc. onto creation/deletion of classes, getting/setting
class shares, controlling permissions etc.
- easy tracking and control of permissions at any level of the hierarchy,
with read, write and access rights on a user, group and other basis.
The root of rcfs, typically mounted as /rcfs, represents the whole system.
Each class is represented by a directory containing the following
a) "magic" files : The following are proposed
|-- target : placeholder to identify class when moving tasks to it
|-- shares : contains guarantees and limits of this class for each
managed resource
|-- stats : contains usage by this class of each managed resource
|-- config : config file for changing/reading configuration parameters
for rcfs as well as resource controllers
b) |-- members/ : Subdirectory containing symlinks to the /proc entries for
tasks belonging to the class, but not to any of its subclasses i.e. tasks
belonging to a class' default subclass as defined below.
c) Subdirectories for each subclass: each subdirectory is similar to the
parent.
rcfs configuration parameters can be set in two ways.
a) As a mount parameter for the rcfs filesystem e.g.
mount -t rcfs -o maxdepth=2 rcfs /rcfs
mount -o remount,maxdepth=2 /rcfs
b) Writing to /rcfs/config dynamically
echo "rcfs:<parameter name>:<parameter value>" > /rcfs/config
e.g. echo "rcfs:mode:manage" > /rcfs/config
The current and available config parameters can be read from /rcfs/config in
the same format as they get written.
/rcfs/config can also be used to configure resource controllers as described later.
2.3 Default subclass
--------------------
Each class can contain tasks that do not belong to any of its subclasses. To
regulate and monitor such tasks, CKRM core implicitly defines a default
subclass for each class.
e.g. if class_A contains tasks t1, t2 and t3 and defines a subclass class_A1
which contains t1, then t2 & t3 belong to class_A's default subclass.
The default subclass of the root of rcfs (/rcfs) is significant because it
is present at kernel bootup, being statically defined by the Core patch,
and contains, at least initially, /sbin/init. Having such a systemwide default
class allows CKRM to ensure that every task in the system always belongs to
some class.
Default classes are not explicitly represented by a separate subdirectory at
any level of the hierarchy. Thus /rcfs/class_A/target represents both class_A and its
default subclass (moving a task to class_A implicitly moves it to the class_A's
default subclass). Default classes do have their own share and usage statistics
which are listed separately in the class' magic files 'shares' and 'stats' respectively.
User level tools can be written to display default class data explicitly.
2.4 Class creation/deletion
---------------------------
Classes are created using mkdir/rmdir at the appropriate level of the rcfs
tree. The created directory is automatically populated with the magic files and
/members subdirectory. A class is always created empty and gets populated when
tasks get manually or automatically classified to it. Removing a class is
only allowed if it has no associated tasks or subclasses in it.
2.5 Guarantees and limits
-------------------------
A guarantee is the minimum amount of resource that tasks of a class will get if
they request it. Unused portions of the guarantee can be redistributed (work
conservation) by the corresponding resource controller with a reasonably timely
reallocation back to the class, should its demand rise later.
A limit is the maximum amount of resource that a class can use. They can be
either hard or soft, depending on the capability and semantics of the resource
e.g. open file descriptor limits are always hard whereas a limit on the number
of page frames given to a class could be configured to be either hard or soft.
Classes can never consume more resources than a hard limit, regardless of the
usage by other classes. Exceeding a soft limit is permitted if the resource is
sufficiently free. Resources granted over the soft limit can be reallocated by
the resource controller to other classes which increase their demand (while
remaining under their limit). In other words, the priority of resource
allocation amongst classes is as follows
highest: classes with demand < guarantee
classes with hard/soft limit > demand > guarantee
lowest: classes with demand > soft limit
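The ordering above can be mirrored in a small helper; a sketch only (the function and field names are illustrative, not part of the spec):

```python
def allocation_priority(demand, guarantee, soft_limit):
    """Map a class's current demand onto the allocation bands above.

    Returns 0 for the highest-priority band (demand under guarantee),
    1 for the middle band (between guarantee and the limit),
    and 2 for the lowest (demand over the soft limit).
    """
    if demand < guarantee:
        return 0
    if demand <= soft_limit:
        return 1
    return 2

# Classes sorted for allocation: under-guarantee classes come first.
# Tuples are (name, demand, guarantee, soft_limit).
classes = [("web", 70, 50, 80), ("db", 30, 40, 60)]
ordered = sorted(classes, key=lambda c: allocation_priority(c[1], c[2], c[3]))
```

A real controller would apply this ordering per resource when redistributing slack, rather than sorting classes globally.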
Guarantees and limits are represented by whole numbers in the
/path/to/<class>/shares magic file. The calculation of a class's share is
done as follows. Each line of the /path/to/<class>/shares file represents
one managed resource and its share values in the format
<resource name> <my_guarantee> <my_limit> <tot_guarantee> <tot_limit>
A class's guarantee/limit = class' <my_*> / its parent's <tot_*>
(both my_* and tot_* values corresponding to the same resource and
* stands for 'guarantee' or 'limit')
e.g. assuming there's only one entry (say cpu) in the shares file, a class
hierarchy with the following values
/A cpu 35 60 80 90
/a1 cpu 20 30 20 30
/a2 cpu 5 25 5 25
will result in the following shares
a1 guarantee = 20/80, limit = 30/90
a2 guarantee = 5/80, limit = 25/90
A's default class' guarantee = (80-20-5=55)/80, limit = (90-30-25=35)/90
To derive A's share (with respect to its peer classes), a calculation similar
to the one done for a1, a2 can be done, using A's parent's <tot_*> values.
Note that the <my_*> values in /rcfs/shares have no significance (since there
are no parent <tot_*> values against which they can be interpreted). Similarly,
the <tot_*> values of a class at the leaf of the class hierarchy (i.e. one
which has no children) do not have any significance. The default subclass of
such a leaf class is guaranteed 100% (and has a 100% limit) as long as no
other subclasses get defined.
Setting new resource shares is done through
echo "<resource name>:[my_guarantee|my_limit....tot_limit]:<new value>" \
> /path/to/<class>/shares
e.g. echo "cpu:tot_limit:200" > /path/to/A/shares
changes A's total cpu limit in the example above to 200 (from 90)
and results in changing the shares of a1, a2 and A's default classes.
while echo "cpu:my_guarantee:45" > /path/to/A/a1/limits
changes a1's guarantee to 45/80 (from 20/80) and reduces the guarantee
of A's default class to (80-45-5=30)/80.
A user can determine the shares of a class by reading the
/path/to/<class>/shares file and parsing its contents as explained
above. Default subclass shares at any level can be calculated by summing the
shares listed in each of the visible subclasses' shares files and subtracting
the sum from the parent's tot_* value. Userspace tools can be written to
assist with all these calculations.
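The arithmetic in the worked example above can be checked mechanically with exact fractions; a small sketch (the numbers come from the example, the helper name is made up):

```python
from fractions import Fraction

def child_share(my_value, parent_total):
    # A child's effective fraction is its my_* value over the
    # parent's corresponding tot_* value.
    return Fraction(my_value, parent_total)

# /A    cpu 35 60 80 90   (my_guarantee my_limit tot_guarantee tot_limit)
# /A/a1 cpu 20 30 20 30
# /A/a2 cpu  5 25  5 25
tot_g, tot_l = 80, 90

a1_guarantee = child_share(20, tot_g)   # 20/80
a1_limit     = child_share(30, tot_l)   # 30/90
a2_guarantee = child_share(5, tot_g)    #  5/80
a2_limit     = child_share(25, tot_l)   # 25/90

# A's default subclass gets whatever the explicit children leave over.
default_guarantee = child_share(tot_g - 20 - 5, tot_g)    # 55/80
default_limit     = child_share(tot_l - 30 - 25, tot_l)   # 35/90
```

Using exact fractions avoids any ambiguity about rounding when a tot_* value is changed and all child fractions shift at once.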
Note: the reason for choosing the scheme above is to allow absolute values to
be specified while retaining the flexibility of changing all subclass shares
without requiring an atomic update to all their values. Another option
considered and abandoned was to specify relative shares only (where the tot_*
values would not be explicitly stated/modifiable but would be calculated by
summing the my_* values of all children).
<resource name> identifies the resource controller. Changes to share values
through writes and requests for shares through reads get passed on to each
affected resource controller by the Core. As usual, unix file permissions take
care of access control to the shares file.
Each shares file need not contain entries for all managed resources. If a
resource's share is unspecified, the class's tasks are deemed to belong to the
parent's default class.
2.6 Gathering usage statistics
------------------------------
Statistics on resource use can be gathered from the 'stats' file in each class
directory. To make portability and scripting easier the data is in plain text.
Each stats file will have two lines for each managed resource in some format
like
<resource name> total
#1 #1_unit
#2 #2_unit
#3 #3_unit
:
#n #n_unit
<resource name> local
#1 #1_unit
#2 #2_unit
#3 #3_unit
:
#n #n_unit
where
#n : value(number) of n'th statistic exported for <resource name> by its
controller e.g. value of avg time using the resource, value of avg. delay
in waiting for access to the resource
#n_unit : units of n'th statistic in plain text e.g. ticks, ms, us, pages etc.
The statistics listed under "<resource name> local" are the values of the
resource consumption by the class's default subclass alone. It is expected to
be updated frequently by the controller.
On the other hand, the statistics listed under "<resource name> total"
correspond to the sum of statistics for the default subclass and all other
subclasses. It thus represents the total consumption for the parent
class. These values only get updated lazily at a frequency decided by each
controller individually. To get the current accurate value for total usage at a class
level, userspace tools, similar to top(1), will be provided to add up
(non-atomically) the local values of the children and parent's default subclass.
Each managed resource always has an entry in the ./stats file. Thus the file
can be used to discover the managed resources in a system.
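A userspace tool reading that layout might parse it as follows; a sketch against the proposed draft format only (no shipped kernel exposes this file):

```python
def parse_stats(text):
    """Parse the proposed 'stats' layout into
    {resource: {"total": [(value, unit), ...], "local": [...]}}."""
    stats, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split()
        if parts[-1] in ("total", "local"):
            # Section header: "<resource name> total" or "... local".
            name, kind = " ".join(parts[:-1]), parts[-1]
            current = stats.setdefault(name, {}).setdefault(kind, [])
        else:
            # Statistic entry: "#n #n_unit".
            current.append((float(parts[0]), parts[1]))
    return stats

sample = """cpu total
120 ticks
5 ms
cpu local
80 ticks
2 ms
"""
parsed = parse_stats(sample)
```

Because every managed resource always appears in the file, iterating over the returned dictionary's keys also serves as resource discovery.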
2.7 Changing a task's class
---------------------------
A task can be classified into a class by writing its pid into the target
class' target file as follows
echo "<pid>" > /path/to/class/target
The pid is a positive number for a normal task ID, a negative number for a
process group, similar to the sending of signals. A zero value represents the
calling task's pid.
The unix file permissions on the magic target file in the chosen class
determine whether or not the process is allowed to change into a certain task
class. The chmod(2) system call can be used to change the permissions on who
can join a task class.
A special 'target' file is used instead of the class' directory so that
permissions to join a task class can be configured separate from permissions to
query statistics about the class.
Manual reclassification for socket accept queue control uses a parameter
different from pid as explained in Section 5.
A task can also get classified automatically if an optional Classification
Engine is present, as described in Section 4.
2.8 Monitor Mode
----------------
CKRM will support two modes of operation: monitor and manage. Manage mode is
what has been described so far with each class having guarantees and limits.
In monitor mode, a task's class is used only to track its resource usage and
not to control its resource allocation. For allocation purposes, all tasks are
considered to be in the systemwide default class and ideally, get resources
allocated just as they would in a kernel without CKRM.
Usage statistics still get collected and reported per-class as in the manage
mode. CKRM will continue to use lazy updates of the "total" statistics at each
level of the class hierarchy, as described in Section 2.6. Despite this, a
CKRM-enabled kernel in monitor mode may still incur a small performance penalty
compared to one in which CKRM is disabled.
The mode of CKRM can be set by writing "rcfs:mode:manage" or "rcfs:mode:monitor"
into /rcfs/config.
3.0 Resource Controllers
------------------------
Resource controllers are the kernel code that enforce the class-based control
and supply the class-based statistics. They are typically implemented as
patches to the existing controllers (or schedulers) with two primary design
objectives
- minimize impact on users not interested in class-based control
- respect shares of class hierarchy as far as possible while keeping
code complexity and performance overheads low.
CKRM currently provides resource controllers for the primary physical resources
such as CPU (ticks), physical memory (page frames), block I/O (per-device
bandwidth) and inbound network (socket accept queues). Additional controllers
for virtual resources, such as open files and shared memory segments, are also
being considered for the future.
Resource controllers need to register each managed resource separately. It is
possible for one patch/module to regulate several related resources, e.g. a
single controller could provide class-based control over both open files and
shared memory segments, but it would still need to register each resource
separately.
Typically resource controllers will have private objects for each CKRM
class. The Core patch provides data structures to associate this private
data with the class objects it creates. When classes get modified (creation,
deletion, tasks moving in and out, share changes, request for usage
statistics), the Core invokes appropriate callbacks for each of the registered
resources to do the necessary changes and return data for that resource. These
callbacks and their related functions form a Core->Resource Controller API that
is internal to the kernel.
The CKRM Core will provide helper functions for resource controllers to
traverse the class hierarchy to update statistics lazily, calculate share
values etc. The Core->RC API will be described in more detail after the User
API and high level design is finalized. http://ckrm.sf.net/ provides some idea
of what the Core->RC API will look like.
3.1 Resource controller configuration
-------------------------------------
Resource controllers can be configured by writing resource specific config
parameters to /rcfs/config in the format
echo "<resource name>:<param name>:<param value>" > /rcfs/config
e.g. echo "cpu:active:1" > /rcfs/config
The name and semantics of config parameters supported are resource specific and
opaque to CKRM Core. Reading /rcfs/config will list all configurable parameters
(for rcfs as a whole and each managed resource) in the same format as they are
written.
Configuration of resource controllers can be done at any level of the /rcfs
hierarchy by writing a resource-specific configuration parameter to
/path/to/class/config e.g.
echo "cpu:timeslice:15" > /rcfs/class_A/config
As before, the syntax and semantics of configuration parameters are determined
by the controller implementation. Controllers are always free to ignore
unsupported parameters.
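Since the parameter name and value are opaque to the Core, parsing the written string is trivial; a hypothetical sketch (splitting at most twice, in case the value itself contains ':'):

```python
def parse_config_write(line):
    """Split a '<resource name>:<param name>:<param value>' write into
    its parts, leaving the value opaque for the controller to interpret."""
    resource, param, value = line.strip().split(":", 2)
    return resource, param, value
```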
4.0 Classification Engine (CE)
------------------------------
As described briefly in Section 1, a Classification Engine is an optional
kernel module that can automatically reclassify tasks after significant kernel
events (called reclassification events) such as exec, setuid and setgid. Fork is also
a significant event with a CE callback, though it is expected that the child
will inherit its parent's class rather than be reclassified immediately.
A CE module has to register with the kernel at which time it exports a table of
callback functions, potentially one for each reclassification event. CKRM's
Core patch hooks reclassification event points in the kernel. If a CE is
present, it is queried for the class to which the task should belong after the
event. The Core then moves the task to the appropriate class. The CE callback
is also invoked so that CE can update any state it maintains as part of its
classification logic.
At any time, only one CE can be registered with the kernel. CEs can move
between active and inactive states after registration to allow lightweight,
temporary disabling of automatic classification.
The user interface to the CE will also be through the /rcfs filesystem. When a
CE module registers, Core creates the /rcfs/ce directory. Core provides an
interface to the CE for the latter to create magic files under /rcfs/ce.
During registration, the CE also provides callbacks for
create/delete/read/write operations to files under /rcfs/ce. This enables the
Core and /rcfs interfaces to handle /rcfs/ce files opaquely.
/rcfs/ce is used to dynamically configure the CE including changes to the logic
it implements for reclassifying tasks. The configuration files and operations
for a typical CE such as RBCE are listed in Section 4.3.
Note: Instead of Core creating its own set of hooks into sys_fork/sys_exec etc.
the CKRM project is considering using the security_* hooks of Linux Security
Modules (LSM), that are already in the mainline 2.6 kernel. Some issues with
LSM such as stackability of its modules and availability of all necessary hooks
are under investigation.
4.1 Manual reclassification with an active CE
---------------------------------------------
When a CE has registered and is active, tasks get automatically reclassified.
However, a sysadmin/user can override the CE by manually moving a task to a
different class under /rcfs.
Following such a manual reclassification, the task is deemed to be outside the
CE's control. Future reclassification events affecting the task will not result
in it being reclassified according to the CE's logic. The implementation of the
per-task override could be done by either marking the task so that Core does
not call the CE's reclassifier or by creating a specially tagged rule in the CE
which returns the null class (so the CE's reclassifier is called but does not
return a valid class).
A task can be put back under CE's control by writing its pid to the
/rcfs/ce/reclassify magic file. This causes the task to get reclassified using the CE's
logic immediately as well as at all future reclassification events.
4.2 Application Tags
--------------------
The CKRM Core adds a 'tag' field to task_struct to enable applications to
assist the CE's in reclassification.
To set/get a task's tag field, two new system calls are introduced by CKRM:
int sys_tsk_settag ( int pid, int len, void *tag )
int sys_tsk_gettag ( int pid, int len, void *tag )
As before, the pid is a positive number for a normal task ID, a negative number for a
process group and zero represents the calling task's pid.
Setting a task's tag is particularly useful when relatively trusted server
applications, such as databases or webservers, run on a system doing work on
behalf of multiple classes. By setting their tag values, such applications
can tell the CE what work they are doing and the CE can map the tag to the
appropriate class.
There is some risk in allowing applications to set their own tags, since this
could result in their being classified into classes with more resource shares.
Hence a task needs the appropriate capability to set its own tag. If the application
cannot be trusted, a trusted userlevel agent could set its tag after
performing additional verification.
In all cases, the tag value is opaque to CKRM core and resource controllers.
4.3 Rule Based Classification Engine (RBCE)
-------------------------------------------
The CKRM project will provide a general purpose CE called the Rule-Based
Classification Engine (RBCE). RBCE evaluates an ordered list of rules provided
by the user to classify tasks. Each rule is a logical AND of rule terms plus a
target class. Each rule term is an attribute-value pair where the attribute can
be one of several relevant members of the task_struct, including the newly
introduced tag.
RBCE defines the following files under /rcfs/ce:
1. Magic files
|--info - read only file detailing how to setup and use RBCE.
|--reclassify - contains nothing. Writing a pid to it reclassifies
the given task according to the current set of rules. Writing "all"
to it reclassifies all tasks in the system. This is typically done
by the user/sysadmin after changing/creating rules.
|--state - determines whether RBCE is currently active or inactive.
Writing 1 (0) activates (deactivates) the CE. Reading the file
returns the current state.
2. Rules subdirectory: Each rule of the RBCE is represented by a file in
/rcfs/ce/rules. The sysadmin writes lines to the file in one of the
following formats to define or modify a rule:
<*id> <OP> number where <OP>={>,<,=}
<*id> = {uid,euid,gid,egid}
cmd = "string" // basename of the command
pathname = "string" // full pathname of the command
args = "string" // argv[1] - argv[argc] of command
apptag = "string" // application tag of the task
[+,-]depend = rule_filename_1, rule_filename_2...
// used to chain a rule's terms with existing rules
// to avoid respecifying the latter's rule terms.
// A rule's dependent rules are evaluated before
// its rule terms get evaluated.
// An optional + or - can precede the depend keyword.
// +depend adds a dependent rule to the tail of the
// current chain, -depend removes an existing
// dependent rule
// a rulefile shows depend = none if the rule has
// no dependencies
order = number // order in which this rule is executed relative to
// other independent rules.
// rule with order 1 is checked first and so on.
// As soon as a rule matches, the class of that rule
// is returned to Core. So, order really matters.
// If no order is specified by the user, the next
// highest available order number is assigned to
// the rule.
class = "/rcfs/.../classname" // target class of this rule.
// /rcfs all by itself indicates the
// systemwide default class
state = "number" // 1 or 0, provides the ability to deactivate a
// specific rule, if needed.
ipv4 = "string" // ipv4 address in dotted decimal and port
// e.g. "127.0.0.1\80"
// e.g. "*\80" for CE to match any address
// used in socket accept queue classes (see Section 5)
ipv6 = "string" // ipv6 address in hex and port
// e.g. "fe80::4567\80"
// e.g. "*\80" for CE to match any address
// used in socket accept queue classes (see Section 5)
Note: Instead of one file per rule, RBCE could use a single file to list all
rules, one per line. Given the potentially large number of terms a rule could
have, the file per rule approach may be cleaner.
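A toy model of the first-match evaluation described above (invented representation, not the RBCE implementation; rule terms are shown as predicates, and the depend mechanism is omitted):

```python
# Rules are checked in ascending 'order'; a rule matches when all of its
# terms hold, and its target class is returned immediately, so order
# really matters. A rule with state 0 is deactivated and skipped.

def classify(task, rules):
    for rule in sorted(rules, key=lambda r: r["order"]):
        if not rule.get("state", 1):
            continue                  # rule has been deactivated
        if all(term(task) for term in rule["terms"]):
            return rule["class"]
    return "/rcfs"                    # no match: systemwide default class
```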
5.0 Socket accept queue classes
-------------------------------
There are two types of resources controlled using CKRM. The first type is
system-wide resources (such as CPU time) that are apportioned into classes;
every task in the system is assigned to a class.
The other form of classification is for resources that are controlled entirely
within the context of a process or a system object, such as a socket. In a
typical setup, such as a webserver, connection requests are received from
thousands of clients. We would like to distinguish among the requests based on
the importance of the client to the server. Clients are therefore assigned to
different classes, such as gold/silver/bronze. A client belonging to the gold
class will have its requests honoured at a higher rate (proportional to the
class's share) than one belonging to the silver or bronze class; e.g. the gold
class might be the paying customers while bronze includes the 'window shoppers'.
In this section we focus on the inbound network connection control i.e. TCP
connection requests queued in the socket's accept queue.
Therefore, there are two levels of classes:
1. one comprising the listening address(es) and associated port
2. the peer's classification (done on the contents of the
TCP SYN packet header using iptables)
Every listening socket is assigned multiple accept queues - one per
accept queue class. The connection requests are appended to these queues
depending on the class. The accept() call picks the requests from the
accept queues (classes) in accordance to the shares assigned to them.
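The share-proportional draining of per-class accept queues can be approximated with a toy picker (plain Python, not the controller's actual algorithm): each call serves the non-empty class that is furthest behind its entitlement, which over time honours requests in proportion to shares.

```python
def pick_queue(queues, shares, served):
    """queues: class -> list of pending requests;
    shares: class -> share value; served: class -> requests served so far.
    Returns the next request to accept, or None if all queues are empty."""
    candidates = [c for c in queues if queues[c]]
    if not candidates:
        return None
    # serve the class with the lowest served/share ratio
    best = min(candidates, key=lambda c: served[c] / shares[c])
    served[best] += 1
    return queues[best].pop(0)
```

With shares of 3:1 between a gold and a bronze class, four consecutive picks serve gold three times and bronze once.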
5.1 /rcfs/network file hierarchy
--------------------------------
The 'listening' classes are listed in /rcfs/network/socket_aq. The
'peer-classification' is controlled by "accept queue" sub-classes.
The info file specifies the default accept queue class and the number of accept
queue classes. This file is initialised by the resource controller. In the
example below the accept queues are numbered 0-7. The incoming requests
can be MARKed 0-7 using iptables rules. The members directory lists the
ip_address/port pairs that fall under this class. There is also a symlink to
the owning task's pid.
The rest of the files such as target/shares etc. are the same for the network
classes as for any other class. The internal subclasses 0-7 do not honour moves
to target nor do they list any entries under /members.
Subclasses cannot be created below the accept_queue classes (e.g. 0-7). Nor can
one create more accept queue classes than are supported (the range is provided
in 'info' file).
/rcfs/network/sockets_aq
|-- file info:
- accept_q class names: 0 - 7
- default accept_q class: 0
|-- target
|-- stats: usage statistics
/Class_A
|-- target
|-- shares: 1000 1000 1000 1000
|-- stats: usage statistics
|-- /members
/<ip_address>\<port>
/proc/<pid>
/1
|-- shares: 500 500
|-- stats: usage statistics
/2
|-- shares: 400 500
|-- stats: usage statistics
/Class_B
|-- target
|-- shares: 1000 1000 1000 1000
|-- stats: usage statistics
|-- /members
/<ip_address>\<port>
/proc/<pid>
/1
|-- shares: 200 500
|-- stats: usage statistics
/2
|-- shares: 400 500
|-- stats: usage statistics
/3
|-- shares: 400 500
|-- stats: usage statistics
A change in the shares of a particular accept queue class (0-7) will cause the
resource controller to modify the accept queue shares in the sockets associated
with all the ipaddr\port members of the network class.
As with the process classes, the user/admin can move a listening socket to
the desired class by executing the following command:
echo "<ip_address\port>" > /path/to/class/target
6.0 Example Control flow using CKRM
-----------------------------------
A typical usage of a CKRM-enabled system is given below:
- Core is active as soon as the kernel is booted.
- resource controller 1 registers
- resource controller 2 registers
:
:
- No task is classified, resource controllers handle tasks in default mode
:
:
- User defines multiple task classes
- User sets shares for each of the resource classes defined
- User manually moves some tasks to some of the newly defined classes and
these tasks get regulated according to the new shares.
:
:
- Classification engine registers
:
:
- User tells CE how to associate tasks to previously defined classes (in RBCE,
this is done by specifying rules and policies)
- On significant kernel events (exec/setuid etc.), the affected task gets
reclassified automatically according to the CE's rules
- Tasks are allocated resources based on the shares of the
task class to which they belong
:
:
:
:
- User specifies iptable rules to MARK incoming TCP connections
- task calls listen() on a socket (thus specifying an ipaddress/port)
- CE returns the network/socket_aq class for the socket
- Core modifies the shares associated with the socket's acceptq classes
- incoming connections get MARKed by netfilter and are delivered to the
listening task in proportion to their acceptq class share
:
:
- User gets resource usage of different classes (task and network)
- User manually moves some tasks to different classes
:
- User moves all tasks out of a class and then deletes it
:
:
-- End --
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:18 ` Peter Williams
2004-02-26 2:42 ` Mike Fedyk
@ 2004-02-26 16:08 ` Timothy Miller
2004-02-26 16:51 ` Rik van Riel
2004-02-26 20:15 ` Peter Williams
2004-03-04 21:18 ` Robert White
2 siblings, 2 replies; 66+ messages in thread
From: Timothy Miller @ 2004-02-26 16:08 UTC (permalink / raw)
To: Peter Williams; +Cc: linux-kernel
Peter Williams wrote:
> Timothy Miller wrote:
> > <snip>
>
>> In fact, that may be the only "flaw" in your design. It sounds like
>> your scheduler does an excellent job at fairness with very low
>> overhead. The only problem with it is that it doesn't determine
>> priority dynamically.
>
>
> This (i.e. automatic renicing of specified programs) is a good idea but
> is not really a function that should be undertaken by the scheduler
> itself. Two possible solutions spring to mind:
>
> 1. modify the do_execve() in fs/exec.c to renice tasks when they execute
> specified binaries
We don't want user-space programs to have control over priority. This
is DoS waiting to happen.
> 2. have a user space daemon poll running tasks periodically and renice
> them if they are running specified binaries
This is much too specific. Again, if the USER has control over this
list, then it's potential DoS. And if the user adds a program which
should qualify but which is not in the list, the program will not get
its deserved boost.
And a sysadmin is not going to want to update 200 lab computers just so
one user can get their program to run properly.
>
> Both of these solutions have their advantages and disadvantages, are
> (obviously) complicated than I've made them sound and would require a
> great deal of care to be taken during their implementation. However, I
> think that they are both doable. My personal preference would be for
> the in kernel solution on the grounds of efficiency.
They are doable, but they are not a general solution.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:42 ` Mike Fedyk
2004-02-26 4:10 ` Peter Williams
@ 2004-02-26 16:10 ` Timothy Miller
2004-02-26 19:47 ` Mike Fedyk
2004-02-26 22:51 ` Peter Williams
1 sibling, 2 replies; 66+ messages in thread
From: Timothy Miller @ 2004-02-26 16:10 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Peter Williams, linux-kernel
How about this:
The kernel tracks CPU usage, time slice expiration, and numerous other
statistics, and exports them to userspace through /proc or somesuch.
Then a user-space daemon adjusts priority. This could work, but it
would be sluggish in adjusting priorities.
I still like Nick and Con's solutions better.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:48 ` Nuno Silva
2004-02-26 4:25 ` Peter Williams
@ 2004-02-26 16:12 ` Timothy Miller
1 sibling, 0 replies; 66+ messages in thread
From: Timothy Miller @ 2004-02-26 16:12 UTC (permalink / raw)
To: Nuno Silva; +Cc: John Lee, linux-kernel
Nuno Silva wrote:
>
>
>
> And the user *can* do something about it, just nice -n 19 the builds and
> left xmms alone. (Or you can use other player... :-)
You're assuming that the person using xmms is also the user doing the
build. I am not making the same assumption.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 16:08 ` Timothy Miller
@ 2004-02-26 16:51 ` Rik van Riel
2004-02-26 20:15 ` Peter Williams
1 sibling, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2004-02-26 16:51 UTC (permalink / raw)
To: Timothy Miller; +Cc: Peter Williams, linux-kernel
On Thu, 26 Feb 2004, Timothy Miller wrote:
> We don't want user-space programs to have control over priority. This
> is DoS waiting to happen.
Nope, there's a solution to this. Please read the CKRM
design draft I just posted elsewhere in this thread ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 6:19 ` Peter Williams
@ 2004-02-26 17:57 ` Albert Cahalan
2004-02-26 23:24 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Albert Cahalan @ 2004-02-26 17:57 UTC (permalink / raw)
To: Peter Williams; +Cc: linux-kernel mailing list, johnl
On Thu, 2004-02-26 at 01:19, Peter Williams wrote:
> Albert Cahalan wrote:
>> John Lee writes:
>>> The usage rates for each task are estimated using Kalman
>>> filter techniques, the estimates being similar to those
>>> obtained by taking a running average over twice the filter
>>> _response half life_ (see below). However, Kalman filter
>>> values are cheaper to compute and don't require the
>>> maintenance of historical usage data.
>>
>>
>> Linux dearly needs this. Please separate out this part
>> of the patch and send it in.
>
> This information can be determined from the SleepAVG: field in the
> /proc/<pid>/status and /proc/<tgid>/task/<pid>/status files by
> subtracting the value there from 100.
This doesn't seem to be the case. For example, a fork()
causes the value to be adjusted in both child and parent.
Also, perhaps the name is wrong, but I'd think SleepAVG
has more to do with the average length of a sleep. It sure
isn't documented. (time constant? type of decay?)
There's also a need for whole-process stats and cumulative
(sum of exited children) stats. %CPU can go as high as 51200%.
> Without our patch this value is a
> directly calculated estimated of the task's sleep rate which is
> available because it used by the O(1) scheduler's heuristics. With our
> patches, it is calculated from our estimate of the task's usage because
> we dispensed with the sleep average calculations as they are no longer
> needed. We decided to still report sleep average in the status file
> because we were reluctant to alter the contents of such files in case we
> broke user space programs.
Generally this is a good move, though I don't expect anything
to be using SleepAVG at the moment.
>> Right now, Linux does not report the recent CPU usage
>> of a process. The UNIX standard requires that "ps"
>> report this; right now ps substitutes CPU usage over
>> the whole lifetime of a process.
>>
>> Both per-task and per-process (tid and tgid) numbers
>> are needed. Both percent and permill (1/1000) units
>> get reported, so don't convert to integer percent.
>
> I think a modification to fs/proc/array.c to make this field a per
> million rather than a percent value would satisfy your needs. It would
> be a very small change but there would be concerns about breaking
> programs that rely on it being a percentage.
Nothing can rely on it existing at all, so a name change would
solve the problem of apps getting confused.
BTW, permill is not per-million, it is per-thousand.
Per-million or per-billion would be fine as long as
it doesn't overflow.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 4:19 ` Mike Fedyk
@ 2004-02-26 19:23 ` Shailabh Nagar
2004-02-26 19:46 ` Mike Fedyk
0 siblings, 1 reply; 66+ messages in thread
From: Shailabh Nagar @ 2004-02-26 19:23 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Peter Williams, Timothy Miller, linux-kernel
Mike Fedyk wrote:
> Peter Williams wrote:
>
>> Mike Fedyk wrote:
>>
>>> Peter Williams wrote:
>>>
>>>> 2. have a user space daemon poll running tasks periodically and
>>>> renice them if they are running specified binaries
>>>>
>>>> Both of these solutions have their advantages and disadvantages, are
>>>> (obviously) complicated than I've made them sound and would require
>>>> a great deal of care to be taken during their implementation.
>>>> However, I think that they are both doable. My personal preference
>>>> would be for the in kernel solution on the grounds of efficiency.
The CKRM project (http://ckrm.sf.net) took the kernel approach too, for the
same reason you mentioned - efficiency.
Having a userspace daemon poll and readjust priorities/entitlements is
possible but it will not be able to react as quickly when entitlements
change and will incur overheads for unnecessary polls when they don't.
In CKRM, the corresponding problem being solved is what should be done when
a task moves from one "class" to another with a potentially different
entitlement (we call them guarantees and limits). Since class changes can
happen relatively often (compared to changes to entitlements for ebs), it
was even more imperative to be efficient.
>>>
>>>
>>>
>>>
>>> Better would be to have the kernel tell the daemon whenever a process
>>> in exec-ed, and you have simplicity in the kernel, and policy in user
>>> space.
As it turns out, one can still use a fairly simple in-kernel module which
provides a *mechanism* for effectively changing a process' entitlement
while retaining the policy component in userland.
CKRM has a rule-based classification engine that allows simple rules to be
defined by the user/sysadmin and evaluated by the kernel at various kernel
events. Using this, it's not only possible to catch execs but
setuids/setgids etc. (which are also legitimate points at which a sysadmin
may want a task's entitlement changed). The RBCE we wrote was fairly
efficient though we didn't do specific measurements of the overhead as the
interfaces started changing.
>>
>>
>> Yes. That would be a good solution. Does a mechanism that allows the
>> kernel to notify specific programs about specific events like this exist?
>
I guess there are two subparts to this question:
a) Are there efficient kernel-user communication mechanisms which could be
used to do such event notifications ? Yes, netlink is one, relayfs
(http://www.opersys.com/relayfs) is another.
b) Are the hooks in place at the right points ? Except for the LSM
security* calls, no.
Which leaves you with two options - put your own hooks into do_execve,
either custom or using some general hook interface like KHI
(http://www-124.ibm.com/linux/projects/kernelhooks/)
OR
find a way to stack your function with LSM's calls.
CKRM is looking at both these options since we need hooks in several places
(fork, exec, setuid, setgid....)
> I'm sure DaveM would suggest Netlink, but there are probably several
> implementations for Linux.
>
> I'll let other more knowledgeable people fill in the list.
Disclaimer : no claim to be knowledgeable....:-)
Just sharing some of our experiences going over similar problems in a
slightly broader context i.e. class-based resource management.
-- Shailabh
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 15:57 ` Rik van Riel
@ 2004-02-26 19:28 ` Shailabh Nagar
0 siblings, 0 replies; 66+ messages in thread
From: Shailabh Nagar @ 2004-02-26 19:28 UTC (permalink / raw)
To: Rik van Riel
Cc: Peter Williams, Nuno Silva, Timothy Miller, John Lee,
linux-kernel, ckrm-tech
Rik van Riel wrote:
> On Thu, 26 Feb 2004, Peter Williams wrote:
>
>
>>Another idea that we are playing with for handling programs like xmms
>>(i.e. programs that require gauranteed CPU bandwidth to perform well) is
>>the complement of caps namely per task CPU reservations.
>
>
>>Of course, this won't solve the "need to be root" problem as this is
>>obviously the sort of control that should be reserved for root
>
>
> Not necessarily. We've just fixed this dilemma in the CKRM
> project, using a resource class filesystem for this kind of
> stuff.
>
> A user could have a certain percentage of the CPU guaranteed
> (especially the console user) and carve out part of his/her
> guarantee for multimedia applications.
>
> Please see the attached document, which is the 6th draft of
> this particular CKRM design. If you have any improvements
> for this spec, feel free to let us know ;)
The CKRM API has also been posted separately as an RFC on lkml
today... just in case it's missed deep down in this thread!
-- Shailabh
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 19:23 ` Shailabh Nagar
@ 2004-02-26 19:46 ` Mike Fedyk
2004-02-26 20:42 ` Shailabh Nagar
0 siblings, 1 reply; 66+ messages in thread
From: Mike Fedyk @ 2004-02-26 19:46 UTC (permalink / raw)
To: Shailabh Nagar; +Cc: Peter Williams, Timothy Miller, linux-kernel
Shailabh Nagar wrote:
>>> Mike Fedyk wrote:
>>>> Better would be to have the kernel tell the daemon whenever a
>>>> process in exec-ed, and you have simplicity in the kernel, and
>>>> policy in user space.
>
>
>
> As it turns out, one can still use a fairly simple in-kernel module
> which provides a *mechanism* for effectively changing a process'
> entitlement while retaining the policy component in userland.
How much code could be removed if CKRM triggered a userspace process to
perform the operations required?
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 16:10 ` Timothy Miller
@ 2004-02-26 19:47 ` Mike Fedyk
2004-02-26 22:51 ` Peter Williams
1 sibling, 0 replies; 66+ messages in thread
From: Mike Fedyk @ 2004-02-26 19:47 UTC (permalink / raw)
To: Timothy Miller; +Cc: Peter Williams, linux-kernel
Timothy Miller wrote:
> How about this:
>
> The kernel tracks CPU usage, time slice expiration, and numerous other
> statistics, and exports them to userspace through /proc or somesuch.
> Then a user-space daemon adjusts priority. This could work, but it
> would be sluggish in adjusting priorities.
Userspace shouldn't have to poll, especially if there needs to be low
latency in the interaction.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <1tzQe-10s-25@gated-at.bofh.it>
@ 2004-02-26 20:14 ` Bill Davidsen
0 siblings, 0 replies; 66+ messages in thread
From: Bill Davidsen @ 2004-02-26 20:14 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Linux Kernel Mailing List
Mike Fedyk wrote:
> Shailabh Nagar wrote:
>
>>>> Mike Fedyk wrote:
>>>>
>>>>> Better would be to have the kernel tell the daemon whenever a
>>>>> process in exec-ed, and you have simplicity in the kernel, and
>>>>> policy in user space.
>>
>>
>>
>>
>> As it turns out, one can still use a fairly simple in-kernel module
>> which provides a *mechanism* for effectively changing a process'
>> entitlement while retaining the policy component in userland.
>
>
> How much code could be removed if CKRM triggered a userspace process to
> perform the operations required?
One other interesting question is what would happen if the userspace
program didn't run, died, etc. Or set some ill-behaved other user
program to a higher priority and the other program did a DoS
(intentional or not)?
I don't like the whole idea, but I like it even less with a user program
requiring context switches on scheduling.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 16:08 ` Timothy Miller
2004-02-26 16:51 ` Rik van Riel
@ 2004-02-26 20:15 ` Peter Williams
2004-02-27 14:46 ` Timothy Miller
1 sibling, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 20:15 UTC (permalink / raw)
To: Timothy Miller; +Cc: linux-kernel
Timothy Miller wrote:
>
>
> Peter Williams wrote:
>
>> Timothy Miller wrote:
>> > <snip>
>>
>>> In fact, that may be the only "flaw" in your design. It sounds like
>>> your scheduler does an excellent job at fairness with very low
>>> overhead. The only problem with it is that it doesn't determine
>>> priority dynamically.
>>
>>
>>
>> This (i.e. automatic renicing of specified programs) is a good idea
>> but is not really a function that should be undertaken by the
>> scheduler itself. Two possible solutions spring to mind:
>>
>> 1. modify the do_execve() in fs/exec.c to renice tasks when they
>> execute specified binaries
>
>
> We don't want user-space programs to have control over priority.
They already do, e.g. renice is such a program.
> This
> is DoS waiting to happen.
>
>> 2. have a user space daemon poll running tasks periodically and renice
>> them if they are running specified binaries
>
>
> This is much too specific. Again, if the USER has control over this
> list,
It would obviously be under root control.
> then it's potential DoS. And if the user adds a program which
> should qualify but which is not in the list, the program will not get
> its deserved boost.
>
> And a sysadmin is not going to want to update 200 lab computers just so
> one user can get their program to run properly.
>
>>
>> Both of these solutions have their advantages and disadvantages, are
>> (obviously) more complicated than I've made them sound and would require a
>> great deal of care to be taken during their implementation. However,
>> I think that they are both doable. My personal preference would be
>> for the in kernel solution on the grounds of efficiency.
>
>
> They are doable, but they are not a general solution.
>
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 19:46 ` Mike Fedyk
@ 2004-02-26 20:42 ` Shailabh Nagar
0 siblings, 0 replies; 66+ messages in thread
From: Shailabh Nagar @ 2004-02-26 20:42 UTC (permalink / raw)
To: Mike Fedyk; +Cc: Peter Williams, Timothy Miller, linux-kernel
Mike Fedyk wrote:
> Shailabh Nagar wrote:
>
>>>> Mike Fedyk wrote:
>>>>
>>>>> Better would be to have the kernel tell the daemon whenever a
>>>>> process is exec-ed, and you have simplicity in the kernel, and
>>>>> policy in user space.
>>>>
>>
>>
>>
>> As it turns out, one can still use a fairly simple in-kernel module
>> which provides a *mechanism* for effectively changing a process'
>> entitlement while retaining the policy component in userland.
>
>
> How much code could be removed if CKRM triggered a userspace process
> to perform the operations required?
In CKRM, the code to perform classification is an optional kernel
module. So size isn't really an issue in terms of impact to core kernel
code.
Our prototype version of the classification engine, RBCE, is about 2700
lines without any effort being put into reducing its size etc. If that
were to be completely pared down to only provide events to userspace, it
would come down by quite a bit (can't say exactly how much, but at least
50% is a safe bet).
I think the more important question is performance impact - what do you
give up in terms of efficiency and granularity of control by going to
userspace vs what you gain in reduced kernel pathlength. Empirically, we
found RBCE was quite efficient but no quantitative analysis was done.
-- Shailabh
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 16:10 ` Timothy Miller
2004-02-26 19:47 ` Mike Fedyk
@ 2004-02-26 22:51 ` Peter Williams
2004-02-27 10:06 ` Helge Hafting
1 sibling, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 22:51 UTC (permalink / raw)
To: Timothy Miller; +Cc: Mike Fedyk, linux-kernel
Timothy Miller wrote:
> How about this:
>
> The kernel tracks CPU usage, time slice expiration, and numerous other
> statistics, and exports them to userspace through /proc or somesuch.
> Then a user-space daemon adjusts priority.
Yes, the right statistics could allow these processes to be identified
reasonably accurately. The programs in question would have the
following characteristics:
1. low CPU usage rate, and
2. a very regular pattern of use i.e. the size of each CPU burst would
be approximately constant, as would the size of the intervals between
each burst.
The appropriate statistic to identify the second of these would be
variance or (equivalently but more expensively) standard deviation.
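As a sketch of how cheaply such a variance could be tracked, Welford's online algorithm needs only three words per task and no stored history; the struct and helper names below are illustrative, not part of the patch:

```c
#include <assert.h>

/* Illustrative per-task burst statistics (not from the EBS patch).
 * Welford's online algorithm: O(1) memory, no history kept. */
struct burst_stats {
	long   n;    /* number of bursts seen */
	double mean; /* running mean of burst length */
	double m2;   /* sum of squared deviations from the mean */
};

static void burst_record(struct burst_stats *s, double len)
{
	double delta = len - s->mean;

	s->n += 1;
	s->mean += delta / s->n;
	s->m2 += delta * (len - s->mean);
}

/* Population variance; near zero indicates very regular bursts. */
static double burst_variance(const struct burst_stats *s)
{
	return s->n ? s->m2 / s->n : 0.0;
}
```

A real in-kernel version would use fixed point rather than doubles, of course; the point is only that the bookkeeping is constant time and space per burst.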
Whether this problem is bad/important enough to warrant the extra
overhead of gathering these statistics is a moot point. We had to
generate very high system loads on a single CPU system in order to cause
one or two skips in xmms over a period of a couple of minutes.
It should be noted that these are the type of task characteristics for
which the real time scheduler classes are designed and I think that
someone mentioned that if run with sufficient privileges xmms tries to
make itself SCHED_RR.
> This could work, but it
> would be sluggish in adjusting priorities.
>
> I still like Nick and Con's solutions better.
>
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 17:57 ` Albert Cahalan
@ 2004-02-26 23:24 ` Peter Williams
2004-03-01 3:47 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 23:24 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list, johnl
Albert Cahalan wrote:
> On Thu, 2004-02-26 at 01:19, Peter Williams wrote:
>
>>Albert Cahalan wrote:
>>
>>>John Lee writes:
>
>
>>>>The usage rates for each task are estimated using Kalman
>>>>filter techniques, the estimates being similar to those
>>>>obtained by taking a running average over twice the filter
>>>>_response half life_ (see below). However, Kalman filter
>>>>values are cheaper to compute and don't require the
>>>>maintenance of historical usage data.
>>>
>>>
>>>Linux dearly needs this. Please separate out this part
>>>of the patch and send it in.
>>
>>This information can be determined from the SleepAVG: field in the
>>/proc/<pid>/status and /proc/<tgid>/task/<pid>/status files by
>>subtracting the value there from 100.
>
>
> This doesn't seem to be the case. For example, a fork()
> causes the value to be adjusted in both child and parent.
This would be the case with our patch as well, as we make children
inherit their parent's usage rate to partially reduce the effect of ramp
up in the estimation of the child's CPU usage rate.
>
> Also, perhaps the name is wrong, but I'd think SleepAVG
> has more to do with the average length of a sleep. It sure
> isn't documented. (time constant? type of decay?)
My reading of the code caused me to interpret it as a percentage sleep
rate i.e. a value of 50 means the task is sleeping 50% of the time. And
this has made me realise that without our patches using (100 - SleepAVG)
would not really give you CPU usage rate but would instead give
RUNNABILITY rate (i.e. the proportion of time the task is spending on
the cpu OR on a runqueue waiting for cpu access).
It also makes me realise that the SleepAVG our patch reports is NOT
really a sleep rate; it's a sleep OR waiting-on-a-runqueue rate.
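The distinction can be made concrete with a decayed-average sketch (illustrative fixed-point constants; this shows the general technique, not the patch's actual Kalman filter code): each tick, decay both rates and add a full-scale contribution for whichever condition held.

```c
#include <assert.h>

#define RATE_ONE 1000 /* fixed point: 1000 == 100% of the time */

struct task_rates {
	int cpu_rate;      /* proportion of time actually on the CPU */
	int runnable_rate; /* on the CPU OR waiting on a runqueue */
};

/* Decayed average with alpha = 1/8, i.e. a response half life of
 * roughly five ticks; 'held' is 1 if the condition was true this tick. */
static int rate_update(int rate, int held)
{
	return rate - rate / 8 + (held ? RATE_ONE / 8 : 0);
}

static void rates_tick(struct task_rates *r, int on_cpu, int on_rq)
{
	r->cpu_rate = rate_update(r->cpu_rate, on_cpu);
	r->runnable_rate = rate_update(r->runnable_rate, on_cpu || on_rq);
}
```

A task that spends most of its time queued but not running shows a high runnable_rate and a low cpu_rate, which is exactly the gap between (100 - SleepAVG) and true CPU usage described above.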
>
> There's also a need for whole-process stats and cumulative
> (sum of exited children) stats. %CPU can go as high as 51200%.
>
>
>>Without our patch this value is a
>>directly calculated estimate of the task's sleep rate which is
>>available because it used by the O(1) scheduler's heuristics. With our
>>patches, it is calculated from our estimate of the task's usage because
>>we dispensed with the sleep average calculations as they are no longer
>>needed. We decided to still report sleep average in the status file
>>because we were reluctant to alter the contents of such files in case we
>>broke user space programs.
>
>
> Generally this is a good move, though I don't expect anything
> to be using SleepAVG at the moment.
OK.
>
>
>>>Right now, Linux does not report the recent CPU usage
>>>of a process. The UNIX standard requires that "ps"
>>>report this; right now ps substitutes CPU usage over
>>>the whole lifetime of a process.
>>>
>>>Both per-task and per-process (tid and tgid) numbers
>>>are needed. Both percent and permill (1/1000) units
>>>get reported, so don't convert to integer percent.
>>
>>I think a modification to fs/proc/array.c to make this field a per
>>million rather than a percent value would satisfy your needs. It would
>>be a very small change but there would be concerns about breaking
>>programs that rely on it being a percentage.
>
>
> Nothing can rely on it existing at all, so a name change would
> solve the problem of apps getting confused.
>
> BTW, permill is not per-million, it is per-thousand.
> Per-million or per-billion would be fine as long as
> it doesn't overflow.
OK. Since SleepAVG does not seem to be entrenched in people's
expectations, and because the values calculated from our usage rates
are not really valid (see above), I propose that we change
this field's name to CPUrate and report the CPU usage rate directly in
permill (unless there are violent objections).
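The conversion itself is trivial; a sketch, where the internal fixed-point base is an assumed value (the patch's actual scale may differ):

```c
#include <assert.h>

#define EBS_RATE_BASE (1 << 20) /* assumed internal fixed-point scale */

/* Scale an internal usage-rate estimate to permill (0..1000),
 * rounding to nearest, for a "CPUrate:" style /proc field. */
static unsigned int rate_to_permill(unsigned long long rate)
{
	return (unsigned int)((rate * 1000ULL + EBS_RATE_BASE / 2)
			      / EBS_RATE_BASE);
}
```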
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <403E2929.2080705@tmr.com>
@ 2004-02-27 3:44 ` Rik van Riel
2004-02-28 21:27 ` Bill Davidsen
0 siblings, 1 reply; 66+ messages in thread
From: Rik van Riel @ 2004-02-27 3:44 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Kernel Mailing List
On Thu, 26 Feb 2004, Bill Davidsen wrote:
> I disagree.
> It would be nice to have the scheduler identify processes which
> interface to user information devices, but it must be done in a way
> which doesn't open gaping security or misuse holes.
You seem to disagree only with what you think you read,
not with what the code does. Please read the actual
code, since it seems to do what you propose.
Rik
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 22:51 ` Peter Williams
@ 2004-02-27 10:06 ` Helge Hafting
2004-02-27 11:04 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Helge Hafting @ 2004-02-27 10:06 UTC (permalink / raw)
To: Peter Williams; +Cc: Timothy Miller, Mike Fedyk, linux-kernel
Peter Williams wrote:
> Timothy Miller wrote:
>
>> How about this:
>>
>> The kernel tracks CPU usage, time slice expiration, and numerous other
>> statistics, and exports them to userspace through /proc or somesuch.
>> Then a user-space daemon adjusts priority.
>
>
> Yes, the right statistics could allow these processes to be identified
> reasonably accurately. The programs in question would have the
> following characteristics:
>
> 1. low CPU usage rate, and
> 2. a very regular pattern of use i.e. the size of each CPU burst would
> be approximately constant as would the size of the intervals between
> each burst.
There is no need for the regularity. When I use a word processor, I
use it very irregularly. Sometimes I type text, and want each letter
typed to appear instantly. This fits well with "low cpu usage" and
sudden short bursts. There may be lots of long delays, though, while
I think about stuff to write. So even though the intervals are
irregular, I still believe I should get the boost as long as the bursts
are small. Doing something big (such as invoking latex on a big
document) is cpu-heavy, but then it is ok not to get the boost.
Current schedulers based on io-waiting get this right already.
>
> The appropriate statistic to identify the second of these would be
> variance or (equivalently but more expensively) standard deviation.
> Whether this problem is bad/important enough to warrant the extra
> overhead of gathering these statistics is a moot point. We had to
> generate very high system loads on a single CPU system in order to cause
> one or two skips in xmms over a period of a couple of minutes.
Well, perhaps you could give a slightly bigger boost to a very regular
thing like xmms. But even that might have some snags, the load
might change a lot when doing midi in software, depending on how
many instruments are active simultaneously. There goes the
constant-size bursts.
Helge Hafting
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-27 10:06 ` Helge Hafting
@ 2004-02-27 11:04 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-02-27 11:04 UTC (permalink / raw)
To: Helge Hafting; +Cc: Timothy Miller, Mike Fedyk, linux-kernel
Helge Hafting wrote:
> Peter Williams wrote:
>
>> Timothy Miller wrote:
>>
>>> How about this:
>>>
>>> The kernel tracks CPU usage, time slice expiration, and numerous
>>> other statistics, and exports them to userspace through /proc or
>>> somesuch. Then a user-space daemon adjusts priority.
>>
>>
>>
>> Yes, the right statistics could allow these processes to be identified
>> reasonably accurately. The programs in question would have the
>> following characteristics:
>>
>> 1. low CPU usage rate, and
>> 2. a very regular pattern of use i.e. the size of each CPU burst
>> would be approximately constant as would the size of the intervals
>> between each burst.
>
>
> There is no need for the regularity. When I use a word processor, I
> use it very irregularly. Sometimes I type text, and want each letter
> typed to appear instantly. This fits well with "low cpu usage" and
> sudden short bursts. There may be lots of long delays, though, while
> I think about stuff to write. So even though the intervals are
> irregular, I still believe I should get the boost as long as the
> bursts are small. Doing something big (such as invoking latex on a
> big document) is cpu-heavy, but then it is ok not to get the boost.
> Current schedulers based on io-waiting get this right already.
I was describing programs such as xmms NOT normal interactive programs
such as editors etc.
Normal interactive programs don't need special treatment with our EBS
scheduler because their inherent low usage rate means that they are
always well under their entitlement and consequently get given a high
dynamic priority.
>
>>
>> The appropriate statistic to identify the second of these would be
>> variance or (equivalently but more expensively) standard deviation.
>> Whether this problem is bad/important enough to warrant the extra
>> overhead of gathering these statistics is a moot point. We had to
>> generate very high system loads on a single CPU system in order to
>> cause one or two skips in xmms over a period of a couple of minutes.
>
>
> Well, perhaps you could give a slightly bigger boost to a very regular
> thing like xmms. But even that might have some snags, the load
> might change a lot when doing midi in software, depending on how
> many instruments are active simultaneously. There goes the
> constant-size bursts.
I think the burst sizes and intervals would still be fairly constant but
the usage rate would be higher.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 20:15 ` Peter Williams
@ 2004-02-27 14:46 ` Timothy Miller
2004-02-28 5:00 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Timothy Miller @ 2004-02-27 14:46 UTC (permalink / raw)
To: Peter Williams; +Cc: linux-kernel
Peter Williams wrote:
> Timothy Miller wrote:
>
>>
>>
>> We don't want user-space programs to have control over priority.
>
>
> They already do e.g. renice is such a program.
No one's talking about LOWERING priority here. You can only DoS someone
else if you can set negative nice values, and non-root can't do that.
>
>> This is DoS waiting to happen.
>>
>>> 2. have a user space daemon poll running tasks periodically and
>>> renice them if they are running specified binaries
>>
>>
>>
>> This is much too specific. Again, if the USER has control over this
>> list,
>
>
> It would obviously be under root control.
And that means if someone wants to run a program which is not on the
list but which requires (and deserves) higher priority, they cannot.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-27 14:46 ` Timothy Miller
@ 2004-02-28 5:00 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-02-28 5:00 UTC (permalink / raw)
To: Timothy Miller; +Cc: linux-kernel
Timothy Miller wrote:
>
>
> Peter Williams wrote:
>
>> Timothy Miller wrote:
>>
>
>>>
>>>
>>> We don't want user-space programs to have control over priority.
>>
>>
>>
>> They already do e.g. renice is such a program.
>
>
> No one's talking about LOWERING priority here. You can only DoS someone
> else if you can set negative nice values, and non-root can't do that.
Which is why root has to be in control of the mechanism.
>
>>
>>> This is DoS waiting to happen.
>>>
>>>> 2. have a user space daemon poll running tasks periodically and
>>>> renice them if they are running specified binaries
>>>
>>>
>>>
>>>
>>> This is much too specific. Again, if the USER has control over this
>>> list,
>>
>>
>>
>> It would obviously be under root control.
>
>
> And that means if someone wants to run a program which is not on the
> list but which requires (and deserves) higher priority, they cannot.
Any mechanism that causes a task to be treated more favourably than
others needs to be under root's control. It's root's prerogative to
decide who deserves more favourable treatment. This is even more
important (for obvious reasons) if a reservation (or guarantee)
mechanism is involved.
I'd like to stress at this point that xmms only needs a boost under
extremely high loads and this issue is being blown out of proportion.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-27 3:44 ` Rik van Riel
@ 2004-02-28 21:27 ` Bill Davidsen
2004-02-28 23:55 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Bill Davidsen @ 2004-02-28 21:27 UTC (permalink / raw)
To: Rik van Riel; +Cc: Kernel Mailing List
On Thu, 26 Feb 2004, Rik van Riel wrote:
> On Thu, 26 Feb 2004, Bill Davidsen wrote:
>
> > I disagree.
>
> > It would be nice to have the scheduler identify processes which
> > interface to user information devices, but it must be done in a way
> > which doesn't open gaping security or misuse holes.
>
> You seem to disagree only with what you think you read,
> not with what the code does. Please read the actual
> code, since it seems to do what you propose.
I disagree with the paragraph preceding my comment, which you removed to
take what I said out of context. And I still disagree. I "think I read"
that just fine, although it may not correctly summarize the implementation
of the code.
In any case, as long as the code provides the protection against letting
users change priorities to hog resources I don't disagree with that.
Experience has shown that people WILL abuse any mechanism which gives them
an unfair share of a shared system. For home systems that's less
important, obviously.
--
bill davidsen <davidsen@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-28 21:27 ` Bill Davidsen
@ 2004-02-28 23:55 ` Peter Williams
2004-03-04 21:08 ` Timothy Miller
0 siblings, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-28 23:55 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Rik van Riel, Kernel Mailing List
Bill Davidsen wrote:
> On Thu, 26 Feb 2004, Rik van Riel wrote:
>
>
>>On Thu, 26 Feb 2004, Bill Davidsen wrote:
>>
>>
>>>I disagree.
>>
>>>It would be nice to have the scheduler identify processes which
>>>interface to user information devices, but it must be done in a way
>>>which doesn't open gaping security or misuse holes.
>>
>>You seem to disagree only with what you think you read,
>>not with what the code does. Please read the actual
>>code, since it seems to do what you propose.
>
>
> I disagree with the paragraph preceding my comment, which you removed to
> take what I said out of context. And I still disagree. I "think I read"
> that just fine, although it may not correctly summarize the implementation
> of the code.
>
> In any case, as long as the code provides the protection against letting
> users change priorities to hog resources I don't disagree with that.
> Experience has shown that people WILL abuse any mechanism which gives them
> an unfair share of a shared system. For home systems that's less
> important, obviously.
>
The O(1) Entitlement Based Scheduler places restrictions on setting
task attributes (i.e. shares and caps) equivalent to those placed on
using nice and renice. I.e. ordinary users can only change settings on
their own processes and only if the change is more restrictive than the
current setting. In particular, they cannot increase a task's shares,
only decrease them; they can impose or reduce a cap but not release or
increase it; and they can change a soft cap to a hard cap but cannot
change a hard cap to a soft cap.
Additionally, only root can change the scheduler's tuning parameters.
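Those rules all reduce to "non-root changes may only be more restrictive"; a sketch of the check, with illustrative names and an illustrative no-cap sentinel (not the patch's actual interface):

```c
#include <assert.h>
#include <limits.h>

#define EBS_NO_CAP INT_MAX /* sentinel: no usage-rate cap set */

struct ebs_attrs {
	int shares; /* 1..420 entitlement shares */
	int cap;    /* usage-rate cap, EBS_NO_CAP if none */
	int hard;   /* 0 = soft cap, 1 = hard cap */
};

/* Root may change anything; ordinary users may only tighten. */
static int ebs_change_allowed(const struct ebs_attrs *cur,
			      const struct ebs_attrs *new, int is_root)
{
	if (is_root)
		return 1;
	if (new->shares > cur->shares) /* shares: decrease only */
		return 0;
	if (new->cap > cur->cap)       /* caps: impose or reduce only */
		return 0;
	if (new->hard < cur->hard)     /* hard -> soft is forbidden */
		return 0;
	return 1;
}
```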
I hope this alleviates your concerns,
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no>
@ 2004-02-29 11:58 ` Joachim B Haga
2004-02-29 20:39 ` Paul Jackson
2004-02-29 22:56 ` Peter Williams
0 siblings, 2 replies; 66+ messages in thread
From: Joachim B Haga @ 2004-02-29 11:58 UTC (permalink / raw)
To: Peter Williams; +Cc: Timothy Miller, linux-kernel
Peter Williams <peterw@aurema.com> writes:
>>> They already do e.g. renice is such a program.
>> No one's talking about LOWERING priority here. You can only DoS
>> someone else if you can set negative nice values, and non-root
>> can't do that.
>
> Which is why root has to be in control of the mechanism.
It seems to me that much of this could be solved if the user *were*
allowed to lower nice values (down to 0).
Right now the only way I can prioritize between my own processes is by
starting important/timing sensitive programs normally and everything
else reniced. The problem is that the first category consists of one
or two programs while the second category is, well, "everything else".
I would *love* to be able to start the window manager and all children
at +10 and be able to adjust priorities, from 0 (important user-level)
to 10 (normal) to 20. Negative values could still be root-only.
So why shouldn't this be possible? Because a greedy user in a
multi-user system would just run everything at max prio thus defeating
the purpose? Sure, that would be annoying but it would have another
solution i.e. an entitlement based scheduler or something.
(and isn't it this simple?)
--- linux-2.6.3-mm3/kernel/sys.c.orig 2004-02-29 12:58:45.000000000 +0100
+++ linux-2.6.3-mm3/kernel/sys.c 2004-02-29 12:59:20.000000000 +0100
@@ -276,7 +276,7 @@
error = -EPERM;
goto out;
}
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
+ if (niceval < 0 && !capable(CAP_SYS_NICE)) {
error = -EACCES;
goto out;
}
Regards,
Joachim B Haga
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-29 11:58 ` Joachim B Haga
@ 2004-02-29 20:39 ` Paul Jackson
2004-02-29 22:56 ` Peter Williams
1 sibling, 0 replies; 66+ messages in thread
From: Paul Jackson @ 2004-02-29 20:39 UTC (permalink / raw)
To: Joachim B Haga; +Cc: peterw, miller, linux-kernel
Seems like we are trying to manage something worth managing,
which is how much of a system's CPU capacity is consumed by all
of a given user's tasks over periods of minutes, by micromanaging
the scheduling of individual tasks over periods of ticks.
We don't manage disk space by telling someone no file bigger
than 1 megabyte. Rather they get an upper limit on all their
files combined. If they want to spend most of that on one
file, that's fine.
Is there any way to provide a mechanism that would support
administering a system as follows:
1) Users get so much CPU usage allowed, determined by an upper
limit on a running average of the combined CPU usage of all
their tasks, with a half life perhaps on the order of minutes.
2) They can nice their tasks up and down, within a decent range,
as they will.
3) But if they push too close to their allowed limit, all
their tasks get reined in. The relative priorities within
their own tasks are not changed, but the priority of their
tasks relative to other users is weakened.
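A sketch of the mechanism being asked for, with illustrative constants (the decay rate, the 90% soft threshold and the penalty scale are all assumptions): each user's combined usage is one decayed counter, and every task of that user shares a uniform priority penalty as it nears the limit.

```c
#include <assert.h>

#define DECAY_SHIFT 6 /* alpha = 1/64: half life of roughly 44 ticks */

/* Per tick: decay the user's combined rate (in permill) and add the
 * permill of CPU that all of the user's tasks consumed this tick. */
static int user_rate_tick(int rate, int used_this_tick)
{
	return rate - (rate >> DECAY_SHIFT) + (used_this_tick >> DECAY_SHIFT);
}

/* 0 = no penalty; grows linearly from 0 to 10 priority levels as
 * usage climbs from 90% of the limit up to the limit itself. */
static int user_prio_penalty(int rate, int limit)
{
	int soft = limit - limit / 10;

	return rate <= soft ? 0 : (rate - soft) * 10 / (limit - soft);
}
```

Because the penalty is applied identically to every task of the user, their internal nice ordering is preserved, as item 3 requires.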
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-29 11:58 ` Joachim B Haga
2004-02-29 20:39 ` Paul Jackson
@ 2004-02-29 22:56 ` Peter Williams
1 sibling, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-02-29 22:56 UTC (permalink / raw)
To: Joachim B Haga; +Cc: Timothy Miller, linux-kernel
Joachim B Haga wrote:
> Peter Williams <peterw@aurema.com> writes:
>
>
>>>>They already do e.g. renice is such a program.
>>>
>>>No one's talking about LOWERING priority here. You can only DoS
>>>someone else if you can set negative nice values, and non-root
>>>can't do that.
>>
>>Which is why root has to be in control of the mechanism.
>
>
> It seems to me that much of this could be solved if the user *were*
> allowed to lower nice values (down to 0).
>
> Right now the only way I can prioritize between my own processes is by
> starting important/timing sensitive programs normally and everything
> else reniced. The problem is that the first category consists of one
> or two programs while the second category is, well, "everything else".
>
> I would *love* to be able to start the window manager and all children
> at +10 and be able to adjust priorities, from 0 (important user-level)
> to 10 (normal) to 20. Negative values could still be root-only.
>
> So why shouldn't this be possible? Because a greedy user in a
> multi-user system would just run everything at max prio thus defeating
> the purpose? Sure, that would be annoying but it would have another
> solution i.e. an entitlement based scheduler or something.
More importantly it would allow ordinary users to override root's
settings e.g. if (for whatever reason) the sysadmin decided to renice a
task to 19 (say) this modification would allow the owner of the task to
renice it back to zero. This is the reason that it isn't allowed.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] <894006121@toto.iv>
@ 2004-03-01 0:00 ` Peter Chubb
2004-03-02 1:25 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Peter Chubb @ 2004-03-01 0:00 UTC (permalink / raw)
To: Paul Jackson; +Cc: Joachim B Haga, peterw, miller, linux-kernel
>>>>> "Paul" == Paul Jackson <pj@sgi.com> writes:
Paul> Is there anyway to provide a mechanism that would support
Paul> administering a system as follows:
Paul> 1) Users get so much CPU usage allowed, determined by an upper
Paul> limit on a running average of the combined CPU usage of all
Paul> their tasks, with a half life perhaps on the order of minutes.
Paul> 2) They can nice their tasks up and down, within a decent
Paul> range, as they will.
Paul> 3) But if they push too close to their allowed limit, all
Paul> their tasks get reined in. The relative priorities within their
Paul> own tasks are not changed, but the priority of their tasks
Paul> relative to other users is weakened.
This is exactly what the commercial product ARMtech does. The EBS
that Aurema have just released as open source is (a small) part of the
commercial product.
See http://www.aurema.com
Peter C (an ex-employee of Aurema)
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
The technical we do immediately, the political takes *forever*
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] <fa.jgj0bdi.b3u6qk@ifi.uio.no>
@ 2004-03-01 1:54 ` Andy Lutomirski
2004-03-01 2:54 ` Peter Williams
2004-03-02 23:36 ` Peter Williams
0 siblings, 2 replies; 66+ messages in thread
From: Andy Lutomirski @ 2004-03-01 1:54 UTC (permalink / raw)
To: John Lee; +Cc: linux-kernel
How hard would it be to make shares hierarchical? For example (quoted names are
just descriptive):
"guaranteed" (10 shares) "user" (5 shares)
| |
----------------- -----------------
| | | |
"root" (1) "apache" (2) "bob" (5) "fred" (5)
| | | |
(more groups?) (web servers) etc. etc.
This way one user is prevented from taking unfair CPU time by launching too many
processes, apache gets enough time no matter what, etc. In this scheme, numbers
of shares would only be comparable if they are children of the same node. Also,
it now becomes safe to let users _increase_ priorities of their processes -- it
doesn't affect anyone else.
Ignoring limits, this should be just an exercise in keeping track of shares and
eliminating the 1/420 limit in precision. It would take some thought to figure
out what nice should do.
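One way to see why the bookkeeping stays simple: a task's effective entitlement is just the product of share fractions along the path to the root. A sketch (the struct is illustrative, not a proposed interface):

```c
#include <assert.h>
#include <stddef.h>

struct share_node {
	int shares;       /* this node's shares relative to its siblings */
	int level_shares; /* sum of shares across this node and siblings */
	const struct share_node *parent; /* NULL at the root */
};

/* Effective entitlement in parts per million: the product of this
 * node's sibling fraction at every level up to the root. */
static long effective_ppm(const struct share_node *n)
{
	long ppm = 1000000;

	for (; n != NULL; n = n->parent)
		ppm = ppm * n->shares / n->level_shares;
	return ppm;
}
```

With the tree above, "apache" gets 2/3 of "guaranteed"'s 10/15, i.e. about 44% of the machine, no matter how many tasks "bob" spawns.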
Also, could interactivity problems be solved with something like this:
prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5
"i" would be a per-process interactivity factor (normally 1, but higher for
interactive processes) which would only boost them when their CPU usage is low.
This makes interactive processes get their timeslices early (very high
priority at low CPU consumption) but prevents abuse by preventing excessive CPU
consumption. This could even be set by the (untrusted) process itself.
I imagine that these two together would nicely solve most interactivity and
fairness issues -- the former prevents starvation by other users and the latter
prevents latency caused by large numbers of CPU-light tasks.
Is this sane? And does it break the O(1) promotion algorithm?
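Numerically, the suggested mapping behaves like this (a direct transcription of the formula, with priority as a fraction in [0,1] where lower is more favourable; the clamping is an added assumption):

```c
#include <assert.h>

/* prio = (usage - 0.5) * i + 0.5, clamped to [0,1].
 * With i > 1 a task is boosted only while its usage is below 50%,
 * and penalised correspondingly above it. */
static double ebs_prio(double usage, double i)
{
	double p = (usage - 0.5) * i + 0.5;

	if (p < 0.0)
		p = 0.0;
	if (p > 1.0)
		p = 1.0;
	return p;
}
```

So a self-declared "interactive" task gains nothing at the 50% midpoint and loses ground above it, which is what would make the knob safe to expose to untrusted processes.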
--Andy
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 1:54 ` Andy Lutomirski
@ 2004-03-01 2:54 ` Peter Williams
2004-03-01 3:46 ` Andy Lutomirski
2004-03-02 23:36 ` Peter Williams
1 sibling, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-03-01 2:54 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: John Lee, linux-kernel
Andy Lutomirski wrote:
> How hard would it be to make shares hierarchical? For example (quoted
> names are just descriptive):
>
> "guaranteed" (10 shares) "user" (5 shares)
> | |
> ----------------- -----------------
> | | | |
> "root" (1) "apache" (2) "bob" (5) "fred" (5)
> | | | |
> (more groups?) (web servers) etc. etc.
>
>
> This way one user is prevented from taking unfair CPU time by launching
> too many processes, apache gets enough time no matter what, etc. In
> this scheme, numbers of shares would only be comparable if they are
> children of the same node. Also, it now becomes safe to let users
> _increase_ priorities of their processes -- it doesn't affect anyone else.
>
> Ignoring limits, this should be just an exercise in keeping track of
> shares and eliminating the 1/420 limit in precision. It would take some
> thought to figure out what nice should do.
>
As Peter Chubb has stated such control is possible and is available on
Tru64, Solaris and Windows with Aurema's (<http://www.aurema.com>)
ARMTech product. The CKRM project also addresses this issue.
>
> Also, could interactivity problems be solved something like this:
>
> prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5
>
> "i" would be a per-process interactivity factor (normally 1, but higher
> for interactive processes) which would only boost them when their CPU
> usage is low. This makes interactive processes get their timeslices
> early (very high priority at low CPU consumption) but prevents abuse by
> preventing excessive CPU consumption. This could even be set by the
> (untrusted) process itself.
>
Interactive processes do very well under EBS without any special treatment.
Programs such as xmms aren't really interactive processes although they
usually have a very low CPU usage rate like interactive processes. What
distinguishes them is their need for REGULAR access to the CPU. It's
unlikely that such a modification would help with the need for regularity.
Once again I'll stress that in order to cause xmms to skip we had to (on
a single CPU machine) run a kernel build with -j 16 which causes a
system load well in excess of 10 and is NOT a normal load. Under normal
loads xmms performs OK.
>
> I imagine that these two together would nicely solve most interactivity
> and fairness issues -- the former prevents starvation by other users and
> the latter prevents latency caused by large numbers of CPU-light tasks.
>
>
> Is this sane?
Yes. Fairness between users rather than between tasks is a sane desire
but beyond the current scope of EBS.
> And does it break the O(1) promotion algorithm?
No, it would not break the O(1) promotion algorithm.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 2:54 ` Peter Williams
@ 2004-03-01 3:46 ` Andy Lutomirski
2004-03-01 4:18 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Andy Lutomirski @ 2004-03-01 3:46 UTC (permalink / raw)
To: Peter Williams; +Cc: Andy Lutomirski, John Lee, linux-kernel
Peter Williams wrote:
> Andy Lutomirski wrote:
>
>> How hard would it be to make shares hierarchical? For example (quoted
>> names are just descriptive):
>
> As Peter Chubb has stated such control is possible and is available on
> Tru64, Solaris and Windows with Aurema's (<http://www.aurema.com>)
> ARMTech product. The CKRM project also addresses this issue.
Cool. I hadn't realized ARMTech did that, and I haven't fully read up on CKRM.
>
>>
>> Also, could interactivity problems be solved something like this:
>>
>> prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5
>>
>> "i" would be a per-process interactivity factor (normally 1, but
>> higher for interactive processes) which would only boost them when
>> their CPU usage is low. This makes interactive processes get their
>> timeslices early (very high priority at low CPU consumption) but
>> prevents abuse by preventing excessive CPU consumption. This could
>> even be set by the (untrusted) process itself.
>>
>
> Interactive processes do very well under EBS without any special treatment.
>
> Programs such as xmms aren't really interactive processes although they
> usually have a very low CPU usage rate like interactive processes. What
> distinguishes them is their need for REGULAR access to the CPU. It's
> unlikely that such a modification would help with the need for regularity.
I'm guessing the reason make -j16 broke it is because it (i.e. make) spawns lots
of processes frequently. Since make probably uses almost no CPU under this load,
its usage per share stays near zero. As long as it is below that of xmms, then
all of its children are too, at least until part-way into their first
timeslices. The problem is that there are a bunch of these children, and, if
enough start at once, they all run before xmms, and the audio buffer underruns.
My approach would give xmms a better chance of running sooner (in fact, it
would potentially have better priority than any non-interactive task until it
started hogging CPU).
>
> Once again I'll stress that in order to cause xmms to skip we had to (on
> a single CPU machine) run a kernel build with -j 16 which causes a
> system load well in excess of 10 and is NOT a normal load. Under normal
> loads xmms performs OK.
>
>>
>> I imagine that these two together would nicely solve most
>> interactivity and fairness issues -- the former prevents starvation by
>> other users and the latter prevents latency caused by large numbers of
>> CPU-light tasks.
>>
>>
>> Is this sane?
>
>
> Yes. Fairness between users rather than between tasks is a sane desire
> but beyond the current scope of EBS.
I have this strange masochistic desire to implement this. Don't expect patches
any time soon -- it would be my first time playing with the scheduler ;)
> Peter
--Andy
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 23:24 ` Peter Williams
@ 2004-03-01 3:47 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-01 3:47 UTC (permalink / raw)
To: Albert Cahalan; +Cc: johnl, linux-kernel mailing list
Peter Williams wrote:
> Albert Cahalan wrote:
<snip>
>> Nothing can rely on it existing at all, so a name change would
>> solve the problem of apps getting confused.
>>
>> BTW, permill is not per-million, it is per-thousand.
>> Per-million or per-billion would be fine as long as
>> it doesn't overflow.
>
>
> OK. Since SleepAVG does not seem to be entrenched in people's
> expectations, and because the values calculated from our
> usage rates are not really valid (see above), I propose that we change
> this field's name to CPUrate and report the CPU usage rate directly in
> permill (unless there are violent objections).
I've analysed the arithmetic and think that we can safely go to 6
decimal places so I'll make this Per-million. The change should appear
in the patches to the 2.6.4 kernel when it is available.
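Assuming the internal usage rate is held as a binary fraction in [0, 1] (a denominator of 2^27 is mentioned later in this thread; that constant is an assumption here), reporting it in parts per million needs only a 64-bit multiply and a shift, never a 64-bit division, so it stays portable:

```c
#include <stdint.h>

/* Hypothetical sketch: convert a CPU usage rate stored as a binary
 * fraction with an implied denominator of 2^27 into parts per
 * million for reporting via /proc.  The intermediate product needs
 * 64 bits, but shifting back down avoids 64-bit division, which
 * some 32-bit kernels cannot do cheaply. */
#define RATE_FRAC_BITS 27

static uint32_t rate_to_ppm(uint32_t rate)
{
	return (uint32_t)(((uint64_t)rate * 1000000u) >> RATE_FRAC_BITS);
}
```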
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 3:46 ` Andy Lutomirski
@ 2004-03-01 4:18 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-01 4:18 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: Andy Lutomirski, John Lee, linux-kernel
Andy Lutomirski wrote:
> Peter Williams wrote:
>
>> Andy Lutomirski wrote:
>>
>>> How hard would it be to make shares hierarchical? For example (quoted
>>> names are just descriptive):
>>
>>
>> As Peter Chubb has stated such control is possible and is available on
>> Tru64, Solaris and Windows with Aurema's (<http://www.aurema.com>)
>> ARMTech product. The CKRM project also addresses this issue.
>
>
> Cool. I hadn't realized ARMTech did that, and I haven't fully read up
> on CKRM.
>
>>
>>>
>>> Also, could interactivity problems be solved something like this:
>>>
>>> prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5
>>>
>>> "i" would be a per-process interactivity factor (normally 1, but
>>> higher for interactive processes) which would only boost them when
>>> their CPU usage is low. This makes interactive processes get their
>>> timeslices early (very high priority at low CPU consumption) but
>>> prevents abuse by preventing excessive CPU consumption. This could
>>> even be set by the (untrusted) process itself.
>>>
>>
>> Interactive processes do very well under EBS without any special
>> treatment.
>>
>> Programs such as xmms aren't really interactive processes although
>> they usually have a very low CPU usage rate like interactive
>> processes. What distinguishes them is their need for REGULAR access
>> to the CPU. It's unlikely that such a modification would help with
>> the need for regularity.
>
>
> I'm guessing the reason make -j16 broke it is because it (i.e. make)
> spawns lots of processes frequently. Since make probably uses almost no
> CPU under this load, its usage per share stays near zero. As long as it
> is below that of xmms, then all of its children are too, at least until
> part-way into their first timeslices. The problem is that there are a
> bunch of these children, and, if enough start at once, they all run
> before xmms, and the audio buffer underruns. My approach would give
> xmms a better chance of running sooner (in fact, it would potentially
> have better priority than any non-interactive task until it started
> hogging CPU).
That's close to the reason. The real culprit is the dreaded "ramp up"
which is our tag for the fact that it takes a short while (dependent on
half life) for the scheduler to estimate the usage rate of a new process
(or a sudden change in the usage rate of an existing process), so such
tasks get treated as low usage tasks for the first part of their life. In
most cases this is a good thing as it causes commands run from the
command line to get a boost and since a lot of these are very short
lived (e.g. ls) they run very quickly and have very good response times.
In the case of a kernel build there are lots of C compiler tasks being
launched and these run for several seconds but for the first few moments
of their lives (during ramp up) they look like low CPU usage tasks and
get high priority. This effect is short lived and doesn't affect normal
interactive programs because regularity of access isn't important to
them, and the estimated usages of the C compiler tasks quickly exceed
the estimated usages of the interactive tasks. It only has an effect on
xmms because regularity of access is important to it (i.e. xmms IS
getting sufficient CPU but isn't always getting it as regularly as it
likes).
The X server is a different case. Although it isn't an interactive
program it does have an influence on interactive responsiveness when X
windows is being used. Unfortunately, because it is generally serving a
number of clients, its CPU usage can be quite high and it takes a little
longer for the estimated usage rate of the C compiler tasks to exceed
its estimated usage. For this reason, we recommend that sysadmins
arrange for the X server to be run at somewhere between nice -9 and nice
-15.
It should be obvious from the above that another way that interactive
response can be improved is by shortening the half life. But there is a
trade off with lowered overall system throughput resulting if the half
life is too short. In the current version of EBS, root can change the
half life on a running system and this functionality is there so that
experiments into the effect of changing the half life on system
performance can be conducted without the need to recompile the kernel
and reboot the system. The default value of 5 seconds is one that we've
found is good for overall performance and interactive response provided
that the X server is run with an appropriate level of niceness.
>
>>
>> Once again I'll stress that in order to cause xmms to skip we had to
>> (on a single CPU machine) run a kernel build with -j 16 which causes a
>> system load well in excess of 10 and is NOT a normal load. Under
>> normal loads xmms performs OK.
>>
>>>
>>> I imagine that these two together would nicely solve most
>>> interactivity and fairness issues -- the former prevents starvation
>>> by other users and the latter prevents latency caused by large
>>> numbers of CPU-light tasks.
>>>
>>>
>>> Is this sane?
>>
>>
>>
>> Yes. Fairness between users rather than between tasks is a sane
>> desire but beyond the current scope of EBS.
>
>
> I have this strange masochistic desire to implement this. Don't expect
> patches any time soon -- it would be my first time playing with the
> scheduler ;)
>
Good luck
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <fa.cvc8vnj.ahebjd@ifi.uio.no>
@ 2004-03-01 9:18 ` Joachim B Haga
2004-03-01 10:18 ` Paul Wagland
0 siblings, 1 reply; 66+ messages in thread
From: Joachim B Haga @ 2004-03-01 9:18 UTC (permalink / raw)
To: Peter Williams; +Cc: Joachim B Haga, Timothy Miller, linux-kernel
Peter Williams <peterw@aurema.com> writes:
>> It seems to me that much of this could be solved if the user *were*
>> allowed to lower nice values (down to 0).
[snip]
>> to 10 (normal) to 20. Negative values could still be root-only. So
>> why shouldn't this be possible? Because a greedy user in a
>> multi-user system would just run everything at max prio thus
>> defeating the purpose? Sure, that would be annoying but it would
>> have another solution ie. an entitlement based scheduler or
>> something.
> More importantly it would allow ordinary users to override root's
> settings e.g. if (for whatever reason) the sysadmin decided to
> renice a task to 19 (say) this modification would allow the owner of
> the task to renice it back to zero. This is the reason that it
> isn't allowed.
"You dirty cracker! A renice +19, that'll teach you!" :-)
Seriously though, the same is true today, it's just a bit more
cumbersome. Restart the task and you're back to 0. If the sysadmin
wants to stop that, he'll renice your shell. In which case you login
again. And so on.
My point is that this is a problem (annoying user) which has better
solutions (ranging from a polite e-mail to deluser) because renice
won't stop him.
And it's not a *security* concern, as long as the lower values are
still reserved.
I would say the benefit is very small (I mean: who has ever relied on
it?) compared to the difficulties created for users.
> Peter
Joachim
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 9:18 ` Joachim B Haga
@ 2004-03-01 10:18 ` Paul Wagland
2004-03-01 19:11 ` Mike Fedyk
0 siblings, 1 reply; 66+ messages in thread
From: Paul Wagland @ 2004-03-01 10:18 UTC (permalink / raw)
To: Joachim B Haga; +Cc: Peter Williams, Timothy Miller, linux-kernel
On Mon, 2004-03-01 at 10:18, Joachim B Haga wrote:
> Peter Williams <peterw@aurema.com> writes:
>
> >> It seems to me that much of this could be solved if the user *were*
> >> allowed to lower nice values (down to 0).
> [snip]
> >> to 10 (normal) to 20. Negative values could still be root-only. So
> >> why shouldn't this be possible? Because a greedy user in a
>
> > More importantly it would allow ordinary users to override root's
> > settings e.g. if (for whatever reason) the sysadmin decided to
> And it's not a *security* concern, as long as the lower values are
> still reserved.
>
> I would say the benefit is very small (I mean: who has ever relied on
> it?) compared to the difficulties created for users.
Under Linux, I can't say, but certainly on my old school machine (~10
years ago) all student accounts would run at +5, all staff accounts
would run at +0. This was handled by the login process, so re-logging in
would not help you at all....
Cheers,
Paul
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 10:18 ` Paul Wagland
@ 2004-03-01 19:11 ` Mike Fedyk
0 siblings, 0 replies; 66+ messages in thread
From: Mike Fedyk @ 2004-03-01 19:11 UTC (permalink / raw)
To: Paul Wagland; +Cc: Joachim B Haga, Peter Williams, Timothy Miller, linux-kernel
Paul Wagland wrote:
> On Mon, 2004-03-01 at 10:18, Joachim B Haga wrote:
>
>>Peter Williams <peterw@aurema.com> writes:
>>
>>
>>>>It seems to me that much of this could be solved if the user *were*
>>>>allowed to lower nice values (down to 0).
>>
>>[snip]
>>
>>>>to 10 (normal) to 20. Negative values could still be root-only. So
>>>>why shouldn't this be possible? Because a greedy user in a
>>
>>
>>
>>>More importantly it would allow ordinary users to override root's
>>>settings e.g. if (for whatever reason) the sysadmin decided to
>
>
>>And it's not a *security* concern, as long as the lower values are
>>still reserved.
>>
>>I would say the benefit is very small (I mean: who has ever relied on
>>it?) compared to the difficulties created for users.
>
>
> Under Linux, I can't say, but certainly on my old school machine (~10
> years ago) all student accounts would run at +5, all staff accounts
> would run at +0. This was handled by the login process, so re-logging in
> would not help you at all....
I think you can do this with pam or login under linux. I know I did
something like this, but since most of my users were samba users, it
wasn't very useful at the time.
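On a system where the login service includes pam_limits in its PAM stack, something like the following in /etc/security/limits.conf reproduces that old setup (the group names are illustrative, and the "priority" item is the pam_limits knob for the starting nice value):

```
# /etc/security/limits.conf -- "priority" sets the initial nice value
@students   hard   priority   5
@staff      hard   priority   0
```

Because this is applied at session setup, logging out and back in does not escape it, unlike a plain renice of the shell.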
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 0:00 ` Peter Chubb
@ 2004-03-02 1:25 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-02 1:25 UTC (permalink / raw)
To: Peter Chubb; +Cc: Paul Jackson, Joachim B Haga, miller, linux-kernel
Peter Chubb wrote:
>>>>>>"Paul" == Paul Jackson <pj@sgi.com> writes:
>
>
>
> Paul> Is there anyway to provide a mechanism that would support
> Paul> administering a system as follows:
>
> Paul> 1) Users get so much CPU usage allowed, determined by an upper
> Paul> limit on a running average of the combined CPU usage of all
> Paul> their tasks, with a half life perhaps on the order of minutes.
>
> Paul> 2) They can nice their tasks up and down, within a decent
> Paul> range, as they will.
>
> Paul> 3) But if they push too close to their allowed limit, all
> Paul> their tasks get reined in. The relative priorities within their
> Paul> own tasks are not changed, but the priority of their tasks
> Paul> relative to other users is weakened.
>
> This is exactly what the commercial product ARMtech does. The EBS
> that Aurema have just released as open source is (a small) part of the
> commercial product.
Not strictly speaking a part of our commercial product but based on the
CPU scheduling technology in that product. As you know, the CPU
scheduler in the kernel based versions of our product relies on a
generic "plug in scheduler" interface being present in the host kernel
and runtime loadable kernel modules plug into that interface and take
over scheduling. This mechanism (while allowing our product to work on
a number of different host operating systems) has the disadvantage that
it adds some overhead to the scheduler (this is in addition to the extra
overhead involved in hierarchical scheduling) and places some
restrictions on the scheduler design as it cannot make too many
assumptions about the underlying scheduler in the host operating system.
When the technology is used to build an embedded scheduler (as is the
case with EBS) significant improvements in efficiency can be realised.
So EBS is a souped up, non hierarchical version of our CPU scheduler
designed specially for Linux.
>
> See http://www.aurema.com
>
> Peter C (an ex-employee of Aurema)
>
> --
> Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
> The technical we do immediately, the political takes *forever*
Cheers
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-01 1:54 ` Andy Lutomirski
2004-03-01 2:54 ` Peter Williams
@ 2004-03-02 23:36 ` Peter Williams
1 sibling, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-02 23:36 UTC (permalink / raw)
To: Andy Lutomirski; +Cc: linux-kernel
Andy Lutomirski wrote:
> <snip>
> Ignoring limits, this should be just an exercise in keeping track of
> shares and eliminating the 1/420 limit in precision. It would take some
> thought to figure out what nice should do.
> <snip>
I take it from this comment that you would like to see a larger range of
shares made available?
The current range (1 to 420) was chosen to allow easy mapping between
niceness and shares, and so that the minimum is roughly as many times
smaller than the default as the maximum is bigger (i.e. about twenty
times each way). One of the restrictions on the number of shares is the dynamic
range of the representation of real numbers that we use for our
calculations. We use fixed denominator rational numbers with a
denominator of 2 to the power of 27. This value was chosen because the
maximum (real number) value that we have to be able to cope with in our
calculations is 19 and we are limited to using 32 bits because we need
to do a divide and 64 bit division is not supported in the kernel on all
architectures (in particular, the IA32 kernels do not support 64 bit
division). The other factors that have to be taken into consideration
are the half life and the value of HZ (which varies widely depending on
the system).
Anyway, I will look at the numbers and see if it's possible to squeeze a
larger range of shares in (although it may mean tighter restrictions on
half life on some systems).
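That representation can be sketched as follows (a hypothetical user-space illustration, not the patch's code, and it assumes the 32-bit word is unsigned): with an implied denominator of 2^27 the largest representable value is 2^32 / 2^27 = 32, comfortably covering the required maximum of 19, and only 64-bit multiplication (available everywhere) is ever needed, never 64-bit division.

```c
#include <stdint.h>

/* Fixed-denominator rationals: reals stored in an unsigned 32-bit
 * word with an implied denominator of 2^27, giving a resolution of
 * 2^-27 and a maximum value just under 32. */
#define FRAC_BITS 27
#define FIXED_ONE (1u << FRAC_BITS)

static uint32_t to_fixed(double x)
{
	return (uint32_t)(x * FIXED_ONE + 0.5);
}

static double from_fixed(uint32_t f)
{
	return (double)f / FIXED_ONE;
}

/* multiply two fixed-point values: the intermediate product needs
 * 64 bits, but shifting back down avoids any 64-bit division */
static uint32_t fixed_mul(uint32_t a, uint32_t b)
{
	return (uint32_t)(((uint64_t)a * b) >> FRAC_BITS);
}

/* dividing by a plain integer stays a 32-bit division */
static uint32_t fixed_div_int(uint32_t a, uint32_t n)
{
	return a / n;
}
```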
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <40426E1C.8010806@aurema.com.suse.lists.linux.kernel>
@ 2004-03-03 2:48 ` Andi Kleen
2004-03-03 3:45 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Andi Kleen @ 2004-03-03 2:48 UTC (permalink / raw)
To: Peter Williams; +Cc: linux-kernel, johnl
Peter Williams <peterw@aurema.com> writes:
One comment on the patches: could you remove the zillions of numerical Kconfig
options and just make them sysctls? I don't think it makes any sense
to require a reboot to change any of that. And the user is unlikely
to have much idea yet on what he wants on them while configuring.
I really like the reduced scheduler complexity part of your patch BTW.
IMHO the 2.6 scheduler's complexity has gotten out of hand and it's great
that someone is going into the other direction with a simple basic design.
For more wide spread testing it would be useful if you could do
a more minimal less intrusive patch with less configuration
(e.g. only allow tuning via nice, not via other means). This would
be mainly to test your patch on more workloads without any hand tuning,
which is the most important use case.
-Andi
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-03 2:48 ` Andi Kleen
@ 2004-03-03 3:45 ` Peter Williams
2004-03-03 10:13 ` Andi Kleen
2004-03-03 15:57 ` Andi Kleen
0 siblings, 2 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-03 3:45 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, johnl
Andi Kleen wrote:
> Peter Williams <peterw@aurema.com> writes:
>
> One comment on the patches: could you remove the zillions of numerical Kconfig
> options and just make them sysctls? I don't think it makes any sense
> to require a reboot to change any of that. And the user is unlikely
> to have much idea yet on what he wants on them while configuring.
The default initial values should be fine and the default configuration
allows the scheduling tuning parameters (i.e. half life and time slice)
to be changed on a running system via the /proc file system.
These are mainly there so that different settings can be used with
various benchmarks to determine what are the best settings for various
types of loads. If good default values that work well for a wide
variety of load types can be found as a result of these experiments then
these parameters may be made constants in the code. If not they
probably should be made settable via system calls rather than via /proc
as you suggest.
In reality, batch type jobs tend to get better throughput with a longer
half life but a shorter half life gives better interactive response. So
servers and work stations should probably have different settings.
>
> I really like the reduced scheduler complexity part of your patch BTW.
> IMHO the 2.6 scheduler's complexity has gotten out of hand and it's great
> that someone is going into the other direction with a simple basic design.
Thanks, we felt much the same. With a heuristic approach, one more case
that needs special handling always seems to pop up.
>
> For more wide spread testing it would be useful if you could do
> a more minimal less intrusive patch with less configuration
> (e.g. only allow tuning via nice, not via other means). This would
> be mainly to test your patch on more workloads without any hand tuning,
> which is the most important use case.
The "base" patch does this except that it also allows the setting of
soft caps via /proc. But, as I said above, the main reason for the
tuning parameters being exposed (in the full patch) at this time is to
encourage testing with different values (of half life and time slice)
and making them settable via /proc makes this possible without having to
reboot the system.
Except for (possibly) renicing the X server there should be no need to
fiddle with settings for individual tasks.
Peter
PS We are looking at some simple modifications to further improve
interactive response (hopefully) without adding to complexity.
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-03 3:45 ` Peter Williams
@ 2004-03-03 10:13 ` Andi Kleen
2004-03-03 23:46 ` Peter Williams
2004-03-03 15:57 ` Andi Kleen
1 sibling, 1 reply; 66+ messages in thread
From: Andi Kleen @ 2004-03-03 10:13 UTC (permalink / raw)
To: Peter Williams; +Cc: Andi Kleen, linux-kernel, johnl
On Wed, Mar 03, 2004 at 02:45:28PM +1100, Peter Williams wrote:
> Andi Kleen wrote:
> >Peter Williams <peterw@aurema.com> writes:
> >
> >One comment on the patches: could you remove the zillions of numerical
> >Kconfig
> >options and just make them sysctls? I don't think it makes any sense
> >to require a reboot to change any of that. And the user is unlikely
> >to have much idea yet on what he wants on them while configuring.
>
> The default initial values should be fine and the default configuration
> allows the scheduling tuning parameters (i.e. half life and time slice
> ) to be changed on a running system via the /proc file system.
> These are mainly there so that different settings can be used with
> various benchmarks to determine what are the best settings for various
> types of loads. If good default values that work well for a wide
> variety of load types can be found as a result of these experiments then
> these parameters may be made constants in the code. If not they
> probably should be made settable via system calls rather than via /proc
> as you suggest.
No doubt that there are different settings that make sense
for different workloads. But there is no reason one has to recompile
to set them - it's much nicer to just run a script at boot time to set
them, instead of recompiling/rebooting. This will also make benchmarking
much easier, because you can just write a script that sets the
various parameters, runs workloads, sets other parameters, runs
workload again etc. Requiring a recompile and reboot makes this
much harder.
Also I suspect most people who are not heavily into scheduling
theory will go with the defaults at compile time, so it's important
that they already work well.
And consider it from the side of a standard distribution: users
don't normally recompile their kernels there, so everything that
makes sense to be tunable should be settable without recompiling.
IMHO CONFIG_* should not be used for anything except for kernel binary
size tuning and possible compiler tuning.
-Andi
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-03 3:45 ` Peter Williams
2004-03-03 10:13 ` Andi Kleen
@ 2004-03-03 15:57 ` Andi Kleen
2004-03-04 0:41 ` Peter Williams
1 sibling, 1 reply; 66+ messages in thread
From: Andi Kleen @ 2004-03-03 15:57 UTC (permalink / raw)
To: Peter Williams; +Cc: linux-kernel, johnl
On Wed, 03 Mar 2004 14:45:28 +1100
Peter Williams <peterw@aurema.com> wrote:
> Andi Kleen wrote:
> > Peter Williams <peterw@aurema.com> writes:
> >
> > One comment on the patches: could you remove the zillions of numerical Kconfig
> > options and just make them sysctls? I don't think it makes any sense
> > to require a reboot to change any of that. And the user is unlikely
> > to have much idea yet on what he wants on them while configuring.
>
> The default initial values should be fine and the default configuration
> allows the scheduling tuning parameters (i.e. half life and time slice
> ) to be changed on a running system via the /proc file system.
I'm running the 2.6.3-full patch on my workstation now. No tuning applied
at all. I reniced the X server to -10. When I have two kernel compiles (without any -j*)
running there is a visible (=not really slow, but long enough to notice something)
delay in responses while typing something in a xterm. In sylpheed there
is the same issue.
The standard scheduler didn't show this as severely with only two compiles.
-Andi
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
[not found] ` <1vBE2-48V-21@gated-at.bofh.it>
@ 2004-03-03 21:38 ` Bill Davidsen
0 siblings, 0 replies; 66+ messages in thread
From: Bill Davidsen @ 2004-03-03 21:38 UTC (permalink / raw)
To: Andi Kleen; +Cc: Linux Kernel Mailing List
Andi Kleen wrote:
> No doubt that there are different settings that make sense
> for different workloads. But there is no reason one has to recompile
> to set them - it's much nicer to just run a script at boot time to set
> them, instead of recompiling/rebooting. This will also make benchmarking
> much easier, because you can just write a script that sets the
> various parameters, runs workloads, sets other parameters, runs
> workload again etc. Requiring a recompile and reboot makes this
> much harder.
Andi, if people are trying to find an optimal tuning then in many cases
a reboot is out. There are two reasons for this:
- a production server, can't just reboot!
- it's sometimes hard to recreate the load which is causing problems,
and far easier to get a working config by diddling and watching.
At least those are the reasons why I would feel able to tune the
machines which most need it.
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-03 10:13 ` Andi Kleen
@ 2004-03-03 23:46 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-03 23:46 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, johnl
Andi Kleen wrote:
> On Wed, Mar 03, 2004 at 02:45:28PM +1100, Peter Williams wrote:
>
>>Andi Kleen wrote:
>>
>>>Peter Williams <peterw@aurema.com> writes:
>>>
>>>One comment on the patches: could you remove the zillions of numerical
>>>Kconfig
>>>options and just make them sysctls? I don't think it makes any sense
>>>to require a reboot to change any of that. And the user is unlikely
>>>to have much idea yet on what he wants on them while configuring.
>>
>>The default initial values should be fine and the default configuration
>>allows the scheduling tuning parameters (i.e. half life and time slice)
>>to be changed on a running system via the /proc file system.
>>These are mainly there so that different settings can be used with
>>various benchmarks to determine what are the best settings for various
>>types of loads. If good default values that work well for a wide
>>variety of load types can be found as a result of these experiments then
>>these parameters may be made constants in the code. If not they
>>probably should be made settable via system calls rather than via /proc
>>as you suggest.
>
>
> No doubt that there are different settings that make sense
> for different workloads. But there is no reason one has to recompile
> to set them - it's much nicer to just run a script at boot time to set
> them, instead of recompiling/rebooting. This will also make benchmarking
> much easier, because you can just write a script that sets the
> various parameters, runs workloads, sets other parameters, runs
> workload again etc. Requiring a recompile and reboot makes this
> much harder.
As I said (with the full patch) these values can be set via /proc on a
running system so there's no need for a recompile and reboot.
>
> Also I suspect most people who are not heavily into scheduling
> theory will go with the defaults at compile time, so it's important
> that they already work well.
>
> And consider it from the side of a standard distribution: users
> don't normally recompile their kernels there, so everything that
> makes sense to be tunable should be settable without recompiling.
>
> IMHO CONFIG_* should not be used for anything except for kernel binary
> size tuning and possible compiler tuning.
Yes, once this patch has stabilised we will probably remove some (or
all) of the configuration variables.
Peter
--
Dr Peter Williams, Chief Scientist peterw@aurema.com
Aurema Pty Limited Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-03 15:57 ` Andi Kleen
@ 2004-03-04 0:41 ` Peter Williams
2004-03-05 3:55 ` Andi Kleen
0 siblings, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-03-04 0:41 UTC (permalink / raw)
To: Andi Kleen; +Cc: johnl, linux-kernel
Andi Kleen wrote:
> On Wed, 03 Mar 2004 14:45:28 +1100
> Peter Williams <peterw@aurema.com> wrote:
>
>
>>Andi Kleen wrote:
>>
>>>Peter Williams <peterw@aurema.com> writes:
>>>
>>>One comment on the patches: could you remove the zillions of numerical Kconfig
>>>options and just make them sysctls? I don't think it makes any sense
>>>to require a reboot to change any of that. And the user is unlikely
>>>to have much idea yet on what he wants on them while configuring.
>>
>>The default initial values should be fine and the default configuration
>>allows the scheduling tuning parameters (i.e. half life and time slice)
>>to be changed on a running system via the /proc file system.
>
>
> I'm running the 2.6.3-full patch on my workstation now. No tuning applied
> at all. I reniced the X server to -10. When I have two kernel compiles (without any -j*)
> running there is a visible (=not really slow, but long enough to notice something)
> delay in responses while typing something in a xterm. In sylpheed there
> is the same issue.
>
> The standard scheduler didn't show this so severely with only two compiles.
>
Thanks for the feedback. We're looking at some minor modifications to
try and improve this issue.
BTW Could you try it with the X server reniced to -15?
Thanks
Peter
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-28 23:55 ` Peter Williams
@ 2004-03-04 21:08 ` Timothy Miller
0 siblings, 0 replies; 66+ messages in thread
From: Timothy Miller @ 2004-03-04 21:08 UTC (permalink / raw)
To: Peter Williams; +Cc: Bill Davidsen, Rik van Riel, Kernel Mailing List
Peter Williams wrote:
>
> The O(1) Entitlement Based Scheduler places the equivalent restrictions
> on setting task attributes (i.e. shares and caps) as are placed on using
> nice and renice. I.e. ordinary users can only change settings on their
> own processes and only if the change is more restricting than the
> current setting. In particular, they cannot increase a task's shares
> only decrease them, they can impose or reduce a cap but not release or
> increase it and they can change a soft cap to a hard cap but cannot
> change a hard cap to a soft cap.
>
> Additionally, only root can change the scheduler's tuning parameters.
>
> I hope this alleviates your concerns,
I, for one, never had any such concerns. My concern was about the
unprivileged user being unable to run certain applications under load
without prior approval.
Two philosophical points:
1) Perhaps we are trying too hard to please everyone. As Linus said,
perfect is the enemy of good. A good scheduler won't work perfectly for
everyone's application, but it will work very well for the most
important ones. Perhaps people writing schedulers should compete based
on overall throughput and latency, rather than on how well it runs xmms
(and other such apps).
2) Perhaps certain apps like xmms are 'broken' and can be rewritten to
behave better with the new scheduler. For instance, more buffering,
separating the mp3 decoding thread from the thread that feeds
/dev/audio, more efficient decoder, a decoder that voluntarily sleeps
when it's 'done enough', so that it doesn't get knocked down to a lower
priority, a decoder that 'cheats' on audio quality just to maintain low
CPU usage when it finds itself being preempted, etc.
It bears mentioning that many applications work well with 2.4 because
they evolved to work well with the 2.4 scheduler. The 2.6 scheduler is
different. We shouldn't constrain 2.6 for the sake of old apps. Those
old apps should be rewritten to adapt to the new environment. "Working
well under 2.6" doesn't require any more adaptation than with 2.4, but
it does require _different_ adaptation.
This isn't to speak negatively of Con and Nick and others who have
attempted to improve upon the 2.6 scheduler. If they can make old apps
work well without impacting the potential that new apps can get out of
2.6, then more power to them!
* RE: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-02-26 2:18 ` Peter Williams
2004-02-26 2:42 ` Mike Fedyk
2004-02-26 16:08 ` Timothy Miller
@ 2004-03-04 21:18 ` Robert White
2004-03-04 23:15 ` Peter Williams
2 siblings, 1 reply; 66+ messages in thread
From: Robert White @ 2004-03-04 21:18 UTC (permalink / raw)
To: 'Peter Williams', 'Timothy Miller'; +Cc: linux-kernel
At a previous employer (so code not available) I used a simple expedient to
solve this very problem. I had a custom program "shim.c" that tweaked
priorities and environment variables. Basically a fistful of lines that
would take argv[0], look for the file named ".shim_"+basename(argv[0]) {in a
well-defined location} to load some simple environment and path and priority
overrides, apply these changes and then setuid itself back to the real user
and exec() the real program with the received args.
It had a few degenerate cases (shimming out from under a setuid program
was the primary one) but it worked out rather well and had little-to-no
meaningful overhead.
You set up a /usr/local/bin (or equivalent) directory, link shim into that
directory named as the various programs that need to be boosted (e.g. xmms
etc) and put that directory earlier on the path than the real executable.
{If this directory only contains shims, it is useful to code shim.c to
remove that directory from the PATH.}
This technique lets the administrator have fine-grained control of a
reasonable list of priority promotions and permissions overrides without
having to move anything into kernel space or running status daemons.
====
I would think that while fork() should keep the heuristics of its parent,
exec() would probably need to do some normalizing.
====
Has this scheduler been tried for applications like CD burning?
Rob.
-----Original Message-----
From: linux-kernel-owner@vger.kernel.org
[mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Peter Williams
Sent: Wednesday, February 25, 2004 6:18 PM
To: Timothy Miller
Cc: linux-kernel@vger.kernel.org
Subject: Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
Timothy Miller wrote:
> <snip>
> In fact, that may be the only "flaw" in your design. It sounds like
> your scheduler does an excellent job at fairness with very low overhead.
> The only problem with it is that it doesn't determine priority
> dynamically.
This (i.e. automatic renicing of specified programs) is a good idea but
is not really a function that should be undertaken by the scheduler
itself. Two possible solutions spring to mind:
1. modify the do_execve() in fs/exec.c to renice tasks when they execute
specified binaries
2. have a user space daemon poll running tasks periodically and renice
them if they are running specified binaries
Both of these solutions have their advantages and disadvantages, are
(obviously) more complicated than I've made them sound and would require a
great deal of care to be taken during their implementation. However, I
think that they are both doable. My personal preference would be for
the in kernel solution on the grounds of efficiency.
Peter
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-04 21:18 ` Robert White
@ 2004-03-04 23:15 ` Peter Williams
0 siblings, 0 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-04 23:15 UTC (permalink / raw)
To: Robert White; +Cc: 'Timothy Miller', linux-kernel
Robert White wrote:
> <snip>
>
> Has this scheduler been tried for applications like CD burning?
>
No. But it will be now that you've brought it to our attention.
Thanks
Peter
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
2004-03-04 0:41 ` Peter Williams
@ 2004-03-05 3:55 ` Andi Kleen
0 siblings, 0 replies; 66+ messages in thread
From: Andi Kleen @ 2004-03-05 3:55 UTC (permalink / raw)
To: Peter Williams; +Cc: johnl, linux-kernel
On Thu, 04 Mar 2004 11:41:00 +1100
Peter Williams <peterw@aurema.com> wrote:
> BTW Could you try it with the X server reniced to -15?
It seems to be a bit better with -15 (or maybe I'm imagining that), but still
noticeable.
-Andi
end of thread, other threads: [~2004-03-05 3:55 UTC | newest]
Thread overview: 66+ messages
2004-02-26 3:30 [RFC][PATCH] O(1) Entitlement Based Scheduler Albert Cahalan
2004-02-26 6:19 ` Peter Williams
2004-02-26 17:57 ` Albert Cahalan
2004-02-26 23:24 ` Peter Williams
2004-03-01 3:47 ` Peter Williams
[not found] <1vuMd-5jx-5@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-7@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-9@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-11@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-3@gated-at.bofh.it>
[not found] ` <1vvyx-6jy-13@gated-at.bofh.it>
[not found] ` <1vBE2-48V-21@gated-at.bofh.it>
2004-03-03 21:38 ` Bill Davidsen
[not found] <fa.fi4j08o.17nchps@ifi.uio.no.suse.lists.linux.kernel>
[not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no.suse.lists.linux.kernel>
[not found] ` <yydjishqw10p.fsf@galizur.uio.no.suse.lists.linux.kernel>
[not found] ` <40426E1C.8010806@aurema.com.suse.lists.linux.kernel>
2004-03-03 2:48 ` Andi Kleen
2004-03-03 3:45 ` Peter Williams
2004-03-03 10:13 ` Andi Kleen
2004-03-03 23:46 ` Peter Williams
2004-03-03 15:57 ` Andi Kleen
2004-03-04 0:41 ` Peter Williams
2004-03-05 3:55 ` Andi Kleen
[not found] <fa.ftul5bl.nlk3pr@ifi.uio.no>
[not found] ` <fa.cvc8vnj.ahebjd@ifi.uio.no>
2004-03-01 9:18 ` Joachim B Haga
2004-03-01 10:18 ` Paul Wagland
2004-03-01 19:11 ` Mike Fedyk
[not found] <fa.jgj0bdi.b3u6qk@ifi.uio.no>
2004-03-01 1:54 ` Andy Lutomirski
2004-03-01 2:54 ` Peter Williams
2004-03-01 3:46 ` Andy Lutomirski
2004-03-01 4:18 ` Peter Williams
2004-03-02 23:36 ` Peter Williams
[not found] <894006121@toto.iv>
2004-03-01 0:00 ` Peter Chubb
2004-03-02 1:25 ` Peter Williams
[not found] <fa.fi4j08o.17nchps@ifi.uio.no>
[not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no>
2004-02-29 11:58 ` Joachim B Haga
2004-02-29 20:39 ` Paul Jackson
2004-02-29 22:56 ` Peter Williams
[not found] <1t8wp-qF-11@gated-at.bofh.it>
[not found] ` <1th6J-az-13@gated-at.bofh.it>
[not found] ` <403E2929.2080705@tmr.com>
2004-02-27 3:44 ` Rik van Riel
2004-02-28 21:27 ` Bill Davidsen
2004-02-28 23:55 ` Peter Williams
2004-03-04 21:08 ` Timothy Miller
[not found] <1tfy0-7ly-29@gated-at.bofh.it>
[not found] ` <1thzJ-A5-13@gated-at.bofh.it>
[not found] ` <1tjrN-2m5-1@gated-at.bofh.it>
[not found] ` <1tjLa-2Ab-9@gated-at.bofh.it>
[not found] ` <1tlaf-3OY-11@gated-at.bofh.it>
[not found] ` <1tljX-3Wf-5@gated-at.bofh.it>
[not found] ` <1tznd-CP-35@gated-at.bofh.it>
[not found] ` <1tzQe-10s-25@gated-at.bofh.it>
2004-02-26 20:14 ` Bill Davidsen
[not found] <fa.f12rt3d.c0s9rt@ifi.uio.no>
[not found] ` <fa.ishajoq.q5g90m@ifi.uio.no>
2004-02-25 23:33 ` Junio C Hamano
2004-02-26 8:15 ` Catalin BOIE
-- strict thread matches above, loose matches on Subject: below --
2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
2004-02-25 22:12 ` John Lee
2004-02-26 0:31 ` Timothy Miller
2004-02-26 2:04 ` John Lee
2004-02-26 2:18 ` Peter Williams
2004-02-26 2:42 ` Mike Fedyk
2004-02-26 4:10 ` Peter Williams
2004-02-26 4:19 ` Mike Fedyk
2004-02-26 19:23 ` Shailabh Nagar
2004-02-26 19:46 ` Mike Fedyk
2004-02-26 20:42 ` Shailabh Nagar
2004-02-26 16:10 ` Timothy Miller
2004-02-26 19:47 ` Mike Fedyk
2004-02-26 22:51 ` Peter Williams
2004-02-27 10:06 ` Helge Hafting
2004-02-27 11:04 ` Peter Williams
2004-02-26 16:08 ` Timothy Miller
2004-02-26 16:51 ` Rik van Riel
2004-02-26 20:15 ` Peter Williams
2004-02-27 14:46 ` Timothy Miller
2004-02-28 5:00 ` Peter Williams
2004-03-04 21:18 ` Robert White
2004-03-04 23:15 ` Peter Williams
2004-02-26 2:48 ` Nuno Silva
2004-02-26 4:25 ` Peter Williams
2004-02-26 15:57 ` Rik van Riel
2004-02-26 19:28 ` Shailabh Nagar
2004-02-26 16:12 ` Timothy Miller
2004-02-25 22:51 ` Pavel Machek
2004-02-26 3:14 ` John Lee
2004-02-25 23:45 ` Rik van Riel
2004-02-26 7:18 ` John Lee