* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
@ 2004-02-26 3:30 Albert Cahalan
2004-02-26 6:19 ` Peter Williams
0 siblings, 1 reply; 66+ messages in thread
From: Albert Cahalan @ 2004-02-26 3:30 UTC (permalink / raw)
To: linux-kernel mailing list; +Cc: johnl
John Lee writes:
> The usage rates for each task are estimated using Kalman
> filter techniques, the estimates being similar to those
> obtained by taking a running average over twice the filter
> _response half life_ (see below). However, Kalman filter
> values are cheaper to compute and don't require the
> maintenance of historical usage data.
Linux dearly needs this. Please separate out this part
of the patch and send it in.
Right now, Linux does not report the recent CPU usage
of a process. The UNIX standard requires that "ps"
report this; right now ps substitutes CPU usage over
the whole lifetime of a process.
Both per-task and per-process (tid and tgid) numbers
are needed. Both percent and permill (1/1000) units
get reported, so don't convert to integer percent.
^ permalink raw reply [flat|nested] 66+ messages in thread

* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-26  3:30 [RFC][PATCH] O(1) Entitlement Based Scheduler Albert Cahalan
@ 2004-02-26  6:19 ` Peter Williams
  2004-02-26 17:57   ` Albert Cahalan
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26  6:19 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list, johnl

Albert Cahalan wrote:
> John Lee writes:
>
>>The usage rates for each task are estimated using Kalman
>>filter techniques, the estimates being similar to those
>>obtained by taking a running average over twice the filter
>>_response half life_ (see below). However, Kalman filter
>>values are cheaper to compute and don't require the
>>maintenance of historical usage data.
>
> Linux dearly needs this. Please separate out this part
> of the patch and send it in.

This information can be determined from the SleepAVG: field in the
/proc/<pid>/status and /proc/<tgid>/task/<pid>/status files by
subtracting the value there from 100. Without our patch this value is a
directly calculated estimate of the task's sleep rate, which is
available because it is used by the O(1) scheduler's heuristics. With
our patches, it is calculated from our estimate of the task's usage
because we dispensed with the sleep average calculations as they are no
longer needed. We decided to still report the sleep average in the
status file because we were reluctant to alter the contents of such
files in case we broke user-space programs.

> Right now, Linux does not report the recent CPU usage
> of a process. The UNIX standard requires that "ps"
> report this; right now ps substitutes CPU usage over
> the whole lifetime of a process.
>
> Both per-task and per-process (tid and tgid) numbers
> are needed. Both percent and permill (1/1000) units
> get reported, so don't convert to integer percent.
I think a modification to fs/proc/array.c to make this field a per million rather than a percent value would satisfy your needs. It would be a very small change but there would be concerns about breaking programs that rely on it being a percentage. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 6:19 ` Peter Williams @ 2004-02-26 17:57 ` Albert Cahalan 2004-02-26 23:24 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Albert Cahalan @ 2004-02-26 17:57 UTC (permalink / raw) To: Peter Williams; +Cc: linux-kernel mailing list, johnl On Thu, 2004-02-26 at 01:19, Peter Williams wrote: > Albert Cahalan wrote: >> John Lee writes: >>> The usage rates for each task are estimated using Kalman >>> filter techniques, the estimates being similar to those >>> obtained by taking a running average over twice the filter >>> _response half life_ (see below). However, Kalman filter >>> values are cheaper to compute and don't require the >>> maintenance of historical usage data. >> >> >> Linux dearly needs this. Please separate out this part >> of the patch and send it in. > > This information can be determined from the SleepAVG: field in the > /proc/<pid>/status and /proc/<tgid>/task/<pid>/status files by > subtracting the value there from 100. This doesn't seem to be the case. For example, a fork() causes the value to be adjusted in both child and parent. Also, perhaps the name is wrong, but I'd think SleepAVG has more to do with the average length of a sleep. It sure isn't documented. (time constant? type of decay?) There's also a need for whole-process stats and cumulative (sum of exited children) stats. %CPU can go as high as 51200%. > Without our patch this value is a > directly calculated estimated of the task's sleep rate which is > available because it used by the O(1) scheduler's heuristics. With our > patches, it is calculated from our estimate of the task's usage because > we dispensed with the sleep average calculations as they are no longer > needed. We decided to still report sleep average in the status file > because we were reluctant to alter the contents of such files in case we > broke user space programs. 
Generally this is a good move, though I don't expect anything to be using SleepAVG at the moment. >> Right now, Linux does not report the recent CPU usage >> of a process. The UNIX standard requires that "ps" >> report this; right now ps substitutes CPU usage over >> the whole lifetime of a process. >> >> Both per-task and per-process (tid and tgid) numbers >> are needed. Both percent and permill (1/1000) units >> get reported, so don't convert to integer percent. > > I think a modification to fs/proc/array.c to make this field a per > million rather than a percent value would satisfy your needs. It would > be a very small change but there would be concerns about breaking > programs that rely on it being a percentage. Nothing can rely on it existing at all, so a name change would solve the problem of apps getting confused. BTW, permill is not per-million, it is per-thousand. Per-million or per-billion would be fine as long as it doesn't overflow. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-26 17:57   ` Albert Cahalan
@ 2004-02-26 23:24     ` Peter Williams
  2004-03-01  3:47       ` Peter Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Peter Williams @ 2004-02-26 23:24 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list, johnl

Albert Cahalan wrote:
> On Thu, 2004-02-26 at 01:19, Peter Williams wrote:
>
>>Albert Cahalan wrote:
>>
>>>John Lee writes:
>
>>>>The usage rates for each task are estimated using Kalman
>>>>filter techniques, the estimates being similar to those
>>>>obtained by taking a running average over twice the filter
>>>>_response half life_ (see below). However, Kalman filter
>>>>values are cheaper to compute and don't require the
>>>>maintenance of historical usage data.
>>>
>>>
>>>Linux dearly needs this. Please separate out this part
>>>of the patch and send it in.
>>
>>This information can be determined from the SleepAVG: field in the
>>/proc/<pid>/status and /proc/<tgid>/task/<pid>/status files by
>>subtracting the value there from 100.
>
> This doesn't seem to be the case. For example, a fork()
> causes the value to be adjusted in both child and parent.

This would be the case with our patch as well, as we make children
inherit their parent's usage rate to partially reduce the effect of
ramp-up in the estimation of the child's CPU usage rate.

> Also, perhaps the name is wrong, but I'd think SleepAVG
> has more to do with the average length of a sleep. It sure
> isn't documented. (time constant? type of decay?)

My reading of the code caused me to interpret it as a percentage sleep
rate, i.e. a value of 50 means the task is sleeping 50% of the time.
And this has made me realise that, without our patches, using
(100 - SleepAVG) would not really give you the CPU usage rate but would
instead give the RUNNABILITY rate (i.e. the proportion of time the task
is spending on the cpu OR on a runqueue waiting for cpu access).
It also makes me realise that the SleepAVG our patch reports is NOT
really a sleep rate; it's a sleep OR waiting-on-a-runqueue rate.

> There's also a need for whole-process stats and cumulative
> (sum of exited children) stats. %CPU can go as high as 51200%.
>
>>Without our patch this value is a
>>directly calculated estimated of the task's sleep rate which is
>>available because it used by the O(1) scheduler's heuristics. With our
>>patches, it is calculated from our estimate of the task's usage because
>>we dispensed with the sleep average calculations as they are no longer
>>needed. We decided to still report sleep average in the status file
>>because we were reluctant to alter the contents of such files in case we
>>broke user space programs.
>
> Generally this is a good move, though I don't expect anything
> to be using SleepAVG at the moment.

OK.

>>>Right now, Linux does not report the recent CPU usage
>>>of a process. The UNIX standard requires that "ps"
>>>report this; right now ps substitutes CPU usage over
>>>the whole lifetime of a process.
>>>
>>>Both per-task and per-process (tid and tgid) numbers
>>>are needed. Both percent and permill (1/1000) units
>>>get reported, so don't convert to integer percent.
>>
>>I think a modification to fs/proc/array.c to make this field a per
>>million rather than a percent value would satisfy your needs. It would
>>be a very small change but there would be concerns about breaking
>>programs that rely on it being a percentage.
>
> Nothing can rely on it existing at all, so a name change would
> solve the problem of apps getting confused.
>
> BTW, permill is not per-million, it is per-thousand.
> Per-million or per-billion would be fine as long as
> it doesn't overflow.

OK.
Since SleepAVG does not seem to be entrenched in people's expectations,
and because the values calculated from our usage rates are not really
valid (see above), I propose that we change this field's name to
CPUrate and report the CPU usage rate directly in permill (unless there
are violent objections).

Peter
--
Dr Peter Williams, Chief Scientist              peterw@aurema.com
Aurema Pty Limited                              Tel:+61 2 9698 2322
PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174
79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 23:24 ` Peter Williams @ 2004-03-01 3:47 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-03-01 3:47 UTC (permalink / raw) To: Albert Cahalan; +Cc: johnl, linux-kernel mailing list Peter Williams wrote: > Albert Cahalan wrote: <snip> >> Nothing can rely on it existing at all, so a name change would >> solve the problem of apps getting confused. >> >> BTW, permill is not per-million, it is per-thousand. >> Per-million or per-billion would be fine as long as >> it doesn't overflow. > > > OK. Since SleepAVG does not seem to be entrenched in people's > expectations and because of the fact that the value calculated from our > usage rates are not really valid (see above), I propose that we change > this field's name to CPUrate and report the CPU usage rate directly in > permill (unless there are violent objections). I've analysed the arithmetic and think that we can safely go to 6 decimal places so I'll make this Per-million. The change should appear in the patches to the 2.6.4 kernel when it is available. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <1vBE2-48V-21@gated-at.bofh.it> @ 2004-03-03 21:38 ` Bill Davidsen 0 siblings, 0 replies; 66+ messages in thread From: Bill Davidsen @ 2004-03-03 21:38 UTC (permalink / raw) To: Andi Kleen; +Cc: Linux Kernel Mailing List Andi Kleen wrote: > No doubt that there are different settings that make sense > for different workloads. But there is no reason one has to recompile > to set them - it's much nicer to just run a script at boot time to set > them, instead of recompiling/rebooting. This will also make benchmarking > much easier, because you can just write a script that sets the > various parameters, runs workloads, sets other parameters, runs > workload again etc. Requiring a recompile and reboot makes this > much harder. Andi, if people are trying to find an optimal tuning then in many cases a reboot is out. There are two reasons for this: - a production server, can't just reboot! - it's sometimes hard to recreate the load which is causing problems, and far easier to get a working config by diddling and watching. At least those are the reasons why I would feel able to tune the machines which most need it. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <40426E1C.8010806@aurema.com.suse.lists.linux.kernel> @ 2004-03-03 2:48 ` Andi Kleen 2004-03-03 3:45 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Andi Kleen @ 2004-03-03 2:48 UTC (permalink / raw) To: Peter Williams; +Cc: linux-kernel, johnl Peter Williams <peterw@aurema.com> writes: One comment on the patches: could you remove the zillions of numerical Kconfig options and just make them sysctls? I don't think it makes any sense to require a reboot to change any of that. And the user is unlikely to have much idea yet on what he wants on them while configuring. I really like the reduced scheduler complexity part of your patch BTW. IMHO the 2.6 scheduler's complexity has gotten out of hand and it's great that someone is going into the other direction with a simple basic design. For more wide spread testing it would be useful if you could do a more minimal less intrusive patch with less configuration (e.g. only allow tuning via nice, not via other means). This would be mainly to test your patch on more workloads without any hand tuning, which is the most important use case. -Andi ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-03-03  2:48 ` Andi Kleen
@ 2004-03-03  3:45   ` Peter Williams
  2004-03-03 10:13     ` Andi Kleen
  2004-03-03 15:57     ` Andi Kleen
  0 siblings, 2 replies; 66+ messages in thread
From: Peter Williams @ 2004-03-03  3:45 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, johnl

Andi Kleen wrote:
> Peter Williams <peterw@aurema.com> writes:
>
> One comment on the patches: could you remove the zillions of numerical
> Kconfig options and just make them sysctls? I don't think it makes any
> sense to require a reboot to change any of that. And the user is
> unlikely to have much idea yet on what he wants on them while
> configuring.

The default initial values should be fine, and the default configuration
allows the scheduling tuning parameters (i.e. half life and time slice)
to be changed on a running system via the /proc file system. These are
mainly there so that different settings can be used with various
benchmarks to determine the best settings for various types of loads.
If good default values that work well for a wide variety of load types
can be found as a result of these experiments then these parameters may
be made constants in the code. If not, they probably should be made
settable via system calls rather than via /proc as you suggest.

In reality, batch type jobs tend to get better throughput with a longer
half life, but a shorter half life gives better interactive response.
So servers and work stations should probably have different settings.

> I really like the reduced scheduler complexity part of your patch BTW.
> IMHO the 2.6 scheduler's complexity has gotten out of hand and it's
> great that someone is going into the other direction with a simple
> basic design.

Thanks, we felt much the same. With a heuristic approach, one more case
that needs to be handled specially always seems to pop up.
> > For more wide spread testing it would be useful if you could do > a more minimal less intrusive patch with less configuration > (e.g. only allow tuning via nice, not via other means). This would > be mainly to test your patch on more workloads without any hand tuning, > which is the most important use case. The "base" patch does this except that it also allows the setting of soft caps via /proc. But, as I said above, the main reason for the tuning parameters being exposed (in the full patch) at this time is to encourage testing with different values (of half life and time slice) and making them settable via /proc makes this possible without having to reboot the system. Except for (possibly) renicing the X server there should be no need to fiddle with settings for individual tasks. Peter PS We are looking at some simple modifications to further improve interactive response (hopefully) without adding to complexity. -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-03 3:45 ` Peter Williams @ 2004-03-03 10:13 ` Andi Kleen 2004-03-03 23:46 ` Peter Williams 2004-03-03 15:57 ` Andi Kleen 1 sibling, 1 reply; 66+ messages in thread From: Andi Kleen @ 2004-03-03 10:13 UTC (permalink / raw) To: Peter Williams; +Cc: Andi Kleen, linux-kernel, johnl On Wed, Mar 03, 2004 at 02:45:28PM +1100, Peter Williams wrote: > Andi Kleen wrote: > >Peter Williams <peterw@aurema.com> writes: > > > >One comment on the patches: could you remove the zillions of numerical > >Kconfig > >options and just make them sysctls? I don't think it makes any sense > >to require a reboot to change any of that. And the user is unlikely > >to have much idea yet on what he wants on them while configuring. > > The default initial values should be fine and the default configuration > allows the scheduling tuning parameters (i.e. half life and time slice > ) to be changed on a running system via the /proc file system. > These are mainly there so that different settings can be used with > various benchmarks to determine what are the best settings for various > types of loads. If good default values that work well for a wide > variety of load types can be found as a result of these experiments then > these parameters may be made constants in the code. If not they > probably should be made settable via system calls rather than via /proc > as you suggest. No doubt that there are different settings that make sense for different workloads. But there is no reason one has to recompile to set them - it's much nicer to just run a script at boot time to set them, instead of recompiling/rebooting. This will also make benchmarking much easier, because you can just write a script that sets the various parameters, runs workloads, sets other parameters, runs workload again etc. Requiring a recompile and reboot makes this much harder. 
Also I suspect most people who are not heavily into scheduling theory will go with the defaults at compile time, so it's important that they already work well. And consider it from the side of a standard distribution: users don't normally recompile their kernels there, so everything that makes sense to be tunable should be settable without recompiling. IMHO CONFIG_* should not be used for anything except for kernel binary size tuning and possible compiler tuning. -Andi ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-03 10:13 ` Andi Kleen @ 2004-03-03 23:46 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-03-03 23:46 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel, johnl Andi Kleen wrote: > On Wed, Mar 03, 2004 at 02:45:28PM +1100, Peter Williams wrote: > >>Andi Kleen wrote: >> >>>Peter Williams <peterw@aurema.com> writes: >>> >>>One comment on the patches: could you remove the zillions of numerical >>>Kconfig >>>options and just make them sysctls? I don't think it makes any sense >>>to require a reboot to change any of that. And the user is unlikely >>>to have much idea yet on what he wants on them while configuring. >> >>The default initial values should be fine and the default configuration >>allows the scheduling tuning parameters (i.e. half life and time slice >> ) to be changed on a running system via the /proc file system. >>These are mainly there so that different settings can be used with >>various benchmarks to determine what are the best settings for various >>types of loads. If good default values that work well for a wide >>variety of load types can be found as a result of these experiments then >>these parameters may be made constants in the code. If not they >>probably should be made settable via system calls rather than via /proc >>as you suggest. > > > No doubt that there are different settings that make sense > for different workloads. But there is no reason one has to recompile > to set them - it's much nicer to just run a script at boot time to set > them, instead of recompiling/rebooting. This will also make benchmarking > much easier, because you can just write a script that sets the > various parameters, runs workloads, sets other parameters, runs > workload again etc. Requiring a recompile and reboot makes this > much harder. 
As I said (with the full patch) these values can be set via /proc on a running system so there's no need for a recompile and reboot. > > Also I suspect most people who are not heavily into scheduling > theory will go with the defaults at compile time, so it's important > that they already work well. > > And consider it from the side of a standard distribution: users > don't normally recompile their kernels there, so everything that > makes sense to be tunable should be settable without recompiling. > > IMHO CONFIG_* should not be used for anything except for kernel binary > size tuning and possible compiler tuning. Yes, once this patch has stabilised we will probably remove some (or all) of the configuration variables. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-03 3:45 ` Peter Williams 2004-03-03 10:13 ` Andi Kleen @ 2004-03-03 15:57 ` Andi Kleen 2004-03-04 0:41 ` Peter Williams 1 sibling, 1 reply; 66+ messages in thread From: Andi Kleen @ 2004-03-03 15:57 UTC (permalink / raw) To: Peter Williams; +Cc: linux-kernel, johnl On Wed, 03 Mar 2004 14:45:28 +1100 Peter Williams <peterw@aurema.com> wrote: > Andi Kleen wrote: > > Peter Williams <peterw@aurema.com> writes: > > > > One comment on the patches: could you remove the zillions of numerical Kconfig > > options and just make them sysctls? I don't think it makes any sense > > to require a reboot to change any of that. And the user is unlikely > > to have much idea yet on what he wants on them while configuring. > > The default initial values should be fine and the default configuration > allows the scheduling tuning parameters (i.e. half life and time slice > ) to be changed on a running system via the /proc file system. I'm running the 2.6.3-full patch on my workstation now. No tuning applied at all. I reniced the X server to -10. When I have two kernel compiles (without any -j*) running there is a visible (=not really slow, but long enough to notice something) delay in responses while typing something in a xterm. In sylpheed there is the same issue. The standard scheduler didn't show this that extreme with only two compiles. -Andi ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-03 15:57 ` Andi Kleen @ 2004-03-04 0:41 ` Peter Williams 2004-03-05 3:55 ` Andi Kleen 0 siblings, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-03-04 0:41 UTC (permalink / raw) To: Andi Kleen; +Cc: johnl, linux-kernel Andi Kleen wrote: > On Wed, 03 Mar 2004 14:45:28 +1100 > Peter Williams <peterw@aurema.com> wrote: > > >>Andi Kleen wrote: >> >>>Peter Williams <peterw@aurema.com> writes: >>> >>>One comment on the patches: could you remove the zillions of numerical Kconfig >>>options and just make them sysctls? I don't think it makes any sense >>>to require a reboot to change any of that. And the user is unlikely >>>to have much idea yet on what he wants on them while configuring. >> >>The default initial values should be fine and the default configuration >>allows the scheduling tuning parameters (i.e. half life and time slice >> ) to be changed on a running system via the /proc file system. > > > I'm running the 2.6.3-full patch on my workstation now. No tuning applied > at all. I reniced the X server to -10. When I have two kernel compiles (without any -j*) > running there is a visible (=not really slow, but long enough to notice something) > delay in responses while typing something in a xterm. In sylpheed there > is the same issue. > > The standard scheduler didn't show this that extreme with only two compiles. > Thanks for the feedback. We're looking at some minor modifications to try and improve this issue. BTW Could you try it with the X server reniced to -15? Thanks Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-04 0:41 ` Peter Williams @ 2004-03-05 3:55 ` Andi Kleen 0 siblings, 0 replies; 66+ messages in thread From: Andi Kleen @ 2004-03-05 3:55 UTC (permalink / raw) To: Peter Williams; +Cc: johnl, linux-kernel On Thu, 04 Mar 2004 11:41:00 +1100 Peter Williams <peterw@aurema.com> wrote: > BTW Could you try it with the X server reniced to -15? It seems to be a bit better with -15 (or maybe I'm imagining that), but still noticeable. -Andi ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <fa.cvc8vnj.ahebjd@ifi.uio.no> @ 2004-03-01 9:18 ` Joachim B Haga 2004-03-01 10:18 ` Paul Wagland 0 siblings, 1 reply; 66+ messages in thread From: Joachim B Haga @ 2004-03-01 9:18 UTC (permalink / raw) To: Peter Williams; +Cc: Joachim B Haga, Timothy Miller, linux-kernel Peter Williams <peterw@aurema.com> writes: >> It seems to me that much of this could be solved if the user *were* >> allowed to lower nice values (down to 0). [snip] >> to 10 (normal) to 20. Negative values could still be root-only. So >> why shouldn't this be possible? Because a greedy user in a >> multi-user system would just run everything at max prio thus >> defeating the purpose? Sure, that would be annoying but it would >> have another solution ie. an entitlement based scheduler or >> something. > More importantly it would allow ordinary users to override root's > settings e.g. if (for whatever reason) the sysadmin decided to > renice a task to 19 (say) this modification would allow the owner of > the task to renice it back to zero. This is the reason that it > isn't be allowed. "You dirty cracker! A renice +19, that'll teach you!" :-) Seriously though, the same is true today, it's just a bit more cumbersome. Restart the task and you're back to 0. If the sysadmin wants to stop that, he'll renice your shell. In which case you login again. And so on. My point is that this is a problem (annoying user) which has better solutions (ranging from a polite e-mail to deluser) because renice won't stop him. And it's not a *security* concern, as long as the lower values are still reserved. I would say the benefit is very small (I mean: who has ever relied on it?) compared to the difficulties created for users. > Peter Joachim ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 9:18 ` Joachim B Haga @ 2004-03-01 10:18 ` Paul Wagland 2004-03-01 19:11 ` Mike Fedyk 0 siblings, 1 reply; 66+ messages in thread From: Paul Wagland @ 2004-03-01 10:18 UTC (permalink / raw) To: Joachim B Haga; +Cc: Peter Williams, Timothy Miller, linux-kernel On Mon, 2004-03-01 at 10:18, Joachim B Haga wrote: > Peter Williams <peterw@aurema.com> writes: > > >> It seems to me that much of this could be solved if the user *were* > >> allowed to lower nice values (down to 0). > [snip] > >> to 10 (normal) to 20. Negative values could still be root-only. So > >> why shouldn't this be possible? Because a greedy user in a > > > More importantly it would allow ordinary users to override root's > > settings e.g. if (for whatever reason) the sysadmin decided to > And it's not a *security* concern, as long as the lower values are > still reserved. > > I would say the benefit is very small (I mean: who has ever relied on > it?) compared to the difficulties created for users. Under Linux, I can't say, but certainly on my old school machine (~10 years ago) all student accounts would run at +5, all staff accounts would run at +0. This was handled by the login process, so re-logging in would not help you at all.... Cheers, Paul ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 10:18 ` Paul Wagland @ 2004-03-01 19:11 ` Mike Fedyk 0 siblings, 0 replies; 66+ messages in thread From: Mike Fedyk @ 2004-03-01 19:11 UTC (permalink / raw) To: Paul Wagland; +Cc: Joachim B Haga, Peter Williams, Timothy Miller, linux-kernel Paul Wagland wrote: > On Mon, 2004-03-01 at 10:18, Joachim B Haga wrote: > >>Peter Williams <peterw@aurema.com> writes: >> >> >>>>It seems to me that much of this could be solved if the user *were* >>>>allowed to lower nice values (down to 0). >> >>[snip] >> >>>>to 10 (normal) to 20. Negative values could still be root-only. So >>>>why shouldn't this be possible? Because a greedy user in a >> >> >> >>>More importantly it would allow ordinary users to override root's >>>settings e.g. if (for whatever reason) the sysadmin decided to > > >>And it's not a *security* concern, as long as the lower values are >>still reserved. >> >>I would say the benefit is very small (I mean: who has ever relied on >>it?) compared to the difficulties created for users. > > > Under Linux, I can't say, but certainly on my old school machine (~10 > years ago) all student accounts would run at +5, all staff accounts > would run at +0. This was handled by the login process, so re-logging in > would not help you at all.... I think you can do this with pam or login under linux. I know I did something like this, but since most of my users were samba users, it wasn't very useful at the time ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] <fa.jgj0bdi.b3u6qk@ifi.uio.no> @ 2004-03-01 1:54 ` Andy Lutomirski 2004-03-01 2:54 ` Peter Williams 2004-03-02 23:36 ` Peter Williams 0 siblings, 2 replies; 66+ messages in thread From: Andy Lutomirski @ 2004-03-01 1:54 UTC (permalink / raw) To: John Lee; +Cc: linux-kernel

How hard would it be to make shares hierarchical? For example (quoted names are just descriptive):

  "guaranteed" (10 shares)         "user" (5 shares)
            |                              |
    -----------------              -----------------
    |               |              |               |
 "root" (1)   "apache" (2)     "bob" (5)      "fred" (5)
    |               |              |               |
(more groups?) (web servers)      etc.            etc.

This way one user is prevented from taking unfair CPU time by launching too many processes, apache gets enough time no matter what, etc. In this scheme, numbers of shares would only be comparable if they are children of the same node. Also, it now becomes safe to let users _increase_ priorities of their processes -- it doesn't affect anyone else.

Ignoring limits, this should be just an exercise in keeping track of shares and eliminating the 1/420 limit in precision. It would take some thought to figure out what nice should do.

Also, could interactivity problems be solved something like this:

    prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5

"i" would be a per-process interactivity factor (normally 1, but higher for interactive processes) which would only boost them when their CPU usage is low. This makes interactive processes get their timeslices early (very high priority at low CPU consumption) but prevents abuse by preventing excessive CPU consumption. This could even be set by the (untrusted) process itself.

I imagine that these two together would nicely solve most interactivity and fairness issues -- the former prevents starvation by other users and the latter prevents latency caused by large numbers of CPU-light tasks.

Is this sane? And does it break the O(1) promotion algorithm?

--Andy

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 1:54 ` Andy Lutomirski @ 2004-03-01 2:54 ` Peter Williams 2004-03-01 3:46 ` Andy Lutomirski 2004-03-02 23:36 ` Peter Williams 1 sibling, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-03-01 2:54 UTC (permalink / raw) To: Andy Lutomirski; +Cc: John Lee, linux-kernel Andy Lutomirski wrote: > How hard would it be to make shares hierarchical? For example (quoted > names are just descriptive):
>
>   "guaranteed" (10 shares)         "user" (5 shares)
>             |                              |
>     -----------------              -----------------
>     |               |              |               |
>  "root" (1)   "apache" (2)     "bob" (5)      "fred" (5)
>     |               |              |               |
> (more groups?) (web servers)      etc.            etc.
>
> This way one user is prevented from taking unfair CPU time by launching > too many processes, apache gets enough time no matter what, etc. In > this scheme, numbers of shares would only be comparable if they are > children of the same node. Also, it now becomes safe to let users > _increase_ priorities of their processes -- it doesn't affect anyone else. > > Ignoring limits, this should be just an exercise in keeping track of > shares and eliminating the 1/420 limit in precision. It would take some > thought to figure out what nice should do. > As Peter Chubb has stated such control is possible and is available on Tru64, Solaris and Windows with Aurema's (<http://www.aurema.com>) ARMTech product. The CKRM project also addresses this issue. > > Also, could interactivity problems be solved something like this: > > prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5 > > "i" would be a per-process interactivity factor (normally 1, but higher > for interactive processes) which would only boost them when their CPU > usage is low. This makes interactive processes get their timeslices > early (very high priority at low CPU consumption) but prevents abuse by > preventing excessive CPU consumption. This could even be set by the > (untrusted) process itself.
Interactive processes do very well under EBS without any special treatment. Programs such as xmms aren't really interactive processes although they usually have a very low CPU usage rate like interactive processes. What distinguishes them is their need for REGULAR access to the CPU. It's unlikely that such a modification would help with the need for regularity. Once again I'll stress that in order to cause xmms to skip we had to (on a single CPU machine) run a kernel build with -j 16 which causes a system load well in excess of 10 and is NOT a normal load. Under normal loads xmms performs OK. > > I imagine that these two together would nicely solve most interactivity and fairness issues -- the former prevents starvation by other users and the latter prevents latency caused by large numbers of CPU-light tasks. > > > Is this sane? Yes. Fairness between users rather than between tasks is a sane desire but beyond the current scope of EBS. > And does it break the O(1) promotion algorithm? No, it would not break the O(1) promotion algorithm. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 2:54 ` Peter Williams @ 2004-03-01 3:46 ` Andy Lutomirski 2004-03-01 4:18 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Andy Lutomirski @ 2004-03-01 3:46 UTC (permalink / raw) To: Peter Williams; +Cc: Andy Lutomirski, John Lee, linux-kernel Peter Williams wrote: > Andy Lutomirski wrote: > >> How hard would it be to make shares hierarchical? For example (quoted >> names are just descriptive): > > As Peter Chubb has stated such control is possible and is available on > Tru64, Solaris and Windows with Aurema's (<http://www.aurema.com>) > ARMTech product. The CKRM project also addresses this issue. Cool. I hadn't realized ARMTech did that, and I haven't fully read up on CKRM. > >> >> Also, could interactivity problems be solved something like this: >> >> prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5 >> >> "i" would be a per-process interactivity factor (normally 1, but >> higher for interactive processes) which would only boost them when >> their CPU usage is low. This makes interactive processes get their >> timeslices early (very high priority at low CPU consumption) but >> prevents abuse by preventing excessive CPU consumption. This could >> even be set by the (untrusted) process itself. >> > > Interactive processes do very well under EBS without any special treatment. > > Programs such as xmms aren't really interactive processes although they > usually have a very low CPU usage rate like interactive processes. What > distinguishes them is their need for REGULAR access to the CPU. It's > unlikely that such a modification would help with the need for regularity. I'm guessing the reason make -j16 broke it is because it (i.e. make) spawns lots of processes frequently. Since make probably uses almost no CPU under this load, its usage per share stays near zero.
The problem is that there are a bunch of these children, and, if enough start at once, they all run before xmms, and the audio buffer underruns. My approach would give xmms a better chance of running sooner (in fact, it would potentially have better priority than any non-interactive task until it started hogging CPU). > > Once again I'll stress that in order to cause xmms to skip we had to (on > a single CPU machine) run a kernel build with -j 16 which causes a > system load well in excess of 10 and is NOT a normal load. Under normal > loads xmms performs OK. > >> >> I imagine that these two together would nicely solve most >> interactivity and fairness issues -- the former prevents starvation by >> other users and the latter prevents latency caused by large numbers of >> CPU-light tasks. >> >> >> Is this sane? > > > Yes. Fairness between users rather than between tasks is a sane desire > but beyond the current scope of EBS. I have this strange masochistic desire to implement this. Don't expect patches any time soon -- it would be my first time playing with the scheduler ;) > Peter --Andy ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 3:46 ` Andy Lutomirski @ 2004-03-01 4:18 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-03-01 4:18 UTC (permalink / raw) To: Andy Lutomirski; +Cc: Andy Lutomirski, John Lee, linux-kernel Andy Lutomirski wrote: > Peter Williams wrote: > >> Andy Lutomirski wrote: >> >>> How hard would it be to make shares hierarchial? For example (quoted >>> names are just descriptive): >> >> >> As Peter Chubb has stated such control is possible and is available on >> Tru64, Solaris and Windows with Aurema's (<http://www.aurema.com>) >> ARMTech product. The CKRM project also addresses this issue. > > > Cool. I hadn't realized ARMTech did that, and I haven't fully read up > on CKRM. > >> >>> >>> Also, could interactivity problems be solved something like this: >>> >>> prio = ( (old EBS usage ratio) - 0.5 ) * i + 0.5 >>> >>> "i" would be a per-process interactivity factor (normally 1, but >>> higher for interactive processes) which would only boost them when >>> their CPU usage is low. This makes interactive processes get their >>> timeslices early (very high priority at low CPU consumption) but >>> prevents abuse by preventing excessive CPU consumption. This could >>> even by set by the (untrusted) process itself. >>> >> >> Interactive processes do very well under EBS without any special >> treatment. >> >> Programs such as xmms aren't really interactive processes although >> they usually have a very low CPU usage rate like interactive >> processes. What distinguishes them is their need for REGULAR access >> to the CPU. It's unlikely that such a modification would help with >> the need for regularity. > > > I'm guessing the reason make -j16 broke it is because it (i.e. make) > spawns lots of processes frequenly. Since make probably uses almost no > CPU under this load, its usage per share stays near zero. 
> As long as it is below that of xmms, then all of its children are too, at least until > part-way into their first timeslices. The problem is that there are a > bunch of these children, and, if enough start at once, they all run > before xmms, and the audio buffer underruns. My approach would give > xmms a better chance of running sooner (in fact, it would potentially > have better priority than any non-interactive task until it started > hogging CPU).

That's close to the reason. The real culprit is the dreaded "ramp up" which is our tag for the fact that it takes a short while (dependent on half life) for the scheduler to estimate the usage rate of a new process (or a sudden change in the usage rate of an existing process) so they get treated as low usage tasks for the first part of their life. In most cases this is a good thing as it causes commands run from the command line to get a boost and since a lot of these are very short lived (e.g. ls) they run very quickly and have very good response times. In the case of a kernel build there are lots of C compiler tasks being launched and these run for several seconds but for the first few moments of their lives (during ramp up) they look like low CPU usage tasks and get high priority. This effect is short lived and doesn't affect normal interactive programs because regularity of access isn't important to them and the estimated usages of the C compiler tasks quickly exceed the estimated usages of the interactive tasks. It only has an effect on xmms because regularity of access is important to it (i.e. xmms IS getting sufficient CPU but isn't always getting it as regularly as it likes). The X server is a different case. Although it isn't an interactive program it does have an influence on interactive responsiveness when X windows is being used.
Unfortunately, because it is generally serving a number of clients, its CPU usage can be quite high and it takes a little longer for the estimated usage rate of the C compiler tasks to exceed its estimated usage. For this reason, we recommend that sysadmins arrange for the X server to be run at somewhere between nice -9 and nice -15. It should be obvious from the above that another way that interactive response can be improved is by shortening the half life. But there is a trade off with lowered overall system throughput resulting if the half life is too short. In the current version of EBS, root can change the half life on a running system and this functionality is there so that experiments into the effect of changing the half life on system performance can be conducted without the need to recompile the kernel and reboot the system. The default value of 5 seconds is one that we've found is good for overall performance and interactive response provided that the X server is run with an appropriate level of niceness. > >> >> Once again I'll stress that in order to cause xmms to skip we had to >> (on a single CPU machine) run a kernel build with -j 16 which causes a >> system load well in excess of 10 and is NOT a normal load. Under >> normal loads xmms performs OK. >> >>> >>> I imagine that these two together would nicely solve most >>> interactivity and fairness issues -- the former prevents starvation >>> by other users and the latter prevents latency caused by large >>> numbers of CPU-light tasks. >>> >>> >>> Is this sane? >> >> >> >> Yes. Fairness between users rather than between tasks is a sane >> desire but beyond the current scope of EBS. > > > I have this strange masochistic desire to implement this. 
> Don't expect patches any time soon -- it would be my first time playing with the > scheduler ;)

Good luck

Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 1:54 ` Andy Lutomirski 2004-03-01 2:54 ` Peter Williams @ 2004-03-02 23:36 ` Peter Williams 1 sibling, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-03-02 23:36 UTC (permalink / raw) To: Andy Lutomirski; +Cc: linux-kernel Andy Lutomirski wrote: > <snip> > Ignoring limits, this should be just an exercise in keeping track of > shares and eliminating the 1/420 limit in precision. It would take some > thought to figure out what nice should do. > <snip> I take it from this comment that you would like to see a larger range of shares made available? The current range (1 to 420) was chosen to allow easy mapping between niceness and shares and so that the minimum was roughly the same factor smaller than the default as the maximum was bigger (i.e. twenty times). One of the restrictions on the number of shares is the dynamic range of the representation of real numbers that we use for our calculations. We use fixed denominator rational numbers with a denominator of 2 to the power of 27. This value was chosen because the maximum (real number) value that we have to be able to cope with in our calculations is 19 and we are limited to using 32 bits because we need to do a divide and 64 bit division is not supported in the kernel on all architectures (in particular, the IA32 kernels do not support 64 bit division). The other factors that have to be taken into consideration are the half life and the value of HZ (which varies widely depending on the system). Anyway, I will look at the numbers and see if it's possible to squeeze a larger range of shares in (although it may mean tighter restrictions on half life on some systems).
Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] <894006121@toto.iv> @ 2004-03-01 0:00 ` Peter Chubb 2004-03-02 1:25 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Peter Chubb @ 2004-03-01 0:00 UTC (permalink / raw) To: Paul Jackson; +Cc: Joachim B Haga, peterw, miller, linux-kernel >>>>> "Paul" == Paul Jackson <pj@sgi.com> writes: Paul> Is there anyway to provide a mechanism that would support Paul> administering a system as follows: Paul> 1) Users get so much CPU usage allowed, determined by an upper Paul> limit on a running average of the combined CPU usage of all Paul> their tasks, with a half life perhaps on the order of minutes. Paul> 2) They can nice their tasks up and down, within a decent Paul> range, as they will. Paul> 3) But if they push too close to their allowed limit, all Paul> their tasks get reined in. The relative priorities within their Paul> own tasks are not changed, but the priority of their tasks Paul> relative to other users is weakened. This is exactly what the commercial product ARMtech does. The EBS that Aurema have just released as open source is (a small) part of the commercial product. See http://www.aurema.com Peter C (an ex-employee of Aurema) -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au The technical we do immediately, the political takes *forever* ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-01 0:00 ` Peter Chubb @ 2004-03-02 1:25 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-03-02 1:25 UTC (permalink / raw) To: Peter Chubb; +Cc: Paul Jackson, Joachim B Haga, miller, linux-kernel Peter Chubb wrote: >>>>>>"Paul" == Paul Jackson <pj@sgi.com> writes: > > > > Paul> Is there anyway to provide a mechanism that would support > Paul> administering a system as follows: > > Paul> 1) Users get so much CPU usage allowed, determined by an upper > Paul> limit on a running average of the combined CPU usage of all > Paul> their tasks, with a half life perhaps on the order of minutes. > > Paul> 2) They can nice their tasks up and down, within a decent > Paul> range, as they will. > > Paul> 3) But if they push too close to their allowed limit, all > Paul> their tasks get reined in. The relative priorities within their > Paul> own tasks are not changed, but the priority of their tasks > Paul> relative to other users is weakened. > > This is exactly what the commercial product ARMtech does. The EBS > that Aurema have just released as open source is (a small) part of the > commercial product. Not strictly speaking a part of our commercial product but based on the CPU scheduling technology in that product. As you know, the CPU scheduler in the kernel based versions of our product relies on a generic "plug in scheduler" interface being present in the host kernel and runtime loadable kernel modules plug into that interface and take over scheduling. This mechanism (while allowing our product to work on a number of different host operating systems) has the disadvantage that it adds some overhead to the scheduler (this is in addition to the extra overhead involved in hierarchical scheduling) and places some restrictions on the scheduler design as it cannot make too many assumptions about the underlying scheduler in the host operating system. 
When the technology is used to build an embedded scheduler (as is the case with EBS) significant improvements in efficiency can be realised. So EBS is a souped up, non hierarchical version of our CPU scheduler designed specially for Linux. > > See http://www.aurema.com > > Peter C (an ex-employee of Aurema) > > -- > Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au > The technical we do immediately, the political takes *forever* Cheers Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no> @ 2004-02-29 11:58 ` Joachim B Haga 2004-02-29 20:39 ` Paul Jackson 2004-02-29 22:56 ` Peter Williams 0 siblings, 2 replies; 66+ messages in thread From: Joachim B Haga @ 2004-02-29 11:58 UTC (permalink / raw) To: Peter Williams; +Cc: Timothy Miller, linux-kernel Peter Williams <peterw@aurema.com> writes: >>> They already do e.g. renice is such a program. >> No one's talking about LOWERING priority here. You can only DoS >> someone else if you can set negative nice values, and non-root >> can't do that. > > Which is why root has to be in control of the mechanism. It seems to me that much of this could be solved if the user *were* allowed to lower nice values (down to 0). Right now the only way I can prioritize between my own processes is by starting important/timing sensitive programs normally and everything else reniced. The problem is that the first category consists of one or two programs while the second category is, well, "everything else". I would *love* to be able to start the window manager and all children at +10 and be able to adjust priorities, from 0 (important user-level) to 10 (normal) to 20. Negative values could still be root-only. So why shouldn't this be possible? Because a greedy user in a multi-user system would just run everything at max prio thus defeating the purpose? Sure, that would be annoying but it would have another solution ie. an entitlement based scheduler or something. (and isn't it this simple?)

--- linux-2.6.3-mm3/kernel/sys.c.orig	2004-02-29 12:58:45.000000000 +0100
+++ linux-2.6.3-mm3/kernel/sys.c	2004-02-29 12:59:20.000000000 +0100
@@ -276,7 +276,7 @@
 			error = -EPERM;
 			goto out;
 		}
-		if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
+		if (niceval < 0 && !capable(CAP_SYS_NICE)) {
 			error = -EACCES;
 			goto out;
 		}

Regards, Joachim B Haga ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-29 11:58 ` Joachim B Haga @ 2004-02-29 20:39 ` Paul Jackson 2004-02-29 22:56 ` Peter Williams 1 sibling, 0 replies; 66+ messages in thread From: Paul Jackson @ 2004-02-29 20:39 UTC (permalink / raw) To: Joachim B Haga; +Cc: peterw, miller, linux-kernel Seems like we are trying to manage something worth managing, which is how much of a system's CPU capacity is consumed by all of a given user's tasks over periods of minutes, by micromanaging the scheduling of individual tasks over periods of ticks. We don't manage disk space by telling someone no file bigger than 1 megabyte. Rather they get an upper limit on all their files combined. If they want to spend most of that on one file, that's fine. Is there any way to provide a mechanism that would support administering a system as follows:

1) Users get so much CPU usage allowed, determined by an upper limit on a running average of the combined CPU usage of all their tasks, with a half life perhaps on the order of minutes.

2) They can nice their tasks up and down, within a decent range, as they will.

3) But if they push too close to their allowed limit, all their tasks get reined in. The relative priorities within their own tasks are not changed, but the priority of their tasks relative to other users is weakened.

-- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.650.933.1373 ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-29 11:58 ` Joachim B Haga 2004-02-29 20:39 ` Paul Jackson @ 2004-02-29 22:56 ` Peter Williams 1 sibling, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-02-29 22:56 UTC (permalink / raw) To: Joachim B Haga; +Cc: Timothy Miller, linux-kernel Joachim B Haga wrote: > Peter Williams <peterw@aurema.com> writes: > > >>>>They already do e.g. renice is such a program. >>> >>>No one's talking about LOWERING priority here. You can only DoS >>>someone else if you can set negative nice values, and non-root >>>can't do that. >> >>Which is why root has to be in control of the mechanism. > > > It seems to me that much of this could be solved if the user *were* > allowed to lower nice values (down to 0). > > Right now the only way I can prioritize between my own processes is by > starting important/timing sensitive programs normally and everything > else reniced. The problem is that the first category consists of one > or two programs while the second category is, well, "everything else". > > I would *love* to be able to start the window manager and all children > at +10 and be able to adjust priorities, from 0 (important user-level) > to 10 (normal) to 20. Negative values could still be root-only. > > So why shouldn't this be possible? Because a greedy user in a > multi-user system would just run everything at max prio thus defeating > the purpose? Sure, that would be annoying but it would have another > solution ie. an entitlement based scheduler or something. More importantly it would allow ordinary users to override root's settings e.g. if (for whatever reason) the sysadmin decided to renice a task to 19 (say) this modification would allow the owner of the task to renice it back to zero. This is the reason that it isn't allowed.
Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <403E2929.2080705@tmr.com> @ 2004-02-27 3:44 ` Rik van Riel 2004-02-28 21:27 ` Bill Davidsen 0 siblings, 1 reply; 66+ messages in thread From: Rik van Riel @ 2004-02-27 3:44 UTC (permalink / raw) To: Bill Davidsen; +Cc: Kernel Mailing List On Thu, 26 Feb 2004, Bill Davidsen wrote: > I disagree. > It would be nice to have the scheduler identify processes which > interface to user information devices, but it must be done in a way > which doesn't open gaping security or misuse holes. You seem to disagree only with what you think you read, not with what the code does. Please read the actual code, since it seems to do what you propose. Rik -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-27 3:44 ` Rik van Riel @ 2004-02-28 21:27 ` Bill Davidsen 2004-02-28 23:55 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Bill Davidsen @ 2004-02-28 21:27 UTC (permalink / raw) To: Rik van Riel; +Cc: Kernel Mailing List On Thu, 26 Feb 2004, Rik van Riel wrote: > On Thu, 26 Feb 2004, Bill Davidsen wrote: > > > I disagree. > > > It would be nice to have the scheduler identify processes which > > interface to user information devices, but it must be done in a way > > which doesn't open gaping security or misuse holes. > > You seem to disagree only with what you think you read, > not with what the code does. Please read the actual > code, since it seems to do what you propose. I disagree with the paragraph preceding my comment, which you removed to take what I said out of context. And I still disagree. I "think I read" that just fine, although it may not correctly summarize the implementation of the code. In any case, as long as the code provides the protection against letting users change priorities to hog resources I don't disagree with that. Experience has shown that people WILL abuse any mechanism which gives them an unfair share of a shared system. For home systems that's less important, obviously. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-28 21:27 ` Bill Davidsen @ 2004-02-28 23:55 ` Peter Williams 2004-03-04 21:08 ` Timothy Miller 0 siblings, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-02-28 23:55 UTC (permalink / raw) To: Bill Davidsen; +Cc: Rik van Riel, Kernel Mailing List Bill Davidsen wrote: > On Thu, 26 Feb 2004, Rik van Riel wrote: > > >>On Thu, 26 Feb 2004, Bill Davidsen wrote: >> >> >>>I disagree. >> >>>It would be nice to have the scheduler identify processes which >>>interface to user information devices, but it must be done in a way >>>which doesn't open gaping security or misuse holes. >> >>You seem to disagree only with what you think you read, >>not with what the code does. Please read the actual >>code, since it seems to do what you propose. > > > I disagree with the paragraph preceding my comment, which you removed to > take what I said out of context. And I still disagree. I "think I read" > that just fine, although it may not correctly summarize the implementation > of the code. > > In any case, as long as the code provides the protection against letting > users change priorities to hog resources I don't disagree with that. > Experience has shown that people WILL abuse any mechanism which gives them > an unfair share of a shared system. For home systems that's less > important, obviously. > The O(1) Entitlement Based Scheduler places the equivalent restrictions on setting task attributes (i.e. shares and caps) as are placed on using nice and renice. I.e. ordinary users can only change settings on their own processes and only if the change is more restrictive than the current setting. In particular, they cannot increase a task's shares, only decrease them; they can impose or reduce a cap but not release or increase it; and they can change a soft cap to a hard cap but cannot change a hard cap to a soft cap. Additionally, only root can change the scheduler's tuning parameters.
I hope this alleviates your concerns, Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-28 23:55 ` Peter Williams @ 2004-03-04 21:08 ` Timothy Miller 0 siblings, 0 replies; 66+ messages in thread From: Timothy Miller @ 2004-03-04 21:08 UTC (permalink / raw) To: Peter Williams; +Cc: Bill Davidsen, Rik van Riel, Kernel Mailing List Peter Williams wrote: > > The O(1) Entitlement Based Scheduler places the equivalent restrictions > on setting task attributes (i.e. shares and caps) as are placed on using > nice and renice. I.e. ordinary users can only change settings on their > own processes and only if the change is more restricting than the > current setting. In particular, they cannot increase a task's shares > only decrease them, they can impose or reduce a cap but not release or > increase it and they can change a soft cap to a hard cap but cannot > change a hard cap to a soft cap. > > Additionally, only root can change the scheduler's tuning parameters. > > I hope this alleviates your concerns, I, for one, never had any such concerns. My concern was about the unprivileged user being unable to run certain applications under load without prior approval. Two philosophical points: 1) Perhaps we are trying too hard to please everyone. As Linus said, perfect is the enemy of good. A good scheduler won't work perfectly for everyone's application, but it will work very well for the most important ones. Perhaps people writing schedulers should compete based on overall throughput and latency, rather than on how well it runs xmms (and other such apps). 2) Perhaps certain apps like xmms are 'broken' and can be rewritten to behave better with the new scheduler.
For instance, more buffering, separating the mp3 decoding thread from the thread that feeds /dev/audio, more efficient decoder, a decoder that voluntarily sleeps when it's 'done enough', so that it doesn't get knocked down to a lower priority, a decoder that 'cheats' on audio quality just to maintain low CPU usage when it finds itself being preempted, etc. It bears mentioning that many applications work well with 2.4 because they evolved to work well with the 2.4 scheduler. The 2.6 scheduler is different. We shouldn't constrain 2.6 for the sake of old apps. Those old apps should be rewritten to adapt to the new environment. "Working well under 2.6" doesn't require any more adaptation than with 2.4, but it does require _different_ adaptation. This isn't to speak negatively of Con and Nick and others who have attempted to improve upon the 2.6 scheduler. If they can make old apps work well without impacting the potential that new apps can get out of 2.6, then more power to them! ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <1tfy0-7ly-29@gated-at.bofh.it>]
[parent not found: <1thzJ-A5-13@gated-at.bofh.it>]
[parent not found: <1tjrN-2m5-1@gated-at.bofh.it>]
[parent not found: <1tjLa-2Ab-9@gated-at.bofh.it>]
[parent not found: <1tlaf-3OY-11@gated-at.bofh.it>]
[parent not found: <1tljX-3Wf-5@gated-at.bofh.it>]
[parent not found: <1tznd-CP-35@gated-at.bofh.it>]
[parent not found: <1tzQe-10s-25@gated-at.bofh.it>]
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <1tzQe-10s-25@gated-at.bofh.it> @ 2004-02-26 20:14 ` Bill Davidsen 0 siblings, 0 replies; 66+ messages in thread From: Bill Davidsen @ 2004-02-26 20:14 UTC (permalink / raw) To: Mike Fedyk; +Cc: Linux Kernel Mailing List Mike Fedyk wrote: > Shailabh Nagar wrote: > >>>> Mike Fedyk wrote: >>>> >>>>> Better would be to have the kernel tell the daemon whenever a >>>>> process in exec-ed, and you have simplicity in the kernel, and >>>>> policy in user space. >> >> >> >> >> As it turns out, one can still use a fairly simple in-kernel module >> which provides a *mechanism* for effectively changing a process' >> entitlement while retaining the policy component in userland. > > > How much code could be removed if CKRM triggered a userspace process to > perform the operations required? One other interesting question is what would happen if the userspace program didn't run, died, etc. Or set some ill-behaved other user program to a higher priority and the other program did a DoS (intentional or not)? I don't like the whole idea, but I like it even less with a user program requiring context switches on scheduling. ^ permalink raw reply [flat|nested] 66+ messages in thread
[parent not found: <fa.f12rt3d.c0s9rt@ifi.uio.no>]
[parent not found: <fa.ishajoq.q5g90m@ifi.uio.no>]
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler [not found] ` <fa.ishajoq.q5g90m@ifi.uio.no> @ 2004-02-25 23:33 ` Junio C Hamano 2004-02-26 8:15 ` Catalin BOIE 0 siblings, 1 reply; 66+ messages in thread From: Junio C Hamano @ 2004-02-25 23:33 UTC (permalink / raw) To: John Lee; +Cc: linux-kernel >>>>> "JL" == John Lee <johnl@aurema.com> writes: JL> On Wed, 25 Feb 2004, Timothy Miller wrote: >> Even for those who do, they're not going to want to have to >> renice xmms every time they run it. Furthermore, it seems like a bad >> idea to keep marking more and more programs as suid root just so that >> they can boost their priority. I seem to recall reading here that xmms attempts to grab SCHED_RR if possible. If that is indeed the case, I suspect that the above suggestion to run xmms as root does not let the user exploit the strength of EBS, since, as I understand it, EBS affects SCHED_NORMAL processes, not SCHED_RR processes. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-25 23:33 ` Junio C Hamano @ 2004-02-26 8:15 ` Catalin BOIE 0 siblings, 0 replies; 66+ messages in thread From: Catalin BOIE @ 2004-02-26 8:15 UTC (permalink / raw) To: John Lee; +Cc: linux-kernel Hi! I think Aurema is pushing this more for server use than for workstations. I think it is a great project. I hope it will be included. --- Catalin(ux) BOIE catab@deuroconsult.ro ^ permalink raw reply [flat|nested] 66+ messages in thread
* [RFC][PATCH] O(1) Entitlement Based Scheduler
@ 2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
` (3 more replies)
0 siblings, 4 replies; 66+ messages in thread
From: John Lee @ 2004-02-25 14:35 UTC (permalink / raw)
To: linux-kernel
Hi everyone,
This patch is a modification of the O(1) scheduler that introduces
O(1) entitlement based scheduling for SCHED_NORMAL tasks. The patch aims
to keep the scalability and efficiency of the current scheduler while also
providing:
- Specific allocation of CPU resources amongst tasks
- Scheduling fairness and good interactive response without the need for
heuristics, and
- Reduced scheduler complexity.
The fundamental concept of entitlement based sharing is that each task has an
_entitlement_ to CPU resources that is determined by the number of _shares_
that it holds, and the scheduler allocates CPU to tasks so that the _rate_ at
which they receive CPU time is consistent with their entitlement.
The usage rates for each task are estimated using Kalman filter techniques, the
estimates being similar to those obtained by taking a running average over
twice the filter _response half life_ (see below). However, Kalman filter
values are cheaper to compute and don't require the maintenance of historical
usage data.
The use of CPU usage rates also makes it possible to impose _per task CPU
usage rate caps_. This patch provides both soft and hard CPU usage rate caps per
task. The difference between a hard and soft cap is that a hard cap is
_always_ enforced, whereas a soft cap may be exceeded when unused CPU
cycles are available.
Features of the EBS scheduler
=============================
CPU shares
----------
Each task has a number of CPU shares that determine its entitlement. Shares can
be read/set directly via the files
/proc/<pid>/cpu_shares
/proc/<tgid>/task/<pid>/cpu_shares
or indirectly via setting the task's nice value using nice or renice. A task
may be allocated between 1 and 420 shares with 20 shares being the default
allocation. A nice value >= 0 is mapped to (20 - nice) shares and a value
< 0 is mapped to (20 + nice * nice) shares. If shares are set directly via
/proc/<pid>/cpu_shares then its nice value will be adjusted accordingly.
CPU usage rate caps
-------------------
A task's CPU usage rate cap imposes a soft (or hard) upper limit on the rate at
which it can use CPU resources and can be set/read via the files
/proc/<pid>/cpu_rate_cap
/proc/<tgid>/task/<pid>/cpu_rate_cap
Usage rate caps are expressed as rational numbers (e.g. "1 / 2") and hard caps
are signified by a "!" suffix. The rational number indicates the proportion
of a single CPU's capacity that the task may use. The value of the number must
be in the range 0.0 to 1.0 inclusive for soft caps. For hard caps there is an
additional restriction that a value of 0.0 is not permitted. Tasks with a
soft cap of 0.0 become true background tasks and only get to run when no other
tasks are active.
When hard capped tasks exceed their cap they are removed from the run queues
and placed in a "sinbin" for a short while until their usage rate decays to
within limits.
Scheduler Tuning Parameters
---------------------------
The characteristics of the Kalman filter used for usage rate estimates are
determined by the _response half life_. The default value for the half life is
5000 msecs, but this can be set to any value between 1000 and 100000 msecs
during a make config. Also, if the SCHED_DYNAMIC_HALF_LIFE config option is set
to Y, the half life can be modified dynamically on a running system within the
above range by writing to /proc/cpu_half_life.
Currently EBS gives all tasks a fixed default timeslice of 100msec. As with the
half life, this can be chosen at build time (between 1 and 500msec) and building
with the SCHED_DYNAMIC_TIME_SLICE option enables on-the-fly changes via
/proc/timeslice.
Performance of the EBS scheduler depends on these parameters, as well as the
load characteristics of the system. The current default settings, however, have
worked well with most loads so far.
Scheduler statistics
--------------------
Global and per task scheduling statistics are available via /proc. To reduce
the size of this post, the details are not listed here but can be seen at
<http://ebs.aurema.com>.
Implementation
==============
Those interested mainly in EBS performance can skip this section...
Effective entitlement per share
-------------------------------
Ideally, the CPU USAGE PER SHARE of tasks demanding as much CPU as they are
entitled to should be equal. By keeping track of the HIGHEST CPU usage per
share that has been observed and comparing it to the CPU usage per share for
each task that runs, tasks that are receiving less usage per share than the
one getting the most can be given a better priority, so they can "catch up".
This highest value of CPU usage per share is maintained on each runqueue as
that CPU's _effective entitlement per share_, and is used as a basis for all
priority computations on that CPU.
In a nutshell, the task that is receiving the most CPU usage for each of its
shares serves as the yardstick via which the treatment of other tasks on that
CPU are measured.
Task priorities
---------------
The array switch and interactivity estimator have been removed - a task's
eligibility to run is determined purely by its priority. A task's priority is
recalculated when it is forked, wakes up, uses up its timeslice or is
preempted.
The following ratio:
task's usage_per_share
-----------------------------------------------------------------------
min(task's CPU rate cap per share, rq->effective_entitlement_per_share)
[where CPU rate cap per share is simply (CPU rate cap / CPU shares)]
is then mapped onto the SCHED_NORMAL priority range. The mapping is such
that a ratio of 1.0 is equivalent to the mean SCHED_NORMAL priority, and
ratios less than and greater than 1.0 are mapped to priorities less and
greater than the mean respectively. This serves to boost tasks
using less than their entitlement and penalise those using more than their
entitlement or CPU rate cap. It also provides interactive and I/O bound tasks
with favourable priorities since such tasks have inherently low CPU usage.
Lastly, the ->prio field in the task structure has been eliminated. The
runqueue structure stores the priority of the currently running task, and
enqueue/dequeue_task() have been adapted to work without ->prio. The reason for
getting rid of ->prio is to facilitate the O(1) priority promotion of runnable
tasks, explained below.
Only one heuristic
------------------
All the heuristics used by the stock scheduler have been removed. The EBS
scheduler uses only one heuristic: newly forked child tasks are given the
same usage rate as their parent, rather than a zero usage. This is done to
mitigate the "ramp up" effect to some extent. "Ramp up" is the delay between
a change in a task's usage rate and the Kalman filter estimating the new rate,
which in this case could cause a parent to be swamped by its children. Total
elimination of ramp up is undesirable as it is also responsible for good
responsiveness of interactive processes.
O(1) task promotion
-------------------
When the system is busy, tasks waiting on run queues will have decaying usages,
which means that eventually tasks which have been waiting long enough will be
entitled to a priority boost. Not giving them a boost will result in unfair
starvation, but on the other hand periodically visiting every runnable task is
an O(n) operation.
By inverting the priority and Kalman filter functions, it is possible to
determine at priority calculation time after how long a task will be entitled to
be promoted to the next best priority. These "promotion times" for each
SCHED_NORMAL priority are then divided by the smallest of these times (call
this the promotion_interval) to obtain the "number of promotion intervals"
for each priority. Naturally these are stored in a table indexed by priority.
enqueue_task() now places SCHED_NORMAL tasks onto an appropriate "promotion
list" as well as the run queue. The list is determined by the enqueued task's
current priority and the number of promotion intervals that must pass before
it is eligible to be bumped up to the next best priority. And of course
dequeue_task() takes tasks off their promotion list. Therefore, tasks that
get to run before they are due for a promotion (which is usually the case)
don't get one.
Every promotion_interval jiffies, scheduler_tick() looks at only the promotion
lists that are now due for a priority bump, and anything on the lists is given
the required boost. (The highest SCHED_NORMAL and background priorities
are ignored, as these tasks don't need promotion). Regardless of the
number that need promoting, this is done in O(1) time. The promotion code
exploits the fact that tasks of the same priority that are due for promotion
at the same time, ie. are contiguous on a promotion list, ARE ALSO CONTIGUOUS
ON THE RUN QUEUE. Therefore, promoting N tasks at priority X is simply a matter
of "lifting" these tasks out of their current place on the runqueue in one
chunk, and appending this chunk to the (X - 1) priority list. These tasks are
then placed onto a new promotion list according to their new (X - 1) priority.
This simple list operation is made possible by not having to update each task's
->prio field (now that it has been removed) when moving them to their new
position on the runqueue.
Benchmarks
==========
Benchmarking was done using contest. The following are the results of running on
a dual PIII 866MHz with 256MB RAM.
no_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 78 166.7 0 26.9 1.00
2.6.2-EBS 3 74 175.7 0 16.2 1.00
cacherun:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 75 173.3 0 24.0 0.96
2.6.2-EBS 3 71 183.1 0 12.7 0.96
process_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 93 138.7 35 54.8 1.19
2.6.2-EBS 3 91 141.8 33 53.8 1.23
ctar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 99 141.4 1 5.1 1.27
2.6.2-EBS 3 95 145.3 1 4.2 1.28
xtar_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 93 147.3 1 9.7 1.19
2.6.2-EBS 3 89 152.8 1 6.7 1.20
io_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 99 144.4 4 14.1 1.27
2.6.2-EBS 3 91 153.8 3 10.9 1.23
read_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 193 73.6 13 7.8 2.47
2.6.2-EBS 3 136 101.5 7 6.6 1.84
list_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 83 160.2 0 4.8 1.06
2.6.2-EBS 3 80 165.0 0 2.5 1.08
mem_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 159 87.4 129 3.1 2.04
2.6.2-EBS 3 126 110.3 50 2.4 1.70
dbench_load:
Kernel [runs] Time CPU% Loads LCPU% Ratio
2.6.2 3 123 109.8 1 21.1 1.58
2.6.2-EBS 3 130 103.1 1 20.0 1.76
The big winners are read_load and mem_load; all the others were slightly
faster than the stock kernel except dbench, which was slightly worse.
X Windows Performance
=====================
The X server isn't strictly an interactive process, but it does have a major
influence on interactive response. The fact that it services a large number
of clients means that its CPU usage rate can be quite high, and this negates
the above mentioned favourable treatment of interactive and I/O bound
processes.
Therefore, for the best interactive feel, it is recommended that the X server
be run with a nice value of at least -15. In my own testing, a window wiggle
test with a make -j16 in the background and X reniced performed slightly
better than on the stock kernel.
When running apps such as xmms, I recommend that they should be reniced as well
when the background load is high. With the above setup and xmms reniced to -9,
there were no sound skips at all (without renicing, a few skips could be
detected).
Getting the patch
=================
The patch can be downloaded from
<http://sourceforge.net/projects/ebs-linux/>
Please note that there are 2 patches: the basic patch and the full patch. The
above description applies to the full patch. The basic patch only features
setting shares via nice, a fixed half life and timeslice, no statistics and
soft caps only. This basic patch is for those who are mainly interested in
looking at the core EBS changes to the stock scheduler.
The patches are against 2.6.2 (2.6.3 patches will be available shortly).
Comments, suggestions and testing are much appreciated.
The mailing list for this project is at
<http://lists.sourceforge.net/lists/listinfo/ebs-linux-devel>.
Cheers,
John
^ permalink raw reply [flat|nested] 66+ messages in thread* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-25 14:35 John Lee @ 2004-02-25 17:09 ` Timothy Miller 2004-02-25 22:12 ` John Lee 2004-02-25 22:51 ` Pavel Machek ` (2 subsequent siblings) 3 siblings, 1 reply; 66+ messages in thread From: Timothy Miller @ 2004-02-25 17:09 UTC (permalink / raw) To: John Lee; +Cc: linux-kernel John Lee wrote: > X Windows Performance > ===================== > > The X server isn't strictly an interactive process, but it does have a major > influence on interactive response. The fact that it services a large number > of clients means that its CPU usage rate can be quite high, and this negates > the above mentioned favourable treatment of interactive and I/O bound > processes. > > Therefore, for best interactive feel, it is recommended that the X server run > with a nice value of at least -15. From my own testing, doing a window wiggle > test with a make -j16 in the background and X reniced was slightly better than > for the stock kernel. > > When running apps such as xmms, I recommend that they should be reniced as well > when the background load is high. With the above setup and xmms reniced to -9, > there were no sound skips at all (without renicing, a few skips could be > detected). Well, considering that X is suid root, it's okay to require that it be run at nice -15, but how is the user without root access going to renice xmms? Even for those who do, they're not going to want to have to renice xmms every time they run it. Furthermore, it seems like a bad idea to keep marking more and more programs as suid root just so that they can boost their priority. Not to say that your idea is bad... in fact, it may be a pipe dream to get "flawless" interactivity without explicitly marking which programs have to be boosted in priority. Still, Nick and Con have done a wonderful job at getting close. This is a tough problem. 
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-25 17:09 ` Timothy Miller @ 2004-02-25 22:12 ` John Lee 2004-02-26 0:31 ` Timothy Miller 0 siblings, 1 reply; 66+ messages in thread From: John Lee @ 2004-02-25 22:12 UTC (permalink / raw) To: Timothy Miller; +Cc: linux-kernel On Wed, 25 Feb 2004, Timothy Miller wrote: > Well, considering that X is suid root, it's okay to require that it be > run at nice -15, but how is the user without root access going to renice > xmms? Hm, I would have thought the vast majority of xmms users would be running it on their own machines, to which they have root access. Hope I'm not missing something here... :-) > Even for those who do, they're not going to want to have to > renice xmms every time they run it. Furthermore, it seems like a bad > idea to keep marking more and more programs as suid root just so that > they can boost their priority. Assuming that all/most xmms users do have root permissions, I would think that this is a very minor inconvenience... isn't xmms something which you tend to start up once and leave running until you log out? I don't think xmms needs to be an suid program, it can just be given a renice once (ie. more shares, -9 ==> 101 shares, which is 5 times the default, just my choice) and then left alone. Furthermore, the controls that my patch features are intended to be exercised as root, normal users can do less (as for nice, you can give your own processes fewer shares but not more, and can apply _more_ restrictive CPU caps on your tasks). From my testing so far, X and xmms have been the only candidates for a shares increase, as these two have been the most talked about :-). And after all, one purpose of the patch is to allow users to allocate CPU to their tasks in any way they deem fit. > Not to say that your idea is bad... in fact, it may be a pipe dream to > get "flawless" interactivity without explicitly marking which programs > have to be boosted in priority.
Still, Nick and Con have done a > wonderful job at getting close. They have indeed, there haven't been any "poor interactivity" emails for a while now :-). Good interactivity was just one of my goals, I was also aiming for better CPU resource allocation and simplification of the main code paths in the scheduler by doing away with heuristics, and therefore better throughput. Cheers, John ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-25 22:12 ` John Lee @ 2004-02-26 0:31 ` Timothy Miller 2004-02-26 2:04 ` John Lee ` (2 more replies) 0 siblings, 3 replies; 66+ messages in thread From: Timothy Miller @ 2004-02-26 0:31 UTC (permalink / raw) To: John Lee; +Cc: linux-kernel John Lee wrote: > > On Wed, 25 Feb 2004, Timothy Miller wrote: > > >>Well, considering that X is suid root, it's okay to require that it be >>run at nice -15, but how is the user without root access going to renice >>xmms? > > > Hm, I would have thought the vast majority of xmms users would be running > it on their own machines, to which they have root access. Hope I'm not > missing something here... :-) It's a security concern to have to log in as root unnecessarily. It's bad enough we have to do that to change X11 configuration, but we shouldn't have to do that every time we want to start xmms. And just making it suid root is also a security concern. > > >>Even for those who do, they're not going to want to have to >>renice xmms every time they run it. Furthermore, it seems like a bad >>idea to keep marking more and more programs as suid root just so that >>they can boost their priority. > > > Assuming that all/most xmms users do have root permissions, I would think > that this is a very minor inconvenience... isn't xmms something which you > tend to start up once and leave running until you log out? This is a bad assumption. You should never require users to log in as root to do basic user-oriented tasks. Indeed, it's often nice to have an icon or menu option to start it without having to pull up a terminal, and if the program ASKS for the root password, it's annoying to have to type that in just to get it to start. Under Solaris, a number of device nodes (sound, serial ports, etc.) have their ownership changed to that of the user who logs into the console. This is so that they can access those devices without logging in as root.
What about computer labs of Linux boxes where users do not own the computers and are therefore not allowed to log in as root? Should they be prohibited from running xmms properly? For a while, the sysadmin here at work tried to deploy Windows boxes with restricted user privileges, and the users were not given the admin password. For the engineers, that changed, fortunately. But consider what would happen if Linux boxes were deployed that way. It would suck if Windows users could still listen to MP3's during heavy CPU usage but Linux users could not. There are many good reasons to lock down workstations and not provide root access. > > I don't think xmms needs to be an suid program, it can just be given a > renice once (ie. more shares, -9 ==> 101 shares, which is 5 times > the default, just my choice) and then left alone. Furthermore, the > controls that my patch features are intended to be exercised as root, > normal users can do less (as for nice, you can give your own processes > less shares but not more, and can apply _more_ restrictive CPU caps on > your tasks). If someone does not own the box they're using, but they want to, say, contribute to xmms development, they're going to be starting and stopping the program quite frequently. They're not going to have any way to set the nice level. Consider what happens if some other user logs in remotely to that workstation and starts a large compile. > From my testing so far, X and xmms have been the only candidates for a > shares increase, as these two have been the most talked about :-). > And after all, one purpose of the patch is to allow users to allocate CPU > to their tasks in any way they deem fit. They are the most talked about, so you tested them. Fine. But we all know that they are not representative samples. There are bound to be numerous other programs that have similar problems. The way your scheduler works, USERS cannot "allocate CPU to their tasks in any way they deem fit".
Only system administrators can. > >>Not to say that your idea is bad... in fact, it may be a pipe dream to >>get "flawless" interactivity without explicitly marking which programs >>have to be boosted in priority. Still, Nick and Con have done a >>wonderful job at getting close. > > > They have indeed, there haven't been any "poor interactivity" emails for a > while now :-). > > Good interactivity was just one of my goals, I was also aiming for better > CPU resource allocation and simplification of the main code paths in the > scheduler by doing away with heuristics, and therefore better throughput. I read your paper, and I think you have some wonderful ideas. Don't get me wrong. I think that your ideas, coupled with an interactivity estimator, have an excellent chance of producing a better scheduler. In fact, that may be the only "flaw" in your design. It sounds like your scheduler does an excellent job at fairness with very low overhead. The only problem with it is that it doesn't determine priority dynamically. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 0:31 ` Timothy Miller @ 2004-02-26 2:04 ` John Lee 2004-02-26 2:18 ` Peter Williams 2004-02-26 2:48 ` Nuno Silva 2 siblings, 0 replies; 66+ messages in thread From: John Lee @ 2004-02-26 2:04 UTC (permalink / raw) To: Timothy Miller; +Cc: linux-kernel On Wed, 25 Feb 2004, Timothy Miller wrote: > > Hm, I would have thought the vast majority of xmms users would be running > > it on their own machines, to which they have root access. Hope I'm not > > missing something here... :-) > > It's a security concern to have to login as root unnecessarily. It's > bad enough we have to do that to change X11 configuration, but we > shouldn't have to do that every time we want to start xmms. And just > suid root is also a security concern. Ah, OK. Security point taken. > > Assuming that all/most xmms users do have root permissions, I would think > > that this is a very minor inconvenience... isn't xmms something which you > > tend to start up once and leave running until you log out? <snip> > What about computer labs of Linux boxes where users do not own the > computers and are therefore not allowed to login as root. Should they > be prohibited from running xmms properly? > > If someone does not own the box they're using, but they want to, say, > contribute to xmms development, they're going to be starting and > stopping the program quite frequently. They're not going to have any > way to set the nice level. > > Consider what happens if some other user logs in remotely to that > workstation and starts a large compile. Valid points. I guess I've been too accustomed to playing MP3s on my own box :-(. > > From my testing so far, X and xmms have been the only candidates for a > > shares increase, as these two have been the most talked about :-). > > They are the most talked about, so you tested them. Fine. But we all > know that they are not representative samples. There are bound to be > numerous other programs that have similar problems.
No others that I've noticed yet, but yes you're probably right. Which is why I'm really looking forward to getting feedback in this area. > The way your scheduler works, USERS cannot "allocate CPU to their tasks > in any way they deem fit". Only system administrators can. Correct, I used the wrong word. Another symptom of too much play on my own boxes... > I read your paper, and I think you have some wonderful ideas. Don't get > me wrong. I think that your ideas, coupled with an interactivity > estimator, have an excellent chance of producing a better scheduler. Thanks :-). And thanks for your feedback. Cheers, John ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 0:31 ` Timothy Miller 2004-02-26 2:04 ` John Lee @ 2004-02-26 2:18 ` Peter Williams 2004-02-26 2:42 ` Mike Fedyk ` (2 more replies) 2004-02-26 2:48 ` Nuno Silva 2 siblings, 3 replies; 66+ messages in thread From: Peter Williams @ 2004-02-26 2:18 UTC (permalink / raw) To: Timothy Miller; +Cc: linux-kernel Timothy Miller wrote: > <snip> > In fact, that may be the only "flaw" in your design. It sounds like > your scheduler does an excellent job at fairness with very low overhead. > The only problem with it is that it doesn't determine priority > dynamically. This (i.e. automatic renicing of specified programs) is a good idea but is not really a function that should be undertaken by the scheduler itself. Two possible solutions spring to mind: 1. modify do_execve() in fs/exec.c to renice tasks when they execute specified binaries 2. have a user space daemon poll running tasks periodically and renice them if they are running specified binaries Both of these solutions have their advantages and disadvantages, are (obviously) more complicated than I've made them sound and would require a great deal of care to be taken during their implementation. However, I think that they are both doable. My personal preference would be for the in kernel solution on the grounds of efficiency. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 2:18 ` Peter Williams @ 2004-02-26 2:42 ` Mike Fedyk 2004-02-26 4:10 ` Peter Williams 2004-02-26 16:10 ` Timothy Miller 2004-02-26 16:08 ` Timothy Miller 2004-03-04 21:18 ` Robert White 2 siblings, 2 replies; 66+ messages in thread From: Mike Fedyk @ 2004-02-26 2:42 UTC (permalink / raw) To: Peter Williams; +Cc: Timothy Miller, linux-kernel Peter Williams wrote: > 2. have a user space daemon poll running tasks periodically and renice > them if they are running specified binaries > > Both of these solutions have their advantages and disadvantages, are > (obviously) complicated than I've made them sound and would require a > great deal of care to be taken during their implementation. However, I > think that they are both doable. My personal preference would be for > the in kernel solution on the grounds of efficiency. Better would be to have the kernel tell the daemon whenever a process is exec-ed, and you have simplicity in the kernel, and policy in user space. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 2:42 ` Mike Fedyk @ 2004-02-26 4:10 ` Peter Williams 2004-02-26 4:19 ` Mike Fedyk 2004-02-26 16:10 ` Timothy Miller 1 sibling, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-02-26 4:10 UTC (permalink / raw) To: Mike Fedyk; +Cc: Timothy Miller, linux-kernel Mike Fedyk wrote: > Peter Williams wrote: > >> 2. have a user space daemon poll running tasks periodically and renice >> them if they are running specified binaries >> >> Both of these solutions have their advantages and disadvantages, are >> (obviously) complicated than I've made them sound and would require a >> great deal of care to be taken during their implementation. However, >> I think that they are both doable. My personal preference would be >> for the in kernel solution on the grounds of efficiency. > > > Better would be to have the kernel tell the daemon whenever a process in > exec-ed, and you have simplicity in the kernel, and policy in user space. Yes. That would be a good solution. Does a mechanism that allows the kernel to notify specific programs about specific events like this exist? Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 4:10 ` Peter Williams @ 2004-02-26 4:19 ` Mike Fedyk 2004-02-26 19:23 ` Shailabh Nagar 0 siblings, 1 reply; 66+ messages in thread From: Mike Fedyk @ 2004-02-26 4:19 UTC (permalink / raw) To: Peter Williams; +Cc: Timothy Miller, linux-kernel Peter Williams wrote: > Mike Fedyk wrote: > >> Peter Williams wrote: >> >>> 2. have a user space daemon poll running tasks periodically and >>> renice them if they are running specified binaries >>> >>> Both of these solutions have their advantages and disadvantages, are >>> (obviously) complicated than I've made them sound and would require a >>> great deal of care to be taken during their implementation. However, >>> I think that they are both doable. My personal preference would be >>> for the in kernel solution on the grounds of efficiency. >> >> >> >> Better would be to have the kernel tell the daemon whenever a process >> in exec-ed, and you have simplicity in the kernel, and policy in user >> space. > > > Yes. That would be a good solution. Does a mechanism that allows the > kernel to notify specific programs about specific events like this exist? I'm sure DaveM would suggest Netlink, but there are probably several implementations for Linux. I'll let other more knowledgeable people fill in the list. Mike ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 4:19 ` Mike Fedyk @ 2004-02-26 19:23 ` Shailabh Nagar 2004-02-26 19:46 ` Mike Fedyk 0 siblings, 1 reply; 66+ messages in thread From: Shailabh Nagar @ 2004-02-26 19:23 UTC (permalink / raw) To: Mike Fedyk; +Cc: Peter Williams, Timothy Miller, linux-kernel Mike Fedyk wrote: > Peter Williams wrote: > >> Mike Fedyk wrote: >> >>> Peter Williams wrote: >>> >>>> 2. have a user space daemon poll running tasks periodically and >>>> renice them if they are running specified binaries >>>> >>>> Both of these solutions have their advantages and disadvantages, are >>>> (obviously) complicated than I've made them sound and would require >>>> a great deal of care to be taken during their implementation. >>>> However, I think that they are both doable. My personal preference >>>> would be for the in kernel solution on the grounds of efficiency. The CKRM project (http://ckrm.sf.net) took the kernel approach too, for the same reason you mentioned - efficiency. Having a userspace daemon poll and readjust priorities/entitlements is possible but it will not be able to react as quickly when entitlements change and will incur overheads for unnecessary polls when they don't. In CKRM, the corresponding problem being solved is what should be done when a task moves from one "class" to another with a potentially different entitlement (we call them guarantees and limits). Since class changes can happen relatively often (compared to changes to entitlements for ebs), it was even more imperative to be efficient. >>> >>> >>> >>> >>> Better would be to have the kernel tell the daemon whenever a process >>> in exec-ed, and you have simplicity in the kernel, and policy in user >>> space. As it turns out, one can still use a fairly simple in-kernel module which provides a *mechanism* for effectively changing a process' entitlement while retaining the policy component in userland. 
CKRM has a rule-based classification engine that allows simple rules to be defined by the user/sysadmin and evaluated by the kernel at various kernel events. Using this, it's not only possible to catch exec's but also setuids/setgids etc. (which are also legitimate points at which a sysadmin may want a task's entitlement changed). The RBCE we wrote was fairly efficient though we didn't do specific measurements of the overhead as the interfaces started changing. >> >> >> Yes. That would be a good solution. Does a mechanism that allows the >> kernel to notify specific programs about specific events like this exist? > I guess there are two subparts to this question: a) Are there efficient kernel-user communication mechanisms which could be used to do such event notifications ? Yes, netlink is one, relayfs (http://www.opersys.com/relayfs) is another. b) Are the hooks in place at the right points ? Except for the LSM security* calls, no. Which leaves you with two options - put your own hooks into do_execve, either custom or using some general hook interface like KHI (http://www-124.ibm.com/linux/projects/kernelhooks/) OR find a way to stack your function with LSM's calls. CKRM is looking at both these options since we need hooks in several places (fork, exec, setuid, setgid....) > I'm sure DaveM would suggest Netlink, but there are probably several > implementations for Linux. > > I'll let other more knowledgeable people fill in the list. Disclaimer : no claim to be knowledgeable....:-) Just sharing some of our experiences going over similar problems in a slightly broader context i.e. class-based resource management. -- Shailabh ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 19:23 ` Shailabh Nagar @ 2004-02-26 19:46 ` Mike Fedyk 2004-02-26 20:42 ` Shailabh Nagar 0 siblings, 1 reply; 66+ messages in thread From: Mike Fedyk @ 2004-02-26 19:46 UTC (permalink / raw) To: Shailabh Nagar; +Cc: Peter Williams, Timothy Miller, linux-kernel Shailabh Nagar wrote: >>> Mike Fedyk wrote: >>>> Better would be to have the kernel tell the daemon whenever a >>>> process in exec-ed, and you have simplicity in the kernel, and >>>> policy in user space. > > > > As it turns out, one can still use a fairly simple in-kernel module > which provides a *mechanism* for effectively changing a process' > entitlement while retaining the policy component in userland. How much code could be removed if CKRM triggered a userspace process to perform the operations required? ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 19:46 ` Mike Fedyk @ 2004-02-26 20:42 ` Shailabh Nagar 0 siblings, 0 replies; 66+ messages in thread From: Shailabh Nagar @ 2004-02-26 20:42 UTC (permalink / raw) To: Mike Fedyk; +Cc: Peter Williams, Timothy Miller, linux-kernel Mike Fedyk wrote: > Shailabh Nagar wrote: > >>>> Mike Fedyk wrote: >>>> >>>>> Better would be to have the kernel tell the daemon whenever a >>>>> process in exec-ed, and you have simplicity in the kernel, and >>>>> policy in user space. >>>> >> >> >> >> As it turns out, one can still use a fairly simple in-kernel module >> which provides a *mechanism* for effectively changing a process' >> entitlement while retaining the policy component in userland. > > > How much code could be removed if CKRM triggered a userspace process > to perform the operations required? In CKRM, the code to perform classification is an optional kernel module. So size isn't really an issue in terms of impact to core kernel code. Our prototype version of the classification engine, RBCE, is about 2700 lines without any effort being put into reducing its size etc. If that were to be completely pared down to only provide events to userspace, it would come down by quite a bit (can't say exactly how much but at least 50% is a safe bet). I think the more important question is performance impact - what do you give up in terms of efficiency and granularity of control by going to userspace vs what you gain in reduced kernel pathlength. Empirically, we found RBCE was quite efficient but no quantitative analysis was done. -- Shailabh ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 2:42 ` Mike Fedyk 2004-02-26 4:10 ` Peter Williams @ 2004-02-26 16:10 ` Timothy Miller 2004-02-26 19:47 ` Mike Fedyk 2004-02-26 22:51 ` Peter Williams 1 sibling, 2 replies; 66+ messages in thread From: Timothy Miller @ 2004-02-26 16:10 UTC (permalink / raw) To: Mike Fedyk; +Cc: Peter Williams, linux-kernel How about this: The kernel tracks CPU usage, time slice expiration, and numerous other statistics, and exports them to userspace through /proc or somesuch. Then a user-space daemon adjusts priority. This could work, but it would be sluggish in adjusting priorities. I still like Nick and Con's solutions better. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 16:10 ` Timothy Miller @ 2004-02-26 19:47 ` Mike Fedyk 2004-02-26 22:51 ` Peter Williams 1 sibling, 0 replies; 66+ messages in thread From: Mike Fedyk @ 2004-02-26 19:47 UTC (permalink / raw) To: Timothy Miller; +Cc: Peter Williams, linux-kernel Timothy Miller wrote: > How about this: > > The kernel tracks CPU usage, time slice expiration, and numerous other > statistics, and exports them to userspace through /proc or somesuch. > Then a user-space daemon adjusts priority. This could work, but it > would be sluggish in adjusting priorities. Userspace shouldn't have to poll, especially if there needs to be low latency in the interaction. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 16:10 ` Timothy Miller 2004-02-26 19:47 ` Mike Fedyk @ 2004-02-26 22:51 ` Peter Williams 2004-02-27 10:06 ` Helge Hafting 1 sibling, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-02-26 22:51 UTC (permalink / raw) To: Timothy Miller; +Cc: Mike Fedyk, linux-kernel Timothy Miller wrote: > How about this: > > The kernel tracks CPU usage, time slice expiration, and numerous other > statistics, and exports them to userspace through /proc or somesuch. > Then a user-space daemon adjusts priority. Yes, the right statistics could allow these processes to be identified reasonably accurately. The programs in question would have the following characteristics: 1. low CPU usage rate, and 2. a very regular pattern of use i.e. the size of each CPU burst would be approximately constant as would the size of the intervals between each burst. The appropriate statistic to identify the second of these would be variance or (equivalently but more expensively) standard deviation. Whether this problem is bad/important enough to warrant the extra overhead of gathering these statistics is a moot point. We had to generate very high system loads on a single CPU system in order to cause one or two skips in xmms over a period of a couple of minutes. It should be noted that these are the type of task characteristics for which the real time scheduler classes are designed and I think that someone mentioned that if run with sufficient privileges xmms tries to make itself SCHED_RR. > This could work, but it > would be sluggish in adjusting priorities. > > I still like Nick and Con's solutions better. > Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 22:51 ` Peter Williams @ 2004-02-27 10:06 ` Helge Hafting 2004-02-27 11:04 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Helge Hafting @ 2004-02-27 10:06 UTC (permalink / raw) To: Peter Williams; +Cc: Timothy Miller, Mike Fedyk, linux-kernel Peter Williams wrote: > Timothy Miller wrote: > >> How about this: >> >> The kernel tracks CPU usage, time slice expiration, and numerous other >> statistics, and exports them to userspace through /proc or somesuch. >> Then a user-space daemon adjusts priority. > > > Yes, the right statistics could allow these processes to be identified > reasonably accurately. The programs in question would have the > following characteristics: > > 1. low CPU usage rate, and > 2. a very regular pattern of use i.e. the size of each CPU bursts would > be approximately constant as would the size of the intervals between > each burst. There is no need for the regularity. When I use a word processor, I use it very irregularly. Sometimes I type text, and wants each letter typed to appear instantly. This fits well with "low cpu usage" and sudden short bursts. There may be lots of long delays though while I think about stuff to write. So the intervals are irregular, I still believe I should get the boosts as long as the bursts are small. Doing something big (such as invoking latex on a big document) is cpu-heavy, but then it is ok not to get the boost. Current schedulers based on io-waiting gets this right already. > > The appropriate statistic to identify the second of these would be > variance or (equivalently but more expensively) standard deviation. > Whether this problem is bad/important enough to warrant the extra > overhead of gathering these statistics is a moot point. We had to > generate very high system loads on a single CPU system in order to cause > one or two skips in xmms over a period of a couple of minutes. 
Well, perhaps you could give a slightly bigger boost to a very regular thing like xmms. But even that might have some snags, the load might change a lot when doing midi in software, depending on how many instruments are active simultaneously. There go the constant-size bursts. Helge Hafting ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-27 10:06 ` Helge Hafting @ 2004-02-27 11:04 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-02-27 11:04 UTC (permalink / raw) To: Helge Hafting; +Cc: Timothy Miller, Mike Fedyk, linux-kernel Helge Hafting wrote: > Peter Williams wrote: > >> Timothy Miller wrote: >> >>> How about this: >>> >>> The kernel tracks CPU usage, time slice expiration, and numerous >>> other statistics, and exports them to userspace through /proc or >>> somesuch. Then a user-space daemon adjusts priority. >> >> >> >> Yes, the right statistics could allow these processes to be identified >> reasonably accurately. The programs in question would have the >> following characteristics: >> >> 1. low CPU usage rate, and >> 2. a very regular pattern of use i.e. the size of each CPU bursts >> would be approximately constant as would the size of the intervals >> between each burst. > > > There is no need for the regularity. When I use a word processor, I > use it very irregularly. Sometimes I type text, and wants each letter > typed to appear instantly. This fits well with "low cpu usage" and > sudden short bursts. There may be lots of long delays though while > I think about stuff to write. So the intervals are irregular, I still > believe I should get the boosts as long as the bursts are small. > Doing something big (such as invoking latex on a big document) > is cpu-heavy, but then it is ok not to get the boost. > Current schedulers based on io-waiting gets this right already. I was describing programs such as xmms NOT normal interactive programs such as editors etc. Normal interactive programs don't need special treatment with our EBS scheduler because their inherent low usage rate means that they are always well under their entitlement and consequently get given a high dynamic priority. 
> >> >> The appropriate statistic to identify the second of these would be >> variance or (equivalently but more expensively) standard deviation. >> Whether this problem is bad/important enough to warrant the extra >> overhead of gathering these statistics is a moot point. We had to >> generate very high system loads on a single CPU system in order to >> cause one or two skips in xmms over a period of a couple of minutes. > > > Well, perhaps you could give a slightly bigge boost to a very regular > thing like xmms. But even that might have some snags, the load > might change a lot when doing midi in software, depending on how > many instruments are active simultaneously. There goes the > constant-size bursts. I think the burst sizes and intervals would still be fairly constant but the usage rate would be higher. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 2:18 ` Peter Williams 2004-02-26 2:42 ` Mike Fedyk @ 2004-02-26 16:08 ` Timothy Miller 2004-02-26 16:51 ` Rik van Riel 2004-02-26 20:15 ` Peter Williams 2004-03-04 21:18 ` Robert White 2 siblings, 2 replies; 66+ messages in thread From: Timothy Miller @ 2004-02-26 16:08 UTC (permalink / raw) To: Peter Williams; +Cc: linux-kernel Peter Williams wrote: > Timothy Miller wrote: > > <snip> > >> In fact, that may be the only "flaw" in your design. It sounds like >> your scheduler does an excellent job at fairness with very low >> overhead. The only problem with it is that it doesn't determine >> priority dynamically. > > > This (i.e. automatic renicing of specified programs) is a good idea but > is not really a function that should be undertaken by the scheduler > itself. Two possible solutions spring to mind: > > 1. modify the do_execve() in fs/exec.c to renice tasks when they execute > specified binaries We don't want user-space programs to have control over priority. This is DoS waiting to happen. > 2. have a user space daemon poll running tasks periodically and renice > them if they are running specified binaries This is much too specific. Again, if the USER has control over this list, then it's potential DoS. And if the user adds a program which should qualify but which is not in the list, the program will not get its deserved boost. And a sysadmin is not going to want to update 200 lab computers just so one user can get their program to run properly. > > Both of these solutions have their advantages and disadvantages, are > (obviously) complicated than I've made them sound and would require a > great deal of care to be taken during their implementation. However, I > think that they are both doable. My personal preference would be for > the in kernel solution on the grounds of efficiency. They are doable, but they are not a general solution. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 16:08 ` Timothy Miller @ 2004-02-26 16:51 ` Rik van Riel 2004-02-26 20:15 ` Peter Williams 1 sibling, 0 replies; 66+ messages in thread From: Rik van Riel @ 2004-02-26 16:51 UTC (permalink / raw) To: Timothy Miller; +Cc: Peter Williams, linux-kernel On Thu, 26 Feb 2004, Timothy Miller wrote: > We don't want user-space programs to have control over priority. This > is DoS waiting to happen. Nope, there's a solution to this. Please read the CKRM design draft I just posted elsewhere in this thread ;) -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 16:08 ` Timothy Miller 2004-02-26 16:51 ` Rik van Riel @ 2004-02-26 20:15 ` Peter Williams 2004-02-27 14:46 ` Timothy Miller 1 sibling, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-02-26 20:15 UTC (permalink / raw) To: Timothy Miller; +Cc: linux-kernel Timothy Miller wrote: > > > Peter Williams wrote: > >> Timothy Miller wrote: >> > <snip> >> >>> In fact, that may be the only "flaw" in your design. It sounds like >>> your scheduler does an excellent job at fairness with very low >>> overhead. The only problem with it is that it doesn't determine >>> priority dynamically. >> >> >> >> This (i.e. automatic renicing of specified programs) is a good idea >> but is not really a function that should be undertaken by the >> scheduler itself. Two possible solutions spring to mind: >> >> 1. modify the do_execve() in fs/exec.c to renice tasks when they >> execute specified binaries > > > We don't want user-space programs to have control over priority. They already do e.g. renice is such a program. > This > is DoS waiting to happen. > >> 2. have a user space daemon poll running tasks periodically and renice >> them if they are running specified binaries > > > This is much too specific. Again, if the USER has control over this > list, It would obviously be under root control. > then it's potential DoS. And if the user adds a program which > should qualify but which is not in the list, the program will not get > its deserved boost. > > And a sysadmin is not going to want to update 200 lab computers just so > one user can get their program to run properly. > >> >> Both of these solutions have their advantages and disadvantages, are >> (obviously) complicated than I've made them sound and would require a >> great deal of care to be taken during their implementation. However, >> I think that they are both doable. My personal preference would be >> for the in kernel solution on the grounds of efficiency. 
> > > They are doable, but they are not a general solution. > Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 20:15 ` Peter Williams @ 2004-02-27 14:46 ` Timothy Miller 2004-02-28 5:00 ` Peter Williams 0 siblings, 1 reply; 66+ messages in thread From: Timothy Miller @ 2004-02-27 14:46 UTC (permalink / raw) To: Peter Williams; +Cc: linux-kernel Peter Williams wrote: > Timothy Miller wrote: > >> >> >> We don't want user-space programs to have control over priority. > > > They already do e.g. renice is such a program. No one's talking about LOWERING priority here. You can only DoS someone else if you can set negative nice values, and non-root can't do that. > >> This is DoS waiting to happen. >> >>> 2. have a user space daemon poll running tasks periodically and >>> renice them if they are running specified binaries >> >> >> >> This is much too specific. Again, if the USER has control over this >> list, > > > It would obviously be under root control. And that means if someone wants to run a program which is not on the list but which requires (and deserves) higher priority, they cannot. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-27 14:46 ` Timothy Miller @ 2004-02-28 5:00 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-02-28 5:00 UTC (permalink / raw) To: Timothy Miller; +Cc: linux-kernel Timothy Miller wrote: > > > Peter Williams wrote: > >> Timothy Miller wrote: >> > >>> >>> >>> We don't want user-space programs to have control over priority. >> >> >> >> They already do e.g. renice is such a program. > > > No one's talking about LOWERING priority here. You can only DoS someone > else if you can set negative nice values, and non-root can't do that. Which is why root has to be in control of the mechanism. > >> >>> This is DoS waiting to happen. >>> >>>> 2. have a user space daemon poll running tasks periodically and >>>> renice them if they are running specified binaries >>> >>> >>> >>> >>> This is much too specific. Again, if the USER has control over this >>> list, >> >> >> >> It would obviously be under root control. > > > And that means if someone wants to run a program which is not on the > list but which requires (and deserves) higher priority, they cannot. Any mechanism that causes a task to be treated more favourably than others needs to be under root's control. It's root's prerogative to decide who deserves more favourable treatment. This is even more important (for obvious reasons) if a reservation (or guarantee) mechanism is involved. I'd like to stress at this point that xmms only needs a boost under extremely high loads and this issue is being blown out of proportion. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* RE: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 2:18 ` Peter Williams 2004-02-26 2:42 ` Mike Fedyk 2004-02-26 16:08 ` Timothy Miller @ 2004-03-04 21:18 ` Robert White 2004-03-04 23:15 ` Peter Williams 2 siblings, 1 reply; 66+ messages in thread From: Robert White @ 2004-03-04 21:18 UTC (permalink / raw) To: 'Peter Williams', 'Timothy Miller'; +Cc: linux-kernel At a previous employer (so code not available) I used a simple expedient to solve this very problem. I had a custom program "shim.c" that tweaked priorities and environment variables. Basically a fistful of lines that would take argv[0], look for the file named ".shim_"+basename(argv[0]) {in a well-defined location} to load some simple environment and path and priority overrides, apply these changes and then setuid itself back to the real user and exec() the real program with the received args. It had some few degenerate cases (shmming out from under a setuid program was the primary one) but it worked out rather well and had little-to-no meaningful overhead. You set up a /usr/local/bin (or equivalent) directory, link shim into that directory named as the various programs that need to be boosted (e.g. xmms etc) and put that directory earlier on the path than the real executable. {If this directory only contains shims, it is useful to code shim.c to remove that directory from the PATH.) This technique lets the administrator have fine-grained control of a reasonable list of priority promotions and permissions overrides without having to move anything into kernel space or running status daemons. ==== I would think that while fork() should keep the heuristics of its parent, exec() would probably need to do some normalizing. ==== Has this scheduler been tried for applications like CD burning? Rob. 
-----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Peter Williams Sent: Wednesday, February 25, 2004 6:18 PM To: Timothy Miller Cc: linux-kernel@vger.kernel.org Subject: Re: [RFC][PATCH] O(1) Entitlement Based Scheduler Timothy Miller wrote: > <snip> > In fact, that may be the only "flaw" in your design. It sounds like > your scheduler does an excellent job at fairness with very low overhead. > The only problem with it is that it doesn't determine priority > dynamically. This (i.e. automatic renicing of specified programs) is a good idea but is not really a function that should be undertaken by the scheduler itself. Two possible solutions spring to mind: 1. modify the do_execve() in fs/exec.c to renice tasks when they execute specified binaries 2. have a user space daemon poll running tasks periodically and renice them if they are running specified binaries Both of these solutions have their advantages and disadvantages, are (obviously) more complicated than I've made them sound and would require a great deal of care to be taken during their implementation. However, I think that they are both doable. My personal preference would be for the in kernel solution on the grounds of efficiency. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-03-04 21:18 ` Robert White @ 2004-03-04 23:15 ` Peter Williams 0 siblings, 0 replies; 66+ messages in thread From: Peter Williams @ 2004-03-04 23:15 UTC (permalink / raw) To: Robert White; +Cc: 'Timothy Miller', linux-kernel Robert White wrote: > <snip> > > Has this scheduler been tried for applications like CD burning? > No. But it will be now that you've brought it to our attention. Thanks Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 0:31 ` Timothy Miller 2004-02-26 2:04 ` John Lee 2004-02-26 2:18 ` Peter Williams @ 2004-02-26 2:48 ` Nuno Silva 2004-02-26 4:25 ` Peter Williams 2004-02-26 16:12 ` Timothy Miller 2 siblings, 2 replies; 66+ messages in thread From: Nuno Silva @ 2004-02-26 2:48 UTC (permalink / raw) To: Timothy Miller; +Cc: John Lee, linux-kernel Timothy Miller wrote: [..] > > > It's a security concern to have to login as root unnecessarily. It's > bad enough we have to do that to change X11 configuration, but we > shouldn't have to do that every time we want to start xmms. And just > suid root is also a security concern. > Maybe I'm missing something, but xmms runs OK with zero load, right? The problem is that, when building the kernel and the entire kde tree, each with make -j 16, xmms skips a few times? Well, tough luck... And the user *can* do something about it, just nice -n 19 the builds and leave xmms alone. (Or you can use another player... :-) With this patch you can even say that each of the build processes can only hog 5% (at the most!) of the CPU (maybe the build is not a good example for mandatory CPU time caps, but it is useful). Besides, this implements a true run-only-when-no-one-else-wants-to-run nice mode which, combined with the absolute CPU time caps, hits some of my wish list for a complete scheduler :-) so I can't wait to test it :-) A final note to John Lee: you may want to check the Class-based Kernel Resource Management (CKRM) at: http://ckrm.sourceforge.net/ Thanks, Nuno Silva ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 2:48 ` Nuno Silva @ 2004-02-26 4:25 ` Peter Williams 2004-02-26 15:57 ` Rik van Riel 1 sibling, 1 reply; 66+ messages in thread From: Peter Williams @ 2004-02-26 4:25 UTC (permalink / raw) To: Nuno Silva; +Cc: Timothy Miller, John Lee, linux-kernel Nuno Silva wrote: > > > Timothy Miller wrote: > > [..] > >> >> >> It's a security concern to have to login as root unnecessarily. It's >> bad enough we have to do that to change X11 configuration, but we >> shouldn't have to do that every time we want to start xmms. And just >> suid root is also a security concern. >> > > Maybe I'm missing something, but xmms runs OK with zero load, right? The > problem is that, when building the kernel and the entire kde tree, each > with make -j 16, xmms skips a few times? Well, tough luck... > > And the user *can* do something about it, just nice -n 19 the builds and > leave xmms alone. (Or you can use another player... :-) > > With this patch you can even say that each of the build processes can > only hog 5% (at the most!) of the CPU (maybe the build is not a good > example for mandatory CPU time caps, but it is useful). > > Besides, this implements a true run-only-when-no-one-else-wants-to-run > nice mode which, combined with the absolute CPU time caps, hits some of my > wish list for a complete scheduler :-) so I can't wait to test it :-) Another idea that we are playing with for handling programs like xmms (i.e. programs that require guaranteed CPU bandwidth to perform well) is the complement of caps namely per task CPU reservations. The availability of CPU usage rate statistics for each task makes this possible but the question is "Is the functionality worth the extra overhead?". 
Of course, this won't solve the "need to be root" problem as this is obviously the sort of control that should be reserved for root but it is arguably better than having to guess how many shares a task needs to ensure that it gets the required CPU bandwidth. Peter -- Dr Peter Williams, Chief Scientist peterw@aurema.com Aurema Pty Limited Tel:+61 2 9698 2322 PO Box 305, Strawberry Hills NSW 2012, Australia Fax:+61 2 9699 9174 79 Myrtle Street, Chippendale NSW 2008, Australia http://www.aurema.com ^ permalink raw reply [flat|nested] 66+ messages in thread
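The per-task CPU usage rate statistics Peter refers to are, per the patch description earlier in the thread, kept as cheap decaying estimates with a configurable "response half life" rather than as true running averages over stored history. A minimal sketch of that idea (illustrative only, not the scheduler's code; the class name and tick granularity are invented here):

```python
# Sketch: a per-task CPU usage-rate estimate kept as an exponentially
# decaying average with a configurable response half-life. Unlike a true
# running average, it needs no history buffer: one value per task.

class UsageEstimator:
    def __init__(self, half_life_ticks):
        # After half_life_ticks of elapsed time, an old sample's weight halves.
        self.decay = 0.5 ** (1.0 / half_life_ticks)
        self.rate = 0.0  # estimated fraction of CPU used, 0.0 .. 1.0

    def tick(self, ran):
        # Called once per scheduler tick; ran is True if the task used the tick.
        self.rate = self.decay * self.rate + (1.0 - self.decay) * (1.0 if ran else 0.0)

est = UsageEstimator(half_life_ticks=50)
for _ in range(1000):        # task runs every other tick -> ~50% usage
    est.tick(True)
    est.tick(False)
print(round(est.rate, 2))    # prints 0.5
```

The estimate converges to the task's long-run usage fraction while weighting recent behaviour most heavily, which is what makes it usable both for reporting recent CPU usage and for enforcing caps or reservations.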
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler 2004-02-26 4:25 ` Peter Williams @ 2004-02-26 15:57 ` Rik van Riel 2004-02-26 19:28 ` Shailabh Nagar 0 siblings, 1 reply; 66+ messages in thread From: Rik van Riel @ 2004-02-26 15:57 UTC (permalink / raw) To: Peter Williams Cc: Nuno Silva, Timothy Miller, John Lee, linux-kernel, ckrm-tech [-- Attachment #1: Type: TEXT/PLAIN, Size: 1049 bytes --] On Thu, 26 Feb 2004, Peter Williams wrote: > Another idea that we are playing with for handling programs like xmms > (i.e. programs that require gauranteed CPU bandwidth to perform well) is > the complement of caps namely per task CPU reservations. > Of course, this won't solve the "need to be root" problem as this is > obviously the sort of control that should be reserved for root Not necessarily. We've just fixed this dilemma in the CKRM project, using a resource class filesystem for this kind of stuff. A user could have a certain percentage of the CPU guaranteed (especially the console user) and carve out part of his/her guarantee for multimedia applications. Please see the attached document, which is the 6th draft of this particular CKRM design. If you have any improvements for this spec, feel free to let us know ;) -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan [-- Attachment #2: CKRM 4 spec v6 --] [-- Type: TEXT/plain, Size: 36645 bytes --] CKRM (Class-based Kernel Resource Management) http://ckrm.sf.net v4.0 draft 6 25 Feb 2004 Rik van Riel riel@redhat.com Hubertus Franke, frankeh@watson.ibm.com Shailabh Nagar nagar@watson.ibm.com Vivek Kashyap kashyapv@us.ibm.com Chandra Seetharaman sekharan@us.ibm.com This is the fourth major revision of the CKRM API and framework. 
The first was presented at OLS'03, the second is what's described at http://ckrm.sf.net as of 18 Feb 2004 and the third was sent out on LKML and ckrm-tech@lists.sf.net on 31 Jan 2004. The fourth version is based on a merger of best-of-breed concepts from a filesystem-based API proposed by Rik van Riel (with inputs from Stephen Tweedie) on the ckrm-tech mailing list on 5 Feb 2004 and version three of this document.

The project is not committed to a particular API or architecture and welcomes discussions/comments on the proposal below. Please send feedback to any of the authors and cc: ckrm-tech@lists.sf.net. The latest version of the document will be kept at http://ckrm.sf.net

1.0 Overview
------------

Class-based Kernel Resource Management (CKRM) is a set of modifications to the Linux kernel to enable improved systems management. The key idea in CKRM is to control and monitor system resource usage through user-defined groups of tasks called classes. Classes can be defined to distinguish between applications, workloads and users in their usage of any system resource such as CPU ticks, physical page frames, disk I/O bandwidth, accept queue connections, number of open file handles etc.

  Sysadmin/resource management application        Users
            syscalls  /rcfs
               |        ^
               |        |
 Userspace     V        V
 ------------------------------------------------------
 Kernel    +------------------+
           |       Core       |---- Classification Engine (CE)
           +------------------+
                   |
                   +----------------------------....
                   |     |     |     |     |
                  RC1   RC2   RC3   RC4   RC5  ....

           RC = Resource Controllers
           (CPU, MEM, I/O, AcceptQ, open file descriptors, shmsegs etc.)

The CKRM components are as shown above:

a) Core: Kernel patch that has three key roles. First, it defines the user API consisting of system calls and a filesystem called rcfs (resource control file system). Second, it defines the APIs between itself and all other kernel components, namely the resource controllers and the optional classification engine. Thus the Core acts as the switchboard for users, resource controllers and classification engines to interact.
Finally, Core handles the creation and management of classes, which are described below.

b) Resource Controller (RC): A kernel patch providing differentiated access to some resource. There can be multiple RCs defined simultaneously. The CKRM project currently provides CPU (ticks), Mem (physical page frames), I/O (disk bandwidth) and Inbound Network (connections) controllers, with extensions planned to manage virtual resources like open file descriptors, shared memory segments etc.

c) Classification Engine (CE): An optional kernel module that assists in automatic classification of tasks into classes. A CE will provide a function that returns the class to which a task "should" belong. The classification function is called by the Core at significant kernel events (where a task's class might be assigned or expected to change) and appropriate action taken. CEs are completely optional. Tasks in the system can also be manually associated with classes. If a CE exists, it must adhere to the Core-CE kernel API. CKRM will provide a rule-based classification engine (RBCE) that performs classification by evaluating a set of rules entered by a privileged user.

2.0 Task Classes
----------------

A task class, commonly abbreviated to just class, is a collection of tasks with an associated set of shares and usage statistics for each managed resource in the system. A managed resource (such as CPU, physical memory and disk I/O) is one whose scheduler or controller is class-aware. Each class is associated with a lower and upper bound of resource usage, called guarantees and limits, for each managed resource in the system. Guarantees and limits depend on the type of resource and are described in detail in Section 2.5. The two are collectively called a share where the distinction is unimportant. Each task in the system always belongs to some task class.
The main principle behind CKRM is that a task's consumption of managed resources is controlled (using shares) and monitored (available as a class' statistics) primarily through the task's class. The association of tasks to classes is dynamic. A task's classification can be changed either manually as described in Section 2.7, or automatically using a classification engine as described in Section 4.0.

The task class is used to manage system wide resources. An alternative type of class is wholly contained within a task or kernel abstraction (see Section 6.0). A socket's accept queue is one such resource that can be used to control the incoming connection requests. Unless otherwise specified, a class will mean a task class.

2.1 Hierarchy of classes
------------------------

Classes are primarily used by sysadmins to monitor and control the resources used by various workloads running on a system e.g. mailserver, apache, dns etc. They can also be used to limit and track resource usage of users. In both these cases, it is useful to allow the application or user to manage its own allocation without having systemwide privileges. To enable this, CKRM classes can be hierarchically subdivided into subclasses. Each subclass gets its own share of the parent class' allocation. The user or workload associated with the parent class can change subclass shares and monitor subclass usage, effectively managing their allocation independently of each other and the system administrator.

The depth of the class hierarchy supported by CKRM is configurable. The default depth is expected to be three, though resource controllers will/should be written to support a reasonably larger number. The maximum depth value allowed will equal the minimum of the depths supported by any registered resource controller, i.e. if the registered CPU controller can only support a depth of 2, the user can only configure the depth to be 1 or 2 even if all other registered controllers can support a depth of 3.
A depth greater than the default may incur significant performance penalties and/or relaxation of the time granularity at which shares get enforced. CKRM will provide helper functions to assist resource controllers in traversing the class hierarchy for control and statistics updates.

2.2 rcfs API overview
---------------------

CKRM's primary user API is the resource control file system (rcfs). A filesystem is a natural interface for representing and managing the hierarchy of classes, offering several advantages:

- direct representation of the class hierarchy in the filesystem's tree structure
- does not need any new system calls, using standard unix filesystem system calls instead
- intuitive mapping of filesystem/file operations such as mkdir/rmdir, read/write, chmod etc. onto creation/deletion of classes, getting/setting class shares, controlling permissions etc.
- easy tracking and control of permissions at any level of the hierarchy, with read, write and access rights on a user, group and other basis.

The root of rcfs, typically mounted as /rcfs, represents the whole system. Each class is represented by a directory containing the following:

a) "magic" files. The following are proposed:

|-- target : placeholder to identify class when moving tasks to it
|-- shares : contains guarantees and limits of this class for each
             managed resource
|-- stats  : contains usage by this class of each managed resource
|-- config : config file for changing/reading configuration parameters
             for rcfs as well as resource controllers

b) |-- members/ : Subdirectory containing symlinks to the /proc entries for tasks belonging to the class, but not to any of its subclasses, i.e. tasks belonging to a class' default subclass as defined below.

c) Subdirectories for each subclass: each subdirectory is similar to the parent.
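The directory-per-class layout maps standard filesystem operations directly onto class management: mkdir creates a class, rmdir removes one, and the magic files carry its shares. A rough, runnable illustration of that mapping follows; since the real /rcfs exists only on a CKRM kernel, a scratch directory stands in for the mount point and the shares file is written by hand rather than auto-created by the kernel:

```shell
# Illustrative mock-up of the proposed rcfs workflow (not a real CKRM system):
RCFS=$(mktemp -d)                       # stand-in for /rcfs

mkdir "$RCFS/class_A"                   # create a class
printf 'cpu 35 60 80 90\n' > "$RCFS/class_A/shares"   # kernel would create this
mkdir "$RCFS/class_A/a1"                # subclass of class_A

cat "$RCFS/class_A/shares"              # read the class's share settings

# Removing a class with subclasses (or member tasks) must fail:
rmdir "$RCFS/class_A" 2>/dev/null || echo "busy: class still has subclasses"
```

With a real rcfs the kernel would populate target/shares/stats/config and members/ automatically on mkdir and reject the rmdir itself; the mock only demonstrates how the operations line up.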
rcfs configuration parameters can be set in two ways.

a) As a mount parameter for the rcfs filesystem, e.g.
   mount -t rcfs -o maxdepth=2 rcfs /rcfs
   mount -o remount,maxdepth=2 /rcfs

b) Writing to /rcfs/config dynamically:
   echo "rcfs:<parameter name>:<parameter value>" > /rcfs/config
   e.g. echo "rcfs:mode:manage" > /rcfs/config

The current and available config parameters can be read from /rcfs/config in the same format as they get written. /rcfs/config can also be used to configure resource controllers as described later.

2.3 Default subclass
--------------------

Each class can contain tasks that do not belong to any of its subclasses. To regulate and monitor such tasks, CKRM core implicitly defines a default subclass for each class, e.g. if class_A contains tasks t1, t2 and t3 and defines a subclass class_A1 which contains t1, then t2 & t3 belong to class_A's default subclass.

The default subclass of the root of rcfs (/rcfs) is significant because it is present at kernel bootup, being statically defined by the Core patch, and contains, at least initially, /sbin/init. Having such a systemwide default class allows CKRM to ensure that every task in the system always belongs to some class.

Default classes are not explicitly represented by a separate subdirectory at any level of the hierarchy. Thus /rcfs/class_A/target represents both class_A and its default subclass (moving a task to class_A implicitly moves it to class_A's default subclass). Default classes do have their own share and usage statistics, which are listed separately in the class' magic files 'shares' and 'stats' respectively. User level tools can be written to display default class data explicitly.

2.4 Class creation/deletion
---------------------------

Classes are created using mkdir/rmdir at the appropriate level of the rcfs tree. The created directory is automatically populated with the magic files and the /members subdirectory.
A class is always created empty and gets populated when tasks get manually or automatically classified to it. Removing a class is only allowed if it has no associated tasks or subclasses in it.

2.5 Guarantees and limits
-------------------------

A guarantee is the minimum amount of a resource that tasks of a class will get if they request it. Unused portions of the guarantee can be redistributed (work conservation) by the corresponding resource controller, with a reasonably timely reallocation back to the class should its demand rise later.

A limit is the maximum amount of a resource that a class can use. Limits can be either hard or soft, depending on the capability and semantics of the resource, e.g. open file descriptor limits are always hard whereas a limit on the number of page frames given to a class could be configured to be either hard or soft. Classes can never consume more resources than a hard limit, regardless of the usage by other classes. Exceeding a soft limit is permitted if the resource is sufficiently free. Resources granted over the soft limit can be reallocated by the resource controller to other classes which increase their demand (while remaining under their limit). In other words, the priority of resource allocation amongst classes is as follows:

highest: classes with demand < guarantee
         classes with hard/soft limit > demand > guarantee
lowest:  classes with demand > soft limit

Guarantees and limits are represented by whole numbers in the /path/to/<class>/shares magic file. The calculation of a class's share is done as follows. Each line of the /path/to/<class>/shares file represents one managed resource and its share values in the format

<resource name> <my_guarantee> <my_limit> <tot_guarantee> <tot_limit>

A class's guarantee/limit = class's <my_*> / its parent's <tot_*> (both my_* and tot_* values corresponding to the same resource, where * stands for 'guarantee' or 'limit') e.g.
assuming there's only one entry (say cpu) in the shares file, a class hierarchy with the following values

/A      cpu  35  60  80  90
   /a1  cpu  20  30  20  30
   /a2  cpu   5  25   5  25

will result in the following shares:

a1 guarantee = 20/80, limit = 30/90
a2 guarantee =  5/80, limit = 25/90
A's default class' guarantee = (80-20-5=55)/80, limit = (90-30-25=35)/90

To derive A's share (with respect to its peer classes), a calculation similar to the one done for a1, a2 can be done, using A's parent's <tot_*> values.

Note that the <my_*> values in /rcfs/shares have no significance (since there are no parent <tot_*> values against which they can be interpreted). Similarly, the <tot_*> values of a class at the leaf of the class hierarchy (i.e. one which has no children) do not have any significance. The default subclass of such a leaf class is guaranteed 100% (and has a 100% limit) as long as no other subclasses get defined.

Setting new resource shares is done through

echo "<resource name>:[my_guarantee|my_limit....tot_limit]:<new value>" \
     > /path/to/<class>/shares

e.g.

echo "cpu:tot_limit:200" > /path/to/A/shares

changes A's total cpu limit in the example above to 200 (from 90) and results in changing the shares of a1, a2 and A's default classes, while

echo "cpu:my_guarantee:45" > /path/to/A/a1/shares

changes a1's guarantee to 45/80 (from 20/80) and reduces the guarantee of A's default class to (80-45-5=30)/80.

A user can determine the shares of a class by reading the /path/to/<class>/shares file and parsing its contents as explained above. Default subclass shares at any level can be calculated by summing the shares listed in each of the visible subclasses' shares files and subtracting the sum from the parent's tot_* value. Userspace tools can be written to assist with all these calculations.
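The share arithmetic above is mechanical enough that such a userspace tool is small. A Python sketch (illustrative, not CKRM code; the dictionary layout is invented) reproducing the /A, /a1, /a2 numbers:

```python
# Sketch of the shares arithmetic: a class's effective guarantee/limit is its
# my_* value interpreted against the parent's tot_* value. The default
# subclass gets whatever the visible subclasses leave unclaimed.

from fractions import Fraction

# per class: (my_guarantee, my_limit, tot_guarantee, tot_limit) for "cpu"
shares = {
    "A":  (35, 60, 80, 90),
    "a1": (20, 30, 20, 30),
    "a2": (5, 25, 5, 25),
}

def effective(child, parent):
    my_g, my_l, _, _ = shares[child]
    _, _, tot_g, tot_l = shares[parent]
    return Fraction(my_g, tot_g), Fraction(my_l, tot_l)

def default_subclass(parent, children):
    my_g_sum = sum(shares[c][0] for c in children)
    my_l_sum = sum(shares[c][1] for c in children)
    _, _, tot_g, tot_l = shares[parent]
    return Fraction(tot_g - my_g_sum, tot_g), Fraction(tot_l - my_l_sum, tot_l)

print(effective("a1", "A"))                  # 20/80 and 30/90, reduced
print(default_subclass("A", ["a1", "a2"]))   # 55/80 and 35/90, reduced
```

Using exact fractions sidesteps rounding questions; a real tool would presumably render these as percentages or permill values.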
Note: the reason for choosing the scheme above is to allow absolute values to be specified while retaining the flexibility of changing all subclass shares without requiring an atomic update to all their values. Another option considered and abandoned was to specify relative shares only (where the tot_* values would not be explicitly stated/modifiable but would be calculated by summing the my_* values of all children).

<resource name> identifies the resource controller. Changes to share values through writes and requests for shares through reads get passed on to each affected resource controller by the Core. As usual, unix file permissions take care of access control to the shares file. Each shares file need not contain entries for all managed resources. If a resource's share is unspecified, the class's tasks are deemed to belong to the parent's default class.

2.6 Gathering usage statistics
------------------------------

Statistics on resource use can be gathered from the 'stats' file in each class directory. To make portability and scripting easier, the data is in plain text. Each stats file will have two lines for each managed resource in some format like

<resource name> total #1 #1_unit #2 #2_unit #3 #3_unit : #n #n_unit
<resource name> local #1 #1_unit #2 #2_unit #3 #3_unit : #n #n_unit

where
#n      : value (number) of the n'th statistic exported for <resource name>
          by its controller, e.g. value of avg time using the resource,
          value of avg. delay in waiting for access to the resource
#n_unit : units of the n'th statistic in plain text, e.g. ticks, ms, us,
          pages etc.

The statistics listed under "<resource name> local" are the values of the resource consumption by the class's default subclass alone. They are expected to be updated frequently by the controller. On the other hand, the statistics listed under "<resource name> total" correspond to the sum of statistics for the default subclass and all other subclasses. It thus represents the total consumption for the parent class.
These values only get updated lazily, at a frequency decided by each controller individually. To get the current accurate value for total usage at a class level, userspace tools, similar to top(1), will be provided to add up (non-atomically) the local values of the children and the parent's default subclass.

Each managed resource always has an entry in the ./stats file. Thus the file can be used to discover the managed resources in a system.

2.7 Changing a task's class
---------------------------

A task can be classified into a class by writing its pid into the target class's target file as follows:

echo "<pid>" > /path/to/class/target

The pid is a positive number for a normal task ID, a negative number for a process group, similar to the sending of signals. A zero value represents the calling task's pid.

The unix file permissions on the magic target file in the chosen class determine whether or not the process is allowed to change into a certain task class. The chmod(2) system call can be used to change the permissions on who can join a task class. A special 'target' file is used instead of the class's directory so that permissions to join a task class can be configured separately from permissions to query statistics about the class.

Manual reclassification for socket accept queue control uses a parameter different from pid, as explained in Section 5. A task can also get classified automatically if an optional Classification Engine is present, as described in Section 4.

2.8 Monitor Mode
----------------

CKRM will support two modes of operation: monitor and manage. Manage mode is what has been described so far, with each class having guarantees and limits. In monitor mode, a task's class is used only to track its resource usage and not to control its resource allocation. For allocation purposes, all tasks are considered to be in the systemwide default class and, ideally, get resources allocated just as they would in a kernel without CKRM.
Usage statistics still get collected and reported per-class as in the manage mode. CKRM will continue to use lazy updates of the "total" statistics at each level of the class hierarchy, as described in Section 2.6. Despite this, a CKRM-enabled kernel in monitor mode may still incur a small performance penalty compared to one in which CKRM is disabled.

The mode of CKRM can be set by writing "rcfs:mode:manage" or "rcfs:mode:monitor" into /rcfs/config.

3.0 Resource Controllers
------------------------

Resource controllers are the kernel code that enforce the class-based control and supply the class-based statistics. They are typically implemented as patches to the existing controllers (or schedulers) with two primary design objectives:

- minimize impact on users not interested in class-based control
- respect shares of the class hierarchy as far as possible while keeping code complexity and performance overheads low.

CKRM currently provides resource controllers for the primary physical resources such as CPU (ticks), physical memory (page frames), block I/O (per-device bandwidth) and inbound network (socket accept queues). In the future, additional controllers for virtual resources such as open files and shared memory segments are being considered as well.

Resource controllers need to register each managed resource separately. It is possible for one patch/module to regulate several related resources, e.g. the controller providing class-based control over open files and shared memory segments could be the same but needs to register each of the resources separately.

Typically resource controllers will have private objects for each CKRM class. The Core patch provides data structures to associate this private data with the class objects it creates.
When classes get modified (creation, deletion, tasks moving in and out, share changes, requests for usage statistics), the Core invokes appropriate callbacks for each of the registered resources to do the necessary changes and return data for that resource. These callbacks and their related functions form a Core->Resource Controller API that is internal to the kernel. The CKRM Core will provide helper functions for resource controllers to traverse the class hierarchy to update statistics lazily, calculate share values etc. The Core->RC API will be described in more detail after the User API and high level design is finalized. http://ckrm.sf.net/ provides some idea of what the Core->RC API will look like.

3.1 Resource controller configuration
-------------------------------------

Resource controllers can be configured by writing resource specific config parameters to /rcfs/config in the format

echo "<resource name>:<param name>:<param value>" > /rcfs/config
e.g. echo "cpu:active:1" > /rcfs/config

The name and semantics of config parameters supported are resource specific and opaque to CKRM Core. Reading /rcfs/config will list all configurable parameters (for rcfs as a whole and each managed resource) in the same format as they are written.

Configuration of resource controllers can be done at any level of the /rcfs hierarchy by writing a resource-specific configuration parameter to /path/to/class/config, e.g.

echo "cpu:timeslice:15" > /rcfs/class_A/config

As before, the syntax and semantics of configuration parameters are determined by the controller implementation. Controllers are always free to ignore unsupported parameters.

4.0 Classification Engine (CE)
------------------------------

As described briefly in Section 1, a Classification Engine is an optional kernel module that can automatically reclassify tasks after significant kernel events (called reclassification events) such as exec, setuid and setgid.
Fork is also a significant event with a CE callback, though it is expected that the child will inherit its parent's class rather than be reclassified immediately. A CE module has to register with the kernel, at which time it exports a table of callback functions, potentially one for each reclassification event.

CKRM's Core patch hooks reclassification event points in the kernel. If a CE is present, it is queried for the class to which the task should belong after the event. The Core then moves the task to the appropriate class. The CE callback is also invoked so that the CE can update any state it maintains as part of its classification logic.

At any time, only one CE can be registered with the kernel. CEs can move between active and inactive states after registration to allow lightweight, temporary disabling of automatic classification.

The user interface to the CE will also be through the /rcfs filesystem. When a CE module registers, Core creates the /rcfs/ce directory. Core provides an interface to the CE for the latter to create magic files under /rcfs/ce. During registration, the CE also provides callbacks for create/delete/read/write operations to files under /rcfs/ce. This enables the Core and /rcfs interfaces to handle /rcfs/ce files opaquely. /rcfs/ce is used to dynamically configure the CE, including changes to the logic it implements for reclassifying tasks. The configuration files and operations for a typical CE such as RBCE are listed in Section 4.3.

Note: Instead of Core creating its own set of hooks into sys_fork/sys_exec etc., the CKRM project is considering using the security_* hooks of Linux Security Modules (LSM), which are already in the mainline 2.6 kernel. Some issues with LSM, such as stackability of its modules and availability of all necessary hooks, are under investigation.

4.1 Manual reclassification with an active CE
---------------------------------------------

When a CE has registered and is active, tasks get automatically reclassified.
However, a sysadmin/user can override the CE by manually moving a task to a different class under /rcfs. Following such a manual reclassification, the task is deemed to be outside the CE's control. Future reclassification events affecting the task will not result in it being reclassified according to the CE's logic. The per-task override could be implemented either by marking the task so that Core does not call the CE's reclassifier, or by creating a specially tagged rule in the CE which returns the null class (so the CE's reclassifier is called but does not return a valid class).

A task can be put back under the CE's control by writing its pid to the /rcfs/ce/reclassify magic file. This causes the task to get reclassified using the CE's logic immediately as well as at all future reclassification events.

4.2 Application Tags
--------------------

The CKRM Core adds a 'tag' field to task_struct to enable applications to assist the CE in reclassification. To set/get a task's tag field, two new system calls are introduced by CKRM:

int sys_tsk_settag ( int pid, int len, void *tag )
int sys_tsk_gettag ( int pid, int len, void *tag )

As before, the pid is a positive number for a normal task ID, a negative number for a process group and zero represents the calling task's pid.

Setting a task's tag is particularly useful when relatively trusted server applications such as databases or webservers are running on a system and doing work on behalf of multiple classes. By setting their tag values, such applications can tell the CE what work they are doing and the CE can map the tag to the appropriate class.

There is some risk in allowing applications to set their tags, which could result in their being classified to classes with more resource shares. Hence a task needs the appropriate capability to set its own tag. If the application cannot be trusted, a trusted userlevel agent could set its tag after performing additional verification.
In all cases, the tag value is opaque to CKRM core and resource controllers.

4.3 Rule Based Classification Engine (RBCE)
-------------------------------------------

The CKRM project will provide a general purpose CE called the Rule-Based Classification Engine (RBCE). RBCE evaluates an ordered list of rules provided by the user to classify tasks. Each rule is a logical AND of rule terms and a target class. Each rule term is an attribute-value pair where the attribute can be one of several relevant members of the task_struct, including the newly introduced tag.

RBCE defines the following files under /rcfs/ce:

1. Magic files

|-- info       - read only file detailing how to setup and use RBCE.
|-- reclassify - contains nothing. Writing a pid to it reclassifies the
                 given task according to the current set of rules.
                 Writing "all" to it reclassifies all tasks in the
                 system. This is typically done by the user/sysadmin
                 after changing/creating rules.
|-- state      - determines whether RBCE is currently active or
                 inactive. Writing 1 (0) activates (deactivates) the
                 CE. Reading the file returns the current state.

2. Rules subdirectory: Each rule of the RBCE is represented by a file in /rcfs/ce/rules. The sysadmin writes lines to the file in one of the following formats to define or modify a rule:

<*id> <OP> number     where <OP>={>,<,=} and <*id>={uid,euid,gid,egid}
cmd = "string"        // basename of the command
pathname = "string"   // full pathname of the command
args = "string"       // argv[1] - argv[argc] of command
apptag = "string"     // application tag of the task
[+,-]depend = rule_filename_1, rule_filename_2...
                      // used to chain a rule's terms with existing rules
                      // to avoid respecifying the latter's rule terms.
                      // A rule's dependent rules are evaluated before
                      // its rule terms get evaluated.
                      // An optional + or - can precede the depend keyword.
                      // +depend adds a dependent rule to the tail of the
                      // current chain, -depend removes an existing
                      // dependent rule.
                      // A rulefile shows depend = none if the rule has
                      // no dependencies.
order = number        // order in which this rule is executed relative to
                      // other independent rules.
                      // The rule with order 1 is checked first and so on.
                      // As soon as a rule matches, the class of that rule
                      // is returned to Core. So, order really matters.
                      // If no order is specified by the user, the next
                      // highest available order number is assigned to
                      // the rule.
class = "/rcfs/.../classname"
                      // target class of this rule.
                      // /rcfs all by itself indicates the
                      // systemwide default class
state = "number"      // 1 or 0, provides the ability to deactivate a
                      // specific rule, if needed.
ipv4 = "string"       // ipv4 address in dotted decimal and port
                      // e.g. "127.0.0.1\80"
                      // e.g. "*\80" for CE to match any address
                      // used in socket accept queue classes (see Section 5)
ipv6 = "string"       // ipv6 address in hex and port
                      // e.g. "fe80::4567\80"
                      // e.g. "*\80" for CE to match any address
                      // used in socket accept queue classes (see Section 5)

Note: Instead of one file per rule, RBCE could use a single file to list all rules, one per line. Given the potentially large number of terms a rule could have, the file-per-rule approach may be cleaner.

5.0 Socket accept queue classes
-------------------------------

There are two types of resources that are controlled using CKRM. The first type are system wide resources (such as CPU time) that are apportioned into classes. Every task is assigned to a class in the system. Another form of classification is for resources that are controlled entirely within a process or a system object's context. For example: the socket.

In a typical setup, such as a webserver, connection requests are received from 1000s of clients. We would like to distinguish among the requests based on the importance of the client to the server.
The clients are therefore assigned to different classes, such as gold/silver/bronze. A client belonging to the gold class will have its requests honoured at a higher rate (proportional to the class's share) than one belonging to the silver or bronze class, e.g. class gold might be the paying customers while bronze includes the 'window shoppers'.

In this section we focus on the inbound network connection control, i.e. TCP connection requests queued in the socket's accept queue. Therefore, there are two levels of classes:

1. comprising the listening address(es) and associated port
2. the peer's classification (done on the contents of the TCP SYN packet header using iptables)

Every listening socket is assigned multiple accept queues - one per accept queue class. The connection requests are appended to these queues depending on the class. The accept() call picks the requests from the accept queues (classes) in accordance with the shares assigned to them.

5.1 /rcfs/network file hierarchy
--------------------------------

The 'listening' classes are listed in /rcfs/network/socket_aq. The 'peer-classification' is controlled by "accept queue" sub-classes.

The info file specifies the default accept queue class and the number of accept queue classes. This file is initialised by the resource controller. In the example below the accept queues are numbered 0-7. The incoming requests can be MARKed 0-7 using iptable rules.

The members directory lists the ip_address/port pairs that fall under this class. There is also a symlink to the owning task's pid. The rest of the files such as target/shares etc. are the same for the network classes as for any other class.

The internal subclasses 0-7 do not honour moves to target, nor do they list any entries under /members. Subclasses cannot be created below the accept_queue classes (e.g. 0-7), nor can one create more accept queue classes than are supported (the range is provided in the 'info' file).
/rcfs/network/sockets_aq
|-- file info: - accept_q class names: 0 - 7
               - default accept_q class: 0
|-- target
|-- stats: usage statistics

    /Class_A
    |-- target
    |-- shares: 1000 1000 1000 1000
    |-- stats: usage statistics
    |-- /members
    |       /<ip_address>\<port>
    |       /proc/<pid>
        /1
        |-- shares: 500 500
        |-- stats: usage statistics
        /2
        |-- shares: 400 500
        |-- stats: usage statistics

    /Class_B
    |-- target
    |-- shares: 1000 1000 1000 1000
    |-- stats: usage statistics
    |-- /members
    |       /<ip_address>\<port>
    |       /proc/<pid>
        /1
        |-- shares: 200 500
        |-- stats: usage statistics
        /2
        |-- shares: 400 500
        |-- stats: usage statistics
        /3
        |-- shares: 400 500
        |-- stats: usage statistics

A change in the shares of a particular accept queue class (0-7) will
cause the resource controller to modify the accept queue shares in
the sockets associated with all the ipaddr\port members of the
network class.

As with the process classes, the user/admin can move a listening
socket to the desired class by executing the following command:

    echo "<ip_address\port>" > /path/to/class/target

6.0 Example Control flow using CKRM
-----------------------------------

A typical usage of a CKRM-enabled system is given below:

- Core is active as soon as the kernel is booted.
- resource controller 1 registers
- resource controller 2 registers
  :
  :
- No task is classified; resource controllers handle tasks in
  default mode
  :
  :
- User defines multiple task classes
- User sets shares for each of the resource classes defined
- User manually moves some tasks to some of the newly defined classes
  and these tasks get regulated according to the new shares.
  :
  :
- Classification engine registers
  :
  :
- User tells CE how to associate tasks with previously defined
  classes (in RBCE, this is done by specifying rules and policies)
- On significant kernel events (exec/setuid etc.), the affected task
  gets reclassified automatically according to the CE's rules
- Tasks are allocated resources based on the shares of the task class
  to which they belong
  :
  :
- User specifies iptables rules to MARK incoming TCP connections
- A task calls listen() on a socket (thus specifying an
  ip address/port)
- CE returns the network/socket_aq class for the socket
- Core modifies the shares associated with the socket's acceptq
  classes
- Incoming connections get MARKed by netfilter and are delivered to
  the listening task in proportion to their acceptq class share
  :
  :
- User gets resource usage of different classes (task and network)
- User manually moves some tasks to different classes
  :
- User moves all tasks out of a class and then deletes it
  :
  :

-- End --

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-26 15:57   ` Rik van Riel
@ 2004-02-26 19:28     ` Shailabh Nagar
  0 siblings, 0 replies; 66+ messages in thread
From: Shailabh Nagar @ 2004-02-26 19:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Williams, Nuno Silva, Timothy Miller, John Lee,
      linux-kernel, ckrm-tech

Rik van Riel wrote:
> On Thu, 26 Feb 2004, Peter Williams wrote:
>
>> Another idea that we are playing with for handling programs like
>> xmms (i.e. programs that require guaranteed CPU bandwidth to
>> perform well) is the complement of caps, namely per-task CPU
>> reservations.
>
>> Of course, this won't solve the "need to be root" problem as this
>> is obviously the sort of control that should be reserved for root
>
> Not necessarily. We've just fixed this dilemma in the CKRM project,
> using a resource class filesystem for this kind of stuff.
>
> A user could have a certain percentage of the CPU guaranteed
> (especially the console user) and carve out part of his/her
> guarantee for multimedia applications.
>
> Please see the attached document, which is the 6th draft of this
> particular CKRM design. If you have any improvements for this spec,
> feel free to let us know ;)

The CKRM API has also been posted separately as an RFC on lkml
today... just in case it's missed deep down in this thread!

-- Shailabh

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-26  2:48 ` Nuno Silva
  2004-02-26  4:25   ` Peter Williams
@ 2004-02-26 16:12   ` Timothy Miller
  1 sibling, 0 replies; 66+ messages in thread
From: Timothy Miller @ 2004-02-26 16:12 UTC (permalink / raw)
  To: Nuno Silva; +Cc: John Lee, linux-kernel

Nuno Silva wrote:
>
> And the user *can* do something about it, just nice -n 19 the builds
> and leave xmms alone. (Or you can use other player... :-)

You're assuming that the person using xmms is also the user doing the
build. I am not making the same assumption.

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-25 14:35 John Lee
  2004-02-25 17:09 ` Timothy Miller
@ 2004-02-25 22:51 ` Pavel Machek
  2004-02-26  3:14   ` John Lee
  2004-02-25 23:45 ` Rik van Riel
  2004-02-26  7:18 ` John Lee
  3 siblings, 1 reply; 66+ messages in thread
From: Pavel Machek @ 2004-02-25 22:51 UTC (permalink / raw)
  To: John Lee; +Cc: linux-kernel

Hi!

> CPU usage rate caps
> -------------------
>
> A task's CPU usage rate cap imposes a soft (or hard) upper limit on
> the rate at which it can use CPU resources and can be set/read via
> the files
>
> /proc/<pid>/cpu_rate_cap
> /proc/<tgid>/task/<pid>/cpu_rate_cap
>
> Usage rate caps are expressed as rational numbers (e.g. "1 / 2")
> and hard caps are signified by a "!" suffix. The rational number
> indicates the proportion of a single CPU's capacity that the task
> may use. The value of the number must be in the range 0.0 to 1.0
> inclusive for soft caps. For hard caps there is an additional
> restriction that a value of 0.0 is not permitted. Tasks with a
> soft cap of 0.0 become true background tasks and only get to run
> when no other tasks are active.

Why not use something like percent, parts per million or whatever?

> When hard capped tasks exceed their cap they are removed from the
> run queues and placed in a "sinbin" for a short while until their
> usage rate decays to within limits.

How do you solve this one?

I want to kill your system.

I launch task A, "semaphore grabber", that does filesystem
operations. Those need semaphores. I run it as "true background".

I wait for A to grab some lock, then I run B, which is while(1);

A holds a lock that cannot be unlocked, and your system is dead.

This may happen randomly, even without me on your system.

								Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

^ permalink raw reply [flat|nested] 66+ messages in thread
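The cap syntax quoted above ("1 / 2", with an optional "!" suffix for
hard caps) is simple enough to parse. The sketch below mirrors the
documented rules (values in [0, 1]; a hard cap of 0 is not permitted)
but is an illustration only, not the patch's actual kernel parser.

```python
def parse_cpu_rate_cap(text):
    """Parse a cap such as '1 / 2' or '3 / 4!' into (fraction, is_hard).

    Soft caps may be anywhere in [0.0, 1.0]; hard caps (trailing '!')
    additionally may not be 0.0, per the documented restriction.
    """
    text = text.strip()
    hard = text.endswith("!")
    if hard:
        text = text[:-1]
    num_s, _, den_s = text.partition("/")
    num, den = int(num_s), int(den_s)
    if den <= 0:
        raise ValueError("denominator must be positive")
    frac = num / den
    if not 0.0 <= frac <= 1.0:
        raise ValueError("cap must be in the range 0.0 to 1.0")
    if hard and frac == 0.0:
        raise ValueError("a hard cap of 0.0 is not permitted")
    return frac, hard
```

For example, parse_cpu_rate_cap("1 / 2") yields a soft cap of half a
CPU, while "3 / 4!" yields a hard cap of three quarters.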
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-25 22:51 ` Pavel Machek
@ 2004-02-26  3:14   ` John Lee
  0 siblings, 0 replies; 66+ messages in thread
From: John Lee @ 2004-02-26 3:14 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

On Wed, 25 Feb 2004, Pavel Machek wrote:

> > Usage rate caps are expressed as rational numbers (e.g. "1 / 2")
> > and hard caps are signified by a "!" suffix. The rational number
> > indicates the proportion of a single CPU's capacity that the task
> > may use. The value of the number must be in the range 0.0 to 1.0
> > inclusive for soft caps. For hard caps there is an additional
> > restriction that a value of 0.0 is not permitted. Tasks with a
> > soft cap of 0.0 become true background tasks and only get to run
> > when no other tasks are active.
>
> Why not use something like percent, parts per million or whatever?

Fair comment. Fine granularity with percentages would require decimal
points and a function in the kernel to parse that value - maybe I
could get around to doing that. But I suppose ppm could certainly be
used. We just chose rational numbers to start with.

> > When hard capped tasks exceed their cap they are removed from the
> > run queues and placed in a "sinbin" for a short while until their
> > usage rate decays to within limits.
>
> How do you solve this one?

The task is removed from the runqueue and a timer is scheduled to put
it back onto the runqueue. The delay period is the amount of time
required for that task's usage to decay to below its cap.

> I want to kill your system.
>
> I launch task A, "semaphore grabber", that does filesystem
> operations. Those need semaphores. I run it as "true background".
>
> I wait for A to grab some lock, then I run B, which is while(1);
>
> A holds a lock that cannot be unlocked, and your system is dead.
>
> This may happen randomly, even without me on your system.

Good point. We'll have to rethink background priorities.
John

^ permalink raw reply [flat|nested] 66+ messages in thread
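If the usage estimate decays exponentially with the filter's response
half-life, as the Kalman-style averaging described at the top of the
thread suggests, the sinbin delay John describes has a closed form.
This is a sketch under that assumption; the half-life and rates below
are illustrative, not values taken from the patch.

```python
import math

def sinbin_delay(usage, cap, half_life):
    """Time until an exponentially decaying usage estimate falls to the cap.

    usage, cap: CPU usage rates as fractions of one CPU
                (0 < cap <= 1, 0 <= usage <= 1)
    half_life:  decay half-life of the usage estimate, in the same
                time units as the returned delay.

    Solving usage * 2**(-t / half_life) <= cap for t gives
    t >= half_life * log2(usage / cap).
    """
    if usage <= cap:
        return 0.0  # already within limits; no sinbin time needed
    return half_life * math.log2(usage / cap)

# e.g. a task at 80% of a CPU, hard-capped at 20%, with a 100 ms
# half-life: two halvings (0.8 -> 0.4 -> 0.2) take 200 ms.
delay = sinbin_delay(0.8, 0.2, 0.100)
```

The timer John mentions would then be armed for this delay before the
task is returned to the runqueue.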
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-25 14:35 John Lee
  2004-02-25 17:09 ` Timothy Miller
  2004-02-25 22:51 ` Pavel Machek
@ 2004-02-25 23:45 ` Rik van Riel
  2004-02-26  7:18 ` John Lee
  3 siblings, 0 replies; 66+ messages in thread
From: Rik van Riel @ 2004-02-25 23:45 UTC (permalink / raw)
  To: John Lee; +Cc: linux-kernel

On Thu, 26 Feb 2004, John Lee wrote:

> Only one heuristic

I really like the fact that this patch gets rid of most of the "magic
tricks" that the current O(1) scheduler needs in order to work well
for most interactive users.

> O(1) task promotion

... while still being O(1)

The share based stuff should also tie in nicely with the various
resource management projects out there.

cheers,

Rik
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [RFC][PATCH] O(1) Entitlement Based Scheduler
  2004-02-25 14:35 John Lee
  ` (2 preceding siblings ...)
  2004-02-25 23:45 ` Rik van Riel
@ 2004-02-26  7:18 ` John Lee
  3 siblings, 0 replies; 66+ messages in thread
From: John Lee @ 2004-02-26 7:18 UTC (permalink / raw)
  To: linux-kernel

On Thu, 26 Feb 2004, John Lee wrote:

> The patch can be downloaded from
>
> <http://sourceforge.net/projects/ebs-linux/>
>
> Please note that there are 2 patches: the basic patch and the full
> patch. The above description applies to the full patch. The basic
> patch only features setting shares via nice, a fixed half life and
> timeslice, no statistics and soft caps only. This basic patch is
> for those who are mainly interested in looking at the core EBS
> changes to the stock scheduler.
>
> The patches are against 2.6.2 (2.6.3 patches will be available
> shortly).

2.6.3 patches are now available.

John

^ permalink raw reply [flat|nested] 66+ messages in thread
end of thread, other threads:[~2004-03-05 3:55 UTC | newest]
Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-02-26 3:30 [RFC][PATCH] O(1) Entitlement Based Scheduler Albert Cahalan
2004-02-26 6:19 ` Peter Williams
2004-02-26 17:57 ` Albert Cahalan
2004-02-26 23:24 ` Peter Williams
2004-03-01 3:47 ` Peter Williams
[not found] <1vuMd-5jx-5@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-7@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-9@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-11@gated-at.bofh.it>
[not found] ` <1vuMd-5jx-3@gated-at.bofh.it>
[not found] ` <1vvyx-6jy-13@gated-at.bofh.it>
[not found] ` <1vBE2-48V-21@gated-at.bofh.it>
2004-03-03 21:38 ` Bill Davidsen
[not found] <fa.fi4j08o.17nchps@ifi.uio.no.suse.lists.linux.kernel>
[not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no.suse.lists.linux.kernel>
[not found] ` <yydjishqw10p.fsf@galizur.uio.no.suse.lists.linux.kernel>
[not found] ` <40426E1C.8010806@aurema.com.suse.lists.linux.kernel>
2004-03-03 2:48 ` Andi Kleen
2004-03-03 3:45 ` Peter Williams
2004-03-03 10:13 ` Andi Kleen
2004-03-03 23:46 ` Peter Williams
2004-03-03 15:57 ` Andi Kleen
2004-03-04 0:41 ` Peter Williams
2004-03-05 3:55 ` Andi Kleen
[not found] <fa.ftul5bl.nlk3pr@ifi.uio.no>
[not found] ` <fa.cvc8vnj.ahebjd@ifi.uio.no>
2004-03-01 9:18 ` Joachim B Haga
2004-03-01 10:18 ` Paul Wagland
2004-03-01 19:11 ` Mike Fedyk
[not found] <fa.jgj0bdi.b3u6qk@ifi.uio.no>
2004-03-01 1:54 ` Andy Lutomirski
2004-03-01 2:54 ` Peter Williams
2004-03-01 3:46 ` Andy Lutomirski
2004-03-01 4:18 ` Peter Williams
2004-03-02 23:36 ` Peter Williams
[not found] <894006121@toto.iv>
2004-03-01 0:00 ` Peter Chubb
2004-03-02 1:25 ` Peter Williams
[not found] <fa.fi4j08o.17nchps@ifi.uio.no>
[not found] ` <fa.ctat17m.8mqa3c@ifi.uio.no>
2004-02-29 11:58 ` Joachim B Haga
2004-02-29 20:39 ` Paul Jackson
2004-02-29 22:56 ` Peter Williams
[not found] <1t8wp-qF-11@gated-at.bofh.it>
[not found] ` <1th6J-az-13@gated-at.bofh.it>
[not found] ` <403E2929.2080705@tmr.com>
2004-02-27 3:44 ` Rik van Riel
2004-02-28 21:27 ` Bill Davidsen
2004-02-28 23:55 ` Peter Williams
2004-03-04 21:08 ` Timothy Miller
[not found] <1tfy0-7ly-29@gated-at.bofh.it>
[not found] ` <1thzJ-A5-13@gated-at.bofh.it>
[not found] ` <1tjrN-2m5-1@gated-at.bofh.it>
[not found] ` <1tjLa-2Ab-9@gated-at.bofh.it>
[not found] ` <1tlaf-3OY-11@gated-at.bofh.it>
[not found] ` <1tljX-3Wf-5@gated-at.bofh.it>
[not found] ` <1tznd-CP-35@gated-at.bofh.it>
[not found] ` <1tzQe-10s-25@gated-at.bofh.it>
2004-02-26 20:14 ` Bill Davidsen
[not found] <fa.f12rt3d.c0s9rt@ifi.uio.no>
[not found] ` <fa.ishajoq.q5g90m@ifi.uio.no>
2004-02-25 23:33 ` Junio C Hamano
2004-02-26 8:15 ` Catalin BOIE
-- strict thread matches above, loose matches on Subject: below --
2004-02-25 14:35 John Lee
2004-02-25 17:09 ` Timothy Miller
2004-02-25 22:12 ` John Lee
2004-02-26 0:31 ` Timothy Miller
2004-02-26 2:04 ` John Lee
2004-02-26 2:18 ` Peter Williams
2004-02-26 2:42 ` Mike Fedyk
2004-02-26 4:10 ` Peter Williams
2004-02-26 4:19 ` Mike Fedyk
2004-02-26 19:23 ` Shailabh Nagar
2004-02-26 19:46 ` Mike Fedyk
2004-02-26 20:42 ` Shailabh Nagar
2004-02-26 16:10 ` Timothy Miller
2004-02-26 19:47 ` Mike Fedyk
2004-02-26 22:51 ` Peter Williams
2004-02-27 10:06 ` Helge Hafting
2004-02-27 11:04 ` Peter Williams
2004-02-26 16:08 ` Timothy Miller
2004-02-26 16:51 ` Rik van Riel
2004-02-26 20:15 ` Peter Williams
2004-02-27 14:46 ` Timothy Miller
2004-02-28 5:00 ` Peter Williams
2004-03-04 21:18 ` Robert White
2004-03-04 23:15 ` Peter Williams
2004-02-26 2:48 ` Nuno Silva
2004-02-26 4:25 ` Peter Williams
2004-02-26 15:57 ` Rik van Riel
2004-02-26 19:28 ` Shailabh Nagar
2004-02-26 16:12 ` Timothy Miller
2004-02-25 22:51 ` Pavel Machek
2004-02-26 3:14 ` John Lee
2004-02-25 23:45 ` Rik van Riel
2004-02-26 7:18 ` John Lee
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox