* Re: [RFC 3/5] sched: Add CPU rate hard caps
@ 2006-06-01 21:03 Al Boldi
2006-06-02 1:33 ` Peter Williams
0 siblings, 1 reply; 27+ messages in thread
From: Al Boldi @ 2006-06-01 21:03 UTC (permalink / raw)
To: linux-kernel
Chandra Seetharaman wrote:
> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> > Kirill Korotaev wrote:
> > >> Do you have any documented requirements for container resource
> > >> management?
> > >> Is there a minimum list of features and nice to have features for
> > >> containers
> > >> as far as resource management is concerned?
> > >
> > > Sure! You can check OpenVZ project (http://openvz.org) for example of
> > > required resource management. BTW, I must agree with other people here
> > > who noticed that per-process resource management is really useless and
> > > hard to use :(
>
> I totally agree.
>
> > I'll take a look at the references. I agree with you that it will be
> > useful to have resource management for a group of tasks.
For Resource Management to be useful it must depend on Resource Control.
Resource Control depends on per-process accounting. Per-process accounting,
when abstracted sufficiently, may enable higher level routines, preferably
in userland, to extend functionality at will. All efforts should really go
into the successful abstraction of per-process accounting.
Thanks!
--
Al
^ permalink raw reply [flat|nested] 27+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-01 21:03 [RFC 3/5] sched: Add CPU rate hard caps Al Boldi
@ 2006-06-02 1:33 ` Peter Williams
2006-06-02 11:23 ` Matt Helsley
0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-06-02 1:33 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-kernel
Al Boldi wrote:
> Chandra Seetharaman wrote:
>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>> Kirill Korotaev wrote:
>>>>> Do you have any documented requirements for container resource
>>>>> management?
>>>>> Is there a minimum list of features and nice to have features for
>>>>> containers
>>>>> as far as resource management is concerned?
>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of
>>>> required resource management. BTW, I must agree with other people here
>>>> who noticed that per-process resource management is really useless and
>>>> hard to use :(
>> I totally agree.
>>
>>> I'll take a look at the references. I agree with you that it will be
>>> useful to have resource management for a group of tasks.
>
> For Resource Management to be useful it must depend on Resource Control.
> Resource Control depends on per-process accounting. Per-process accounting,
> when abstracted sufficiently, may enable higher level routines, preferably
> in userland, to extend functionality at will. All efforts should really go
> into the successful abstraction of per-process accounting.

I couldn't agree more. All that's needed in the kernel is low level per
task control and statistics gathering. The rest can be done in user space.

Peter

PS I'm a big fan of the various efforts to improve the quality of the
performance statistics that are exported from the kernel and my only wish
is that they get together to create one comprehensive solution.
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-02 1:33 ` Peter Williams
@ 2006-06-02 11:23 ` Matt Helsley
2006-06-02 13:16 ` Peter Williams
2006-06-06 10:47 ` Srivatsa Vaddagiri
0 siblings, 2 replies; 27+ messages in thread
From: Matt Helsley @ 2006-06-02 11:23 UTC (permalink / raw)
To: Peter Williams
Cc: LKML, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir, Balbir Singh,
    Mike Galbraith, Peter Williams, Con Kolivas, Sam Vilain,
    Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
    Chandra S. Seetharaman
On Fri, 2006-06-02 at 11:33 +1000, Peter Williams wrote:
> Al Boldi wrote:
> > Chandra Seetharaman wrote:
> >> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> >>> Kirill Korotaev wrote:
> >>>>> Do you have any documented requirements for container resource
> >>>>> management?
> >>>>> Is there a minimum list of features and nice to have features for
> >>>>> containers
> >>>>> as far as resource management is concerned?
> >>>> Sure! You can check OpenVZ project (http://openvz.org) for example of
> >>>> required resource management. BTW, I must agree with other people here
> >>>> who noticed that per-process resource management is really useless and
> >>>> hard to use :(
> >> I totally agree.
> >>
> >>> I'll take a look at the references. I agree with you that it will be
> >>> useful to have resource management for a group of tasks.
> >
> > For Resource Management to be useful it must depend on Resource Control.
> > Resource Control depends on per-process accounting. Per-process accounting,
> > when abstracted sufficiently, may enable higher level routines, preferably
> > in userland, to extend functionality at will. All efforts should really go
> > into the successful abstraction of per-process accounting.
>
> I couldn't agree more. All that's needed in the kernel is low level per
> task control and statistics gathering. The rest can be done in user space.

<snip>

I'm assuming by "The rest can be done in user space" you mean that
tasks can be grouped, accounting information updated (% CPU), and
various knobs (nice) can be turned to keep task resource (CPU) usage
under control.

If I seem to be describing your suggestion then I don't think it will
work. Below you'll find the reasons I've come to this conclusion. Am I
oversimplifying or misunderstanding something critical?

Groups are needed to prevent processes from consuming unlimited
resources using clone/fork. However, since our accounting sources and
control knobs are per-task we must adjust per-task knobs within a group
every time accounting indicates a change in resource usage.

Let us suppose we have a UP system with 3 tasks -- group X: X1, X2; and
Z. By adjusting the nice values of X1 and X2, Z is responsible for
ensuring that group X does not exceed its limit of 50% CPU. Further
suppose that X1 and X2 are each using 25% of the CPU. In order to
prevent X1 + X2 from exceeding 50% each must be limited to 25% by an
appropriate nice value. [Note the hand wave: I'm assuming nice can be
mapped to a predictable percentage of CPU on a UP system.]

When accounting data indicates X2 has dropped to 15% of the CPU, Z may
raise X1's limit (to 35% at most) and it must lower X2's limit (down to
as little as 15%). Z must raise X1's limit by some amount (delta)
otherwise X1 could never increase its CPU usage. Z must decrease X2 to
25 - delta, otherwise the sum could exceed 50%. [Aside: In fact, if we
have N tasks in group X then it seems Z ought to adjust N nice values by
a total of delta. How delta gets distributed limits the rate at which
CPU usage may increase and would ideally depend on future changes in
usage.]

There are two problems as I see it:

1) If X1 grows to use 35% then X2's usage can't grow back from 15% until
X1 relents. This seems unpleasantly like cooperative scheduling within
group X because if we take this to its limit X2 gets 0% and X1 gets 50%
-- effectively starving X2. What little I know about nice suggests this
wouldn't really happen. However I think it may highlight one case where
fiddling with nice can't effectively control CPU usage.

2) Suppose we add group Y with tasks Y1-YM, Y's CPU usage is limited to
49%, each task of Y uses its limit of (49/M)% CPU, and the remaining 1%
is left for Z (i.e. the single CPU is being used heavily). Z must use
this 1% to read accounting information and adjust nice values as
described above. If X1 spawns X3 we're likely in trouble -- Z might not
get to run for a while but X3 has inherited X1's nice value. If we
return to our initial assumption that X1 and X2 are each using their
limit of 25% then X3 will get limited to 25% too. The sum of Xi can now
exceed 50% until Z is scheduled next. This only gets worse if there is
an imbalance between X1 and X2 as described earlier. In that case group
X could use 100% CPU until Z is scheduled! It also probably gets worse
as load increases and the number of scheduling opportunities for Z
decrease.

I don't see how task Z could solve the second problem. As with UP, in
SMP I think it depends on when Z (or one Z fixed to each CPU) is
scheduled.

I think these are simple scenarios that demonstrate the problem with
splitting resource management into accounting and control with userspace
in between.

Cheers,
	-Matt Helsley

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-02 11:23 ` Matt Helsley
@ 2006-06-02 13:16 ` Peter Williams
2006-06-06 10:47 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-06-02 13:16 UTC (permalink / raw)
To: Matt Helsley
Cc: LKML, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir, Balbir Singh,
    Mike Galbraith, Con Kolivas, Sam Vilain, Kingsley Cheung,
    Eric W. Biederman, Ingo Molnar, Rene Herman, Chandra S. Seetharaman
Matt Helsley wrote:
> On Fri, 2006-06-02 at 11:33 +1000, Peter Williams wrote:
>> Al Boldi wrote:
>>> Chandra Seetharaman wrote:
>>>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>>>> Kirill Korotaev wrote:
>>>>>>> Do you have any documented requirements for container resource
>>>>>>> management?
>>>>>>> Is there a minimum list of features and nice to have features for
>>>>>>> containers
>>>>>>> as far as resource management is concerned?
>>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of
>>>>>> required resource management. BTW, I must agree with other people here
>>>>>> who noticed that per-process resource management is really useless and
>>>>>> hard to use :(
>>>> I totally agree.
>>>>
>>>>> I'll take a look at the references. I agree with you that it will be
>>>>> useful to have resource management for a group of tasks.
>>> For Resource Management to be useful it must depend on Resource Control.
>>> Resource Control depends on per-process accounting. Per-process accounting,
>>> when abstracted sufficiently, may enable higher level routines, preferably
>>> in userland, to extend functionality at will. All efforts should really go
>>> into the successful abstraction of per-process accounting.
>> I couldn't agree more. All that's needed in the kernel is low level per
>> task control and statistics gathering. The rest can be done in user space.
>
> <snip>
>
> I'm assuming by "The rest can be done in user space" you mean that
> tasks can be grouped, accounting information updated (% CPU), and
> various knobs (nice) can be turned to keep task resource (CPU) usage
> under control.
>
> If I seem to be describing your suggestion then I don't think it will
> work. Below you'll find the reasons I've come to this conclusion. Am I
> oversimplifying or misunderstanding something critical?
>
> Groups are needed to prevent processes from consuming unlimited
> resources using clone/fork. However, since our accounting sources and
> control knobs are per-task we must adjust per-task knobs within a group
> every time accounting indicates a change in resource usage.
>
> Let us suppose we have a UP system with 3 tasks -- group X: X1, X2; and
> Z. By adjusting the nice values of X1 and X2, Z is responsible for
> ensuring that group X does not exceed its limit of 50% CPU. Further
> suppose that X1 and X2 are each using 25% of the CPU. In order to
> prevent X1 + X2 from exceeding 50% each must be limited to 25% by an
> appropriate nice value. [Note the hand wave: I'm assuming nice can be
> mapped to a predictable percentage of CPU on a UP system.]
>
> When accounting data indicates X2 has dropped to 15% of the CPU, Z may
> raise X1's limit (to 35% at most) and it must lower X2's limit (down to
> as little as 15%). Z must raise X1's limit by some amount (delta)
> otherwise X1 could never increase its CPU usage. Z must decrease X2 to
> 25 - delta, otherwise the sum could exceed 50%. [Aside: In fact, if we
> have N tasks in group X then it seems Z ought to adjust N nice values by
> a total of delta. How delta gets distributed limits the rate at which
> CPU usage may increase and would ideally depend on future changes in
> usage.]
>
> There are two problems as I see it:
>
> 1) If X1 grows to use 35% then X2's usage can't grow back from 15% until
> X1 relents. This seems unpleasantly like cooperative scheduling within
> group X because if we take this to its limit X2 gets 0% and X1 gets 50%
> -- effectively starving X2. What little I know about nice suggests this
> wouldn't really happen. However I think it may highlight one case where
> fiddling with nice can't effectively control CPU usage.
>
> 2) Suppose we add group Y with tasks Y1-YM, Y's CPU usage is limited to
> 49%, each task of Y uses its limit of (49/M)% CPU, and the remaining 1%
> is left for Z (i.e. the single CPU is being used heavily). Z must use
> this 1% to read accounting information and adjust nice values as
> described above. If X1 spawns X3 we're likely in trouble -- Z might not
> get to run for a while but X3 has inherited X1's nice value. If we
> return to our initial assumption that X1 and X2 are each using their
> limit of 25% then X3 will get limited to 25% too. The sum of Xi can now
> exceed 50% until Z is scheduled next. This only gets worse if there is
> an imbalance between X1 and X2 as described earlier. In that case group
> X could use 100% CPU until Z is scheduled! It also probably gets worse
> as load increases and the number of scheduling opportunities for Z
> decrease.
>
> I don't see how task Z could solve the second problem. As with UP, in
> SMP I think it depends on when Z (or one Z fixed to each CPU) is
> scheduled.
>
> I think these are simple scenarios that demonstrate the problem with
> splitting resource management into accounting and control with userspace
> in between.

You're trying to do it all with nice. I said it could be done with nice
plus the CPU capping functionality my patch provides. Plus the stats of
course.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-02 11:23 ` Matt Helsley
2006-06-02 13:16 ` Peter Williams
@ 2006-06-06 10:47 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 27+ messages in thread
From: Srivatsa Vaddagiri @ 2006-06-06 10:47 UTC (permalink / raw)
To: Matt Helsley
Cc: Peter Williams, LKML, Andrew Morton, dev, ckrm-tech, balbir,
    Balbir Singh, Mike Galbraith, Con Kolivas, Sam Vilain,
    Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
    Chandra S. Seetharaman
On Fri, Jun 02, 2006 at 04:23:04AM -0700, Matt Helsley wrote:
> There are two problems as I see it:
>
> 1) If X1 grows to use 35% then X2's usage can't grow back from 15% until
> X1 relents. This seems unpleasantly like cooperative scheduling within
> group X because if we take this to its limit X2 gets 0% and X1 gets 50%
> -- effectively starving X2. What little I know about nice suggests this
> wouldn't really happen. However I think it may highlight one case where
> fiddling with nice can't effectively control CPU usage.

I would expect task Z to adjust the limits of X1, X2 again when it
notices that X2 is "hungry". Until Z gets around to doing that, the
situation you describe will be true. If Z is configured to run quite
frequently (every 5 seconds?) to monitor/adjust limits, then this
starvation (of X2) may be avoided for longer periods?

> 2) Suppose we add group Y with tasks Y1-YM, Y's CPU usage is limited to
> 49%, each task of Y uses its limit of (49/M)% CPU, and the remaining 1%
> is left for Z (i.e. the single CPU is being used heavily). Z must use
> this 1% to read accounting information and adjust nice values as
> described above. If X1 spawns X3 we're likely in trouble -- Z might not
> get to run for a while but X3 has inherited X1's nice value. If we
> return to our initial assumption that X1 and X2 are each using their
> limit of 25% then X3 will get limited to 25% too. The sum of Xi can now
> exceed 50% until Z is scheduled next. This only gets worse if there is
> an imbalance between X1 and X2 as described earlier. In that case group
> X could use 100% CPU until Z is scheduled! It also probably gets worse
> as load increases and the number of scheduling opportunities for Z
> decrease.
>
> I don't see how task Z could solve the second problem. As with UP, in
> SMP I think it depends on when Z (or one Z fixed to each CPU) is
> scheduled.

Wouldn't it help if Z is made to run with nice -20 (or with RT prio
maybe), so that when Z wants to run (every 5 or 10 seconds) it is run
immediately? This is assuming that Z can do its job of adjusting limits
for all tasks "quickly" (maybe 100-200 ms?).

> I think these are simple scenarios that demonstrate the problem with
> splitting resource management into accounting and control with userspace
> in between.
>
> Cheers,
> -Matt Helsley

--
Regards,
vatsa

^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 0/5] sched: Add CPU rate caps
@ 2006-05-26 4:20 Peter Williams
2006-05-26 4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-26 4:20 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
    Ingo Molnar, Rene Herman

These patches implement CPU usage rate limits for tasks.

Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
it is a total usage limit and therefore (to my mind) not very useful.
These patches provide an alternative whereby the (recent) average CPU
usage rate of a task can be limited to a (per task) specified proportion
of a single CPU's capacity. The limits are specified in parts per
thousand and come in two varieties -- hard and soft. The difference
between the two is that the system tries to enforce hard caps regardless
of the other demand for CPU resources but allows soft caps to be
exceeded if there are spare CPU resources available. By default, tasks
will have both caps set to 1000 (i.e. no limit) but newly forked tasks
will inherit any caps that have been imposed on their parent. The
minimum soft cap allowed is 0 (which effectively puts the task in the
background) and the minimum hard cap allowed is 1.

Care has been taken to minimize the overhead inflicted on tasks that
have no caps and my tests using kernbench indicate that it is hidden in
the noise.

Note: The first patch in this series fixes some problems with priority
inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in the
next -mm kernel.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
@ 2006-05-26 4:20 ` Peter Williams
2006-05-26 6:58 ` Kari Hurtta
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-26 4:20 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
    Ingo Molnar, Rene Herman

This patch implements hard CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>

 include/linux/sched.h |    8 ++
 kernel/Kconfig.caps   |   14 +++-
 kernel/sched.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 168 insertions(+), 8 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:46:35.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 11:00:07.000000000 +1000
@@ -796,6 +796,10 @@ struct task_struct {
 #ifdef CONFIG_CPU_RATE_CAPS
 	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
 	unsigned int cpu_rate_cap;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	unsigned int cpu_rate_hard_cap;
+	struct timer_list sinbin_timer;
+#endif
 #endif
 
 	enum sleep_type sleep_type;
@@ -994,6 +998,10 @@ struct task_struct {
 #ifdef CONFIG_CPU_RATE_CAPS
 unsigned int get_cpu_rate_cap(const struct task_struct *);
 int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
+int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
+#endif
 #endif
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps	2006-05-26 10:45:26.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps	2006-05-26 11:00:07.000000000 +1000
@@ -3,11 +3,21 @@
 #
 
 config CPU_RATE_CAPS
-	bool "Support (soft) CPU rate caps"
+	bool "Support CPU rate caps"
 	default n
 	---help---
-	  Say y here if you wish to be able to put a (soft) upper limit on
+	  Say y here if you wish to be able to put a soft upper limit on
 	  the rate of CPU usage by individual tasks.  A task which has been
 	  allocated a soft CPU rate cap will be limited to that rate of CPU
 	  usage unless there is spare CPU resources available after the needs
 	  of uncapped tasks are met.
+
+config CPU_RATE_HARD_CAPS
+	bool "Support CPU rate hard caps"
+	depends on CPU_RATE_CAPS
+	default n
+	---help---
+	  Say y here if you wish to be able to put a hard upper limit on
+	  the rate of CPU usage by individual tasks.  A task which has been
+	  allocated a hard CPU rate cap will be limited to that rate of CPU
+	  usage regardless of whether there is spare CPU resources available.
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 11:00:02.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 13:50:11.000000000 +1000
@@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
 
 #ifdef CONFIG_CPU_RATE_CAPS
 #define CAP_STATS_OFFSET 8
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static void sinbin_release_fn(unsigned long arg);
+#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
+#else
+#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
+#endif
 #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
 /* this assumes that p is not a real time task */
 #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
 #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
-#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
+#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
 
 static void init_cpu_rate_caps(task_t *p)
 {
 	p->cpu_rate_cap = 1000;
 	p->flags &= ~PF_HAS_CAP;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	p->cpu_rate_hard_cap = 1000;
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.function = sinbin_release_fn;
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void set_cap_flag(task_t *p)
 {
-	if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
+	if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
 		p->flags |= PF_HAS_CAP;
 	else
 		p->flags &= ~PF_HAS_CAP;
@@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
 
 static inline int task_exceeding_cap(const task_t *p)
 {
-	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
+	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const
 	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
 	lhs = (p->avg_cpu_per_cycle + delta) * 1000;
-	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+	rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
 
 	return lhs > rhs;
 }
@@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
 {
 	p->avg_cpu_per_cycle = 0;
 	p->avg_cycle_length = 0;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
@@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
 	p->array = NULL;
 }
 
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
+
+/*
+ * Release a task from the sinbin
+ */
+static void sinbin_release_fn(unsigned long arg)
+{
+	unsigned long flags;
+	struct task_struct *p = (struct task_struct*)arg;
+	struct runqueue *rq = task_rq_lock(p, &flags);
+
+	p->prio = effective_prio(p);
+
+	__activate_task(p, rq);
+
+	task_rq_unlock(rq, &flags);
+}
+
+static unsigned long reqd_sinbin_ticks(const task_t *p)
+{
+	unsigned long long res;
+
+	res = p->avg_cpu_per_cycle * 1000;
+
+	if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
+		(void)do_div(res, p->cpu_rate_hard_cap);
+		res -= p->avg_cpu_per_cycle;
+		/*
+		 * IF it was available we'd also subtract
+		 * the average sleep per cycle here
+		 */
+		res >>= CAP_STATS_OFFSET;
+		(void)do_div(res, (1000000000 / HZ));
+
+		return res ? : 1;
+	}
+
+	return 0;
+}
+
+static void sinbin_task(task_t *p, unsigned long durn)
+{
+	if (durn == 0)
+		return;
+	deactivate_task(p, task_rq(p));
+	p->sinbin_timer.expires = jiffies + durn;
+	add_timer(&p->sinbin_timer);
+}
+#else
+#define task_has_hard_cap(p) 0
+#define reqd_sinbin_ticks(p) 0
+
+static inline void sinbin_task(task_t *p, unsigned long durn)
+{
+}
+#endif
+
 /*
  * resched_task - mark a task 'to be rescheduled now'.
  *
@@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
 		}
 	}
 
-	/* do this now so that stats are correct for SMT code */
-	if (task_has_cap(prev))
+	if (task_has_cap(prev)) {
 		inc_cap_stats_both(prev, now);
+		if (task_has_hard_cap(prev) && !prev->state &&
+		    !rt_task(prev) && !signal_pending(prev)) {
+			unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
+
+			if (sinbin_ticks)
+				sinbin_task(prev, sinbin_ticks);
+		}
+	}
 
 	cpu = smp_processor_id();
 	if (unlikely(!rq->nr_running)) {
@@ -4539,6 +4620,67 @@ out:
 }
 
 EXPORT_SYMBOL(set_cpu_rate_cap);
+
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_hard_cap;
+}
+
+EXPORT_SYMBOL(get_cpu_rate_hard_cap);
+
+/*
+ * Require: 1 <= new_cap <= 1000
+ */
+int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
+{
+	int is_allowed;
+	unsigned long flags;
+	struct runqueue *rq;
+	int delta;
+
+	if (new_cap > 1000 && new_cap > 0)
+		return -EINVAL;
+	is_allowed = capable(CAP_SYS_NICE);
+	/*
+	 * We have to be careful, if called from /proc code,
+	 * the task might be in the middle of scheduling on another CPU.
+	 */
+	rq = task_rq_lock(p, &flags);
+	delta = new_cap - p->cpu_rate_hard_cap;
+	if (!is_allowed) {
+		/*
+		 * Ordinary users can set/change caps on their own tasks
+		 * provided that the new setting is MORE constraining
+		 */
+		if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+			task_rq_unlock(rq, &flags);
+			return -EPERM;
+		}
+	}
+	/*
+	 * The RT tasks don't have caps, but we still allow the caps to be
+	 * set - but as expected it wont have any effect on scheduling until
+	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
+	 */
+	p->cpu_rate_hard_cap = new_cap;
+
+	if (has_rt_policy(p))
+		goto out;
+
+	if (p->array)
+		dec_raw_weighted_load(rq, p);
+	set_load_weight(p);
+	if (p->array)
+		inc_raw_weighted_load(rq, p);
+out:
+	task_rq_unlock(rq, &flags);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_hard_cap);
+#endif
 #endif
 
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
@ 2006-05-26 6:58 ` Kari Hurtta
2006-05-27 1:00 ` Peter Williams
2006-05-26 11:00 ` Con Kolivas
2006-05-27 6:48 ` Balbir Singh
2 siblings, 1 reply; 27+ messages in thread
From: Kari Hurtta @ 2006-05-26 6:58 UTC (permalink / raw)
To: linux-kernel

Peter Williams <pwil3058@bigpond.net.au> writes in gmane.linux.kernel:

> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

> + * Require: 1 <= new_cap <= 1000
> + */
> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
> +{
> +	int is_allowed;
> +	unsigned long flags;
> +	struct runqueue *rq;
> +	int delta;
> +
> +	if (new_cap > 1000 && new_cap > 0)
> +		return -EINVAL;

That condition looks wrong.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 6:58 ` Kari Hurtta
@ 2006-05-27 1:00 ` Peter Williams
0 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-27 1:00 UTC (permalink / raw)
To: Kari Hurtta; +Cc: linux-kernel

Kari Hurtta wrote:
> Peter Williams <pwil3058@bigpond.net.au> writes in gmane.linux.kernel:
>
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
>
>> + * Require: 1 <= new_cap <= 1000
>> + */
>> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
>> +{
>> +	int is_allowed;
>> +	unsigned long flags;
>> +	struct runqueue *rq;
>> +	int delta;
>> +
>> +	if (new_cap > 1000 && new_cap > 0)
>> +		return -EINVAL;
>
> That condition looks wrong.

It certainly does.

Thanks
Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
2006-05-26 6:58 ` Kari Hurtta
@ 2006-05-26 11:00 ` Con Kolivas
2006-05-26 13:59 ` Peter Williams
2006-05-27 6:48 ` Balbir Singh
2 siblings, 1 reply; 27+ messages in thread
From: Con Kolivas @ 2006-05-26 11:00 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar,
    Rene Herman

On Friday 26 May 2006 14:20, Peter Williams wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

A hard cap of 1/1000 could lead to interesting starvation scenarios where a
mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
a lesser extent for a 0 soft cap.

Here is how I handle idleprio tasks in current -ck:

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
tags tasks that are holding a mutex

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
is the idleprio policy for staircase.

What it does is runs idleprio tasks as normal tasks when they hold a mutex or
are waking up after calling down() (ie holding a semaphore). These two in
combination have shown resistance to any priority inversion problems in
widespread testing. An attempt was made to track semaphores held via a
down_interruptible() but unfortunately the lack of strict rules about who
could release the semaphore meant accounting of this scenario was impossible.
In practice, though, there were no test cases that showed it to be an issue,
and the recent conversion en-masse of semaphores to mutexes in the kernel
means it has pretty much covered most possibilities.

--
-ck

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 11:00 ` Con Kolivas
@ 2006-05-26 13:59 ` Peter Williams
2006-05-26 14:12 ` Con Kolivas
2006-05-26 14:23 ` Mike Galbraith
0 siblings, 2 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-26 13:59 UTC (permalink / raw)
To: Con Kolivas
Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar,
    Rene Herman

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
>
> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> a lesser extent for a 0 soft cap.
>
> Here is how I handle idleprio tasks in current -ck:
>
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> tags tasks that are holding a mutex
>
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> is the idleprio policy for staircase.
>
> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> are waking up after calling down() (ie holding a semaphore).

I wasn't aware that you could detect those conditions. They could be
very useful.

> These two in
> combination have shown resistance to any priority inversion problems in
> widespread testing. An attempt was made to track semaphores held via a
> down_interruptible() but unfortunately the lack of strict rules about who
> could release the semaphore meant accounting of this scenario was impossible.
> In practice, though, there were no test cases that showed it to be an issue,
> and the recent conversion en-masse of semaphores to mutexes in the kernel
> means it has pretty much covered most possibilities.

Thanks,
Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 13:59 ` Peter Williams
@ 2006-05-26 14:12 ` Con Kolivas
  2006-05-26 14:23 ` Mike Galbraith
  1 sibling, 0 replies; 27+ messages in thread
From: Con Kolivas @ 2006-05-26 14:12 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 23:59, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where
> > a mutex or semaphore was held by a task that hardly ever got cpu. Same
> > goes to a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/track_mutexes-1.patch tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/sched-idleprio-1.7.patch is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a
> > mutex or are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions.  They could be
> very useful.

Ingo's mutex infrastructure made it possible to accurately track mutexes
held, and basically anything entering uninterruptible sleep has called
down(). Mainline, as you know, already flags the latter for interactivity
purposes.

--
-ck

^ permalink raw reply	[flat|nested] 27+ messages in thread
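[Editor's note: the rule Con describes — an idleprio task is treated as a normal task while it holds a mutex or is waking after down() — can be sketched in isolation as follows. This is a hypothetical simplification for illustration; the struct fields, the macro value, and the function name are invented here and are not staircase's actual code.]

```c
/* Minimal model of the idleprio demotion rule described above.
 * mutexes_held would be maintained by Ingo's mutex lock/unlock
 * infrastructure; woken_from_down would be flagged on wakeup from
 * uninterruptible sleep (i.e. after down()). */
struct task {
	int policy;          /* SCHED_IDLEPRIO or a normal policy */
	int mutexes_held;    /* count of mutexes currently held */
	int woken_from_down; /* just woke after holding a semaphore */
};

#define SCHED_IDLEPRIO 5 /* illustrative value only */

/* Returns 1 if the task should actually be scheduled at idle priority. */
static int runs_as_idleprio(const struct task *p)
{
	if (p->policy != SCHED_IDLEPRIO)
		return 0;
	/* Holding a lock while starved would invert priorities for
	 * whoever waits on that lock, so run the task normally. */
	if (p->mutexes_held || p->woken_from_down)
		return 0;
	return 1;
}
```

The point of the rule is visible in the second test: the moment a lock is held, the task temporarily stops being "idle only", which is what gives the scheme its resistance to priority inversion.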
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 13:59 ` Peter Williams
  2006-05-26 14:12 ` Con Kolivas
@ 2006-05-26 14:23 ` Mike Galbraith
  2006-05-27  0:16 ` Peter Williams
  1 sibling, 1 reply; 27+ messages in thread
From: Mike Galbraith @ 2006-05-26 14:23 UTC (permalink / raw)
To: Peter Williams
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> > mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> > a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> > tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> > is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> > are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions.  They could be
> very useful.

Isn't this exactly what the PI code is there to handle?  Is something
more than PI needed?

	-Mike

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 14:23 ` Mike Galbraith
@ 2006-05-27  0:16 ` Peter Williams
  2006-05-27  9:28 ` Mike Galbraith
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-27 0:16 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>> Con Kolivas wrote:
>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>> single CPU's capacity expressed in parts per thousand.
>>>
>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
>>> a lesser extent to a 0 soft cap.
>>>
>>> Here is how I handle idleprio tasks in current -ck:
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>> tags tasks that are holding a mutex
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>> is the idleprio policy for staircase.
>>>
>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
>>> are waking up after calling down() (ie holding a semaphore).
>>
>> I wasn't aware that you could detect those conditions.  They could be
>> very useful.
>
> Isn't this exactly what the PI code is there to handle?  Is something
> more than PI needed?

AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
be extended.  It could be argued that extending PI so that it can be
used by non RT tasks is a worthwhile endeavour in its own right.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  0:16 ` Peter Williams
@ 2006-05-27  9:28 ` Mike Galbraith
  2006-05-28  2:09 ` Peter Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Galbraith @ 2006-05-27 9:28 UTC (permalink / raw)
To: Peter Williams
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> >> Con Kolivas wrote:
> >>> On Friday 26 May 2006 14:20, Peter Williams wrote:
> >>>> This patch implements hard CPU rate caps per task as a proportion of a
> >>>> single CPU's capacity expressed in parts per thousand.
> >>>
> >>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> >>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> >>> a lesser extent to a 0 soft cap.
> >>>
> >>> Here is how I handle idleprio tasks in current -ck:
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> >>> tags tasks that are holding a mutex
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> >>> is the idleprio policy for staircase.
> >>>
> >>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> >>> are waking up after calling down() (ie holding a semaphore).
> >>
> >> I wasn't aware that you could detect those conditions.  They could be
> >> very useful.
> >
> > Isn't this exactly what the PI code is there to handle?  Is something
> > more than PI needed?
>
> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
> be extended.  It could be argued that extending PI so that it can be
> used by non RT tasks is a worthwhile endeavour in its own right.

Hm.  Looking around a bit, it appears to me that we're one itty bitty
redefine away from PI being global.  No idea if/when that will happen
though.

	-Mike

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  9:28 ` Mike Galbraith
@ 2006-05-28  2:09 ` Peter Williams
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-28 2:09 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
>> Mike Galbraith wrote:
>>> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>>>> Con Kolivas wrote:
>>>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>>>> single CPU's capacity expressed in parts per thousand.
>>>>>
>>>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
>>>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
>>>>> a lesser extent to a 0 soft cap.
>>>>>
>>>>> Here is how I handle idleprio tasks in current -ck:
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>>>> tags tasks that are holding a mutex
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>>>> is the idleprio policy for staircase.
>>>>>
>>>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
>>>>> are waking up after calling down() (ie holding a semaphore).
>>>>
>>>> I wasn't aware that you could detect those conditions.  They could be
>>>> very useful.
>>>
>>> Isn't this exactly what the PI code is there to handle?  Is something
>>> more than PI needed?
>>
>> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
>> be extended.  It could be argued that extending PI so that it can be
>> used by non RT tasks is a worthwhile endeavour in its own right.
>
> Hm.  Looking around a bit, it appears to me that we're one itty bitty
> redefine away from PI being global.  No idea if/when that will happen
> though.

It needs slightly more than that.  It's currently relying on the way
tasks with prio less than MAX_RT_PRIO are treated to prevent the
priority of tasks that are inheriting a priority from having that
priority reset to their normal priority at various places in sched.c.
So something would need to be done in that regard, but it shouldn't be
too difficult.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
  2006-05-26  6:58 ` Kari Hurtta
  2006-05-26 11:00 ` Con Kolivas
@ 2006-05-27  6:48 ` Balbir Singh
  2006-05-27  8:44 ` Peter Williams
  2 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2006-05-27 6:48 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.
>
> Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
>
>  include/linux/sched.h |    8 ++
>  kernel/Kconfig.caps   |   14 +++-
>  kernel/sched.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 168 insertions(+), 8 deletions(-)
>
> Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:46:35.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 11:00:07.000000000 +1000
> @@ -796,6 +796,10 @@ struct task_struct {
>  #ifdef CONFIG_CPU_RATE_CAPS
>  	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
>  	unsigned int cpu_rate_cap;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +	unsigned int cpu_rate_hard_cap;
> +	struct timer_list sinbin_timer;

Using a timer for releasing tasks from their sinbin sounds like a bit
of an overhead, given that there could be 10s of thousands of tasks.
Is it possible to use the scheduler_tick() function to take a look at
all deactivated tasks (as efficiently as possible) and activate them
when it's time to activate them, or just fold the functionality in by
defining a time quantum after which everyone is woken up?  This time
quantum could be the same as the time over which limits are honoured.

> +#endif
>  #endif
>  	enum sleep_type sleep_type;
>
> @@ -994,6 +998,10 @@ struct task_struct {
>  #ifdef CONFIG_CPU_RATE_CAPS
>  unsigned int get_cpu_rate_cap(const struct task_struct *);
>  int set_cpu_rate_cap(struct task_struct *, unsigned int);
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
> +int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
> +#endif
>  #endif
>
>  static inline pid_t process_group(struct task_struct *tsk)
> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps	2006-05-26 10:45:26.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps	2006-05-26 11:00:07.000000000 +1000
> @@ -3,11 +3,21 @@
>  #
>
>  config CPU_RATE_CAPS
> -	bool "Support (soft) CPU rate caps"
> +	bool "Support CPU rate caps"
>  	default n
>  	---help---
> -	  Say y here if you wish to be able to put a (soft) upper limit on
> +	  Say y here if you wish to be able to put a soft upper limit on
>  	  the rate of CPU usage by individual tasks.  A task which has been
>  	  allocated a soft CPU rate cap will be limited to that rate of CPU
>  	  usage unless there is spare CPU resources available after the needs
>  	  of uncapped tasks are met.
> +
> +config CPU_RATE_HARD_CAPS
> +	bool "Support CPU rate hard caps"
> +	depends on CPU_RATE_CAPS
> +	default n
> +	---help---
> +	  Say y here if you wish to be able to put a hard upper limit on
> +	  the rate of CPU usage by individual tasks.  A task which has been
> +	  allocated a hard CPU rate cap will be limited to that rate of CPU
> +	  usage regardless of whether there is spare CPU resources available.
> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 11:00:02.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 13:50:11.000000000 +1000
> @@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
>
>  #ifdef CONFIG_CPU_RATE_CAPS
>  #define CAP_STATS_OFFSET 8
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +static void sinbin_release_fn(unsigned long arg);
> +#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
> +#else
> +#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
> +#endif
>  #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
>  /* this assumes that p is not a real time task */
>  #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
>  #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
> -#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
> +#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
>
>  static void init_cpu_rate_caps(task_t *p)
>  {
>  	p->cpu_rate_cap = 1000;
>  	p->flags &= ~PF_HAS_CAP;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +	p->cpu_rate_hard_cap = 1000;
> +	init_timer(&p->sinbin_timer);
> +	p->sinbin_timer.function = sinbin_release_fn;
> +	p->sinbin_timer.data = (unsigned long) p;
> +#endif
>  }
>
>  static inline void set_cap_flag(task_t *p)
>  {
> -	if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
> +	if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
>  		p->flags |= PF_HAS_CAP;
>  	else
>  		p->flags &= ~PF_HAS_CAP;
> @@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
>
>  static inline int task_exceeding_cap(const task_t *p)
>  {
> -	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
> +	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
>  }
>
>  #ifdef CONFIG_SCHED_SMT
> @@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const
>
>  	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>  	lhs = (p->avg_cpu_per_cycle + delta) * 1000;
> -	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
> +	rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
>
>  	return lhs > rhs;
>  }
> @@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
>  {
>  	p->avg_cpu_per_cycle = 0;
>  	p->avg_cycle_length = 0;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +	init_timer(&p->sinbin_timer);
> +	p->sinbin_timer.data = (unsigned long) p;
> +#endif
>  }
>
>  static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> @@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
>  	p->array = NULL;
>  }
>
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
> +
> +/*
> + * Release a task from the sinbin
> + */
> +static void sinbin_release_fn(unsigned long arg)
> +{
> +	unsigned long flags;
> +	struct task_struct *p = (struct task_struct*)arg;
> +	struct runqueue *rq = task_rq_lock(p, &flags);
> +
> +	p->prio = effective_prio(p);
> +
> +	__activate_task(p, rq);
> +
> +	task_rq_unlock(rq, &flags);
> +}
> +
> +static unsigned long reqd_sinbin_ticks(const task_t *p)
> +{
> +	unsigned long long res;
> +
> +	res = p->avg_cpu_per_cycle * 1000;
> +
> +	if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
> +		(void)do_div(res, p->cpu_rate_hard_cap);
> +		res -= p->avg_cpu_per_cycle;
> +		/*
> +		 * IF it was available we'd also subtract
> +		 * the average sleep per cycle here
> +		 */
> +		res >>= CAP_STATS_OFFSET;
> +		(void)do_div(res, (1000000000 / HZ));

Please use NSEC_PER_SEC if that is what 10^9 stands for in the above
calculation.

> +
> +		return res ? : 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void sinbin_task(task_t *p, unsigned long durn)
> +{
> +	if (durn == 0)
> +		return;
> +	deactivate_task(p, task_rq(p));
> +	p->sinbin_timer.expires = jiffies + durn;
> +	add_timer(&p->sinbin_timer);
> +}
> +#else
> +#define task_has_hard_cap(p) 0
> +#define reqd_sinbin_ticks(p) 0
> +
> +static inline void sinbin_task(task_t *p, unsigned long durn)
> +{
> +}
> +#endif
> +
>  /*
>   * resched_task - mark a task 'to be rescheduled now'.
>   *
> @@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
>  		}
>  	}
>
> -	/* do this now so that stats are correct for SMT code */
> -	if (task_has_cap(prev))
> +	if (task_has_cap(prev)) {
>  		inc_cap_stats_both(prev, now);
> +		if (task_has_hard_cap(prev) && !prev->state &&
> +		    !rt_task(prev) && !signal_pending(prev)) {
> +			unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
> +
> +			if (sinbin_ticks)
> +				sinbin_task(prev, sinbin_ticks);
> +		}
> +	}
>
>  	cpu = smp_processor_id();
>  	if (unlikely(!rq->nr_running)) {
> @@ -4539,6 +4620,67 @@ out:
>  }
>
<snip>

Balbir
Linux Technology Center
IBM Software Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread
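[Editor's note: both of Balbir's review comments concern the arithmetic in reqd_sinbin_ticks(), so a standalone rendering may help. The following is a hypothetical userspace sketch of that calculation, not the patch's exact code: the NSEC_PER_SEC and HZ values are illustrative, plain division replaces do_div(), and the CAP_STATS_OFFSET fixed-point shift is omitted by treating the averages as plain nanosecond values.]

```c
#define NSEC_PER_SEC 1000000000ULL
#define HZ 1000UL /* illustrative tick rate */

/* How many timer ticks must a task sit in the sinbin so that its
 * smoothed CPU usage falls back under its hard cap?  All times are in
 * nanoseconds; hard_cap_ppt is the cap in parts per thousand. */
static unsigned long sinbin_ticks(unsigned long long avg_cpu_per_cycle,
				  unsigned long long avg_cycle_length,
				  unsigned int hard_cap_ppt)
{
	unsigned long long res = avg_cpu_per_cycle * 1000;

	/* only sinbin if usage/cycle currently exceeds cap/1000 */
	if (res > avg_cycle_length * hard_cap_ppt) {
		/* cycle length at which the current usage equals the cap */
		res /= hard_cap_ppt;
		/* subtract the CPU time already consumed; the remainder
		 * is how long the task must be kept off the runqueue
		 * (prior sleep in the cycle is deliberately ignored, as
		 * the patch's own comment notes) */
		res -= avg_cpu_per_cycle;
		/* convert nanoseconds to ticks, rounding up to at least 1 */
		res /= (NSEC_PER_SEC / HZ);
		return res ? res : 1;
	}

	return 0;
}
```

For example, a task that consumed 50 ms of CPU over a 100 ms cycle under a 100/1000 (10%) hard cap needs a 500 ms cycle for that usage to be legal, so it is sinbinned for the remaining 450 ms, i.e. 450 ticks at HZ=1000.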
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  6:48 ` Balbir Singh
@ 2006-05-27  8:44 ` Peter Williams
  2006-05-31 13:10 ` Kirill Korotaev
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-27 8:44 UTC (permalink / raw)
To: Balbir Singh
Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Balbir Singh wrote:
>
> Using a timer for releasing tasks from their sinbin sounds like a bit
> of an overhead.  Given that there could be 10s of thousands of tasks.

The more runnable tasks there are the less likely it is that any of them
is exceeding its hard cap due to normal competition for the CPUs.  So I
think that it's unlikely that there will ever be a very large number of
tasks in the sinbin at the same time.

> Is it possible to use the scheduler_tick() function take a look at all
> deactivated tasks (as efficiently as possible) and activate them when
> its time to activate them or just fold the functionality by defining a
> time quantum after which everyone is worken up.  This time quantum
> could be the same as the time over which limits are honoured.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  8:44 ` Peter Williams
@ 2006-05-31 13:10 ` Kirill Korotaev
  2006-05-31 15:59 ` Balbir Singh
  2006-05-31 23:28 ` Peter Williams
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill Korotaev @ 2006-05-31 13:10 UTC (permalink / raw)
To: Peter Williams
Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

>> Using a timer for releasing tasks from their sinbin sounds like a bit
>> of an overhead.  Given that there could be 10s of thousands of tasks.
>
> The more runnable tasks there are the less likely it is that any of them
> is exceeding its hard cap due to normal competition for the CPUs.  So I
> think that it's unlikely that there will ever be a very large number of
> tasks in the sinbin at the same time.

for containers this can be untrue... :( actually even for 1000 tasks
(I suppose this is the maximum in your case) it can slow down
significantly as well.

>> Is it possible to use the scheduler_tick() function take a look at all
>> deactivated tasks (as efficiently as possible) and activate them when
>> its time to activate them or just fold the functionality by defining a
>> time quantum after which everyone is worken up.  This time quantum
>> could be the same as the time over which limits are honoured.

agree with it.

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 13:10 ` Kirill Korotaev
@ 2006-05-31 15:59 ` Balbir Singh
  2006-05-31 18:09 ` Mike Galbraith
  ` (2 more replies)
  3 siblings, 3 replies; 27+ messages in thread
From: Balbir Singh @ 2006-05-31 15:59 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>> of an overhead.  Given that there could be 10s of thousands of tasks.
>>
>> The more runnable tasks there are the less likely it is that any of
>> them is exceeding its hard cap due to normal competition for the
>> CPUs.  So I think that it's unlikely that there will ever be a very
>> large number of tasks in the sinbin at the same time.
>
> for containers this can be untrue... :( actually even for 1000 tasks (I
> suppose this is the maximum in your case) it can slowdown significantly
> as well.

Do you have any documented requirements for container resource management?
Is there a minimum list of features and nice to have features for
containers as far as resource management is concerned?

>>> Is it possible to use the scheduler_tick() function take a look at all
>>> deactivated tasks (as efficiently as possible) and activate them when
>>> its time to activate them or just fold the functionality by defining a
>>> time quantum after which everyone is worken up.  This time quantum
>>> could be the same as the time over which limits are honoured.
>
> agree with it.

Thinking a bit more along these lines, it would probably break O(1).
But I guess a good algorithm can amortize the cost.

> Kirill

--
Balbir Singh,
Linux Technology Center,
IBM Software Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59 ` Balbir Singh
@ 2006-05-31 18:09 ` Mike Galbraith
  2006-06-01  7:41 ` Kirill Korotaev
  2006-06-01 23:43 ` Peter Williams
  2 siblings, 0 replies; 27+ messages in thread
From: Mike Galbraith @ 2006-05-31 18:09 UTC (permalink / raw)
To: balbir
Cc: Kirill Korotaev, Peter Williams, Balbir Singh, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Wed, 2006-05-31 at 21:29 +0530, Balbir Singh wrote:
> Do you have any documented requirements for container resource management?

(?? where would that come from?)

Containers, I can imagine ~working (albeit I don't see why the
num_tasks dilution problem shouldn't apply to num_containers... it's
the same thing, stale info)

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59 ` Balbir Singh
  2006-05-31 18:09 ` Mike Galbraith
@ 2006-06-01  7:41 ` Kirill Korotaev
  2006-06-01  8:34 ` Balbir Singh
  2006-06-01 23:43 ` Peter Williams
  2 siblings, 1 reply; 27+ messages in thread
From: Kirill Korotaev @ 2006-06-01 7:41 UTC (permalink / raw)
To: balbir
Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman, Sam Vilain, Andrew Morton, Eric W. Biederman

>>> The more runnable tasks there are the less likely it is that any of
>>> them is exceeding its hard cap due to normal competition for the
>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue... :( actually even for 1000 tasks
>> (I suppose this is the maximum in your case) it can slowdown
>> significantly as well.
>
> Do you have any documented requirements for container resource management?
> Is there a minimum list of features and nice to have features for
> containers
> as far as resource management is concerned?

Sure! You can check OpenVZ project (http://openvz.org) for an example of
the required resource management. BTW, I must agree with other people here
who noticed that per-process resource management is really useless and
hard to use :(

Briefly about required resource management:

1) CPU:
- fairness (i.e. prioritization of containers). For this we use an SFQ-like
fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses a
token bucket algorithm. I can provide more details on this if you are
interested.
- cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For
this we account the time in cycles, and after some credit is used we
delay container execution. We use cycles because our experiments show that
statistical algorithms work poorly on some patterns :(
- cpu guarantees. I'm not sure any of the solutions provide this yet.

2) disk:
- overall disk quota for container
- per-user/group quotas inside container

in OpenVZ we wrote a 2-level disk quota which works on disk subtrees.
vserver imho uses a 1 partition per container approach.

- disk I/O bandwidth:
we started to use CFQv2, but it is quite poor in this regard. First, it
doesn't prioritize writes and async disk operations :( And even for
sync reads we found some problems we are working on now...

3) memory and other resources.
- memory
- files
- signals and so on and so on.
For example, in OpenVZ we have user resource beancounters (original
author is Alan Cox), which account the following set of parameters:
kernel memory (vmas, page tables, different structures etc.), dcache
pinned size, different user pages (locked, physical, private, shared),
number of files, sockets, ptys, signals, network buffers, netfilter
rules etc.

4. network bandwidth
traffic shaping is already ok here.

>>>> Is it possible to use the scheduler_tick() function take a look at all
>>>> deactivated tasks (as efficiently as possible) and activate them when
>>>> its time to activate them or just fold the functionality by defining a
>>>> time quantum after which everyone is worken up.  This time quantum
>>>> could be the same as the time over which limits are honoured.
>>
>> agree with it.
>
> Thinking a bit more along these lines, it would probably break O(1).
> But I guess a good algorithm can amortize the cost.

this is the price to pay. but it happens quite rarely, as was noticed
already...

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  7:41 ` Kirill Korotaev
@ 2006-06-01  8:34 ` Balbir Singh
  2006-06-01 23:47 ` Sam Vilain
  0 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2006-06-01 8:34 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman, Sam Vilain, Andrew Morton, Eric W. Biederman, Srivatsa, ckrm-tech

Hi, Kirill,

Kirill Korotaev wrote:
>> Do you have any documented requirements for container resource
>> management?
>> Is there a minimum list of features and nice to have features for
>> containers
>> as far as resource management is concerned?
>
> Sure! You can check OpenVZ project (http://openvz.org) for example of
> required resource management. BTW, I must agree with other people here
> who noticed that per-process resource management is really useless and
> hard to use :(

I'll take a look at the references. I agree with you that it will be
useful to have resource management for a group of tasks.

> Briefly about required resource management:
> 1) CPU:
> - fairness (i.e. prioritization of containers). For this we use SFQ like
> fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses
> token bucket algorithm. I can provide more details on this if you are
> interested.

Yes, any information or pointers to them will be very useful.

> - cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For
> this we account the time in cycles. And after some credit is used do
> delay of container execution. We use cycles as our experiments show that
> statistical algorithms work poorly on some patterns :(
> - cpu guarantees. I'm not sure any of solutions provide this yet.

ckrm has a solution to provide cpu guarantees. I think as far as CPU
resource management is concerned (limits or guarantees), there are
common problems to be solved, for example

1. Tracking when a limit or a guarantee is not met
2. Taking a decision to cap the group
3. Selecting the next task to execute (keeping O(1) in mind)

For the existing resource controller in OpenVZ I would be interested
in information on the kinds of patterns it does not perform well on
and the patterns it performs well on.

> 2) disk:
> - overall disk quota for container
> - per-user/group quotas inside container
>
> in OpenVZ we wrote a 2level disk quota which works on disk subtrees.
> vserver imho uses 1 partition per container approach.
>
> - disk I/O bandwidth:
> we started to use CFQv2, but it is quite poor in this regard. First, it
> doesn't prioritizes writes and async disk operations :( And even for
> sync reads we found some problems we work on now...
>
> 3) memory and other resources.
> - memory
> - files
> - signals and so on and so on.
> For example, in OpenVZ we have user resource beancounters (original
> author is Alan Cox), which account the following set of parameters:
> kernel memory (vmas, page tables, different structures etc.), dcache
> pinned size, different user pages (locked, physical, private, shared),
> number of files, sockets, ptys, signals, network buffers, netfilter
> rules etc.
>
> 4. network bandwidth
> traffic shaping is already ok here.

Traffic shaping is just for outgoing traffic, right? How about incoming
traffic (through the accept call)?

These are a great set of requirements. Thanks for putting them together.

>> Thinking a bit more along these lines, it would probably break O(1).
>> But I guess a good
>> algorithm can amortize the cost.
>
> this is the price to pay. but it happens quite rarelly as was noticed
> already...

Yes, agreed.

> Kirill

--
Balbir Singh,
Linux Technology Center,
IBM Software Labs

PS: I am also cc'ing ckrm-tech and srivatsa

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  8:34 ` Balbir Singh
@ 2006-06-01 23:47 ` Sam Vilain
  0 siblings, 0 replies; 27+ messages in thread
From: Sam Vilain @ 2006-06-01 23:47 UTC (permalink / raw)
To: balbir
Cc: Kirill Korotaev, Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman, Andrew Morton, Eric W. Biederman, Srivatsa, ckrm-tech

Balbir Singh wrote:
>> 1) CPU:
>> - fairness (i.e. prioritization of containers). For this we use SFQ like
>> fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses
>> token bucket algorithm. I can provide more details on this if you are
>> interested.
>
> Yes, any information or pointers to them will be very useful.

A general description of the token bucket scheduler is on the Vserver
wiki at

  http://linux-vserver.org/Linux-VServer-Paper-06

I also just described it on a nearby thread -

  http://lkml.org/lkml/2006/5/28/122

Sam.

^ permalink raw reply	[flat|nested] 27+ messages in thread
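[Editor's note: for readers without the paper handy, the token-bucket CPU limiter referenced above can be sketched as a simple model. This is an illustrative simplification, not Linux-VServer's actual code; the struct, field names, and parameters are invented here. A context earns fill_rate tokens every interval ticks up to a bucket capacity, spends one token per tick of CPU it consumes, and is descheduled while the bucket is empty.]

```c
/* Simplified token-bucket CPU limiter, one bucket per context. */
struct tb_context {
	long tokens;     /* current bucket fill */
	long tokens_max; /* bucket capacity (burst allowance) */
	long fill_rate;  /* tokens added per interval */
	long interval;   /* ticks between refills */
};

/* Refill the bucket after `ticks` ticks have elapsed, clamping at
 * capacity so an idle context can only accumulate a bounded burst. */
static void tb_refill(struct tb_context *ctx, long ticks)
{
	ctx->tokens += (ticks / ctx->interval) * ctx->fill_rate;
	if (ctx->tokens > ctx->tokens_max)
		ctx->tokens = ctx->tokens_max;
}

/* Charge one tick of CPU time; returns 1 if the context may keep
 * running, 0 if it must be taken off the runqueue until refilled. */
static int tb_charge(struct tb_context *ctx)
{
	if (ctx->tokens <= 0)
		return 0;
	ctx->tokens--;
	return 1;
}
```

The long-run CPU share works out to fill_rate/interval, while tokens_max controls how large a burst a context may consume after idling, which is the trade-off the hard-cap discussion in this thread is circling around.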
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59 ` Balbir Singh
  2006-05-31 18:09 ` Mike Galbraith
  2006-06-01  7:41 ` Kirill Korotaev
@ 2006-06-01 23:43 ` Peter Williams
  2 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-06-01 23:43 UTC (permalink / raw)
  To: balbir
  Cc: Kirill Korotaev, Balbir Singh, Mike Galbraith, Con Kolivas,
      Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Balbir Singh wrote:
> Kirill Korotaev wrote:
>>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>
>>> The more runnable tasks there are the less likely it is that any of
>>> them is exceeding its hard cap due to normal competition for the
>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue... :( actually even for 1000 tasks
>> (I suppose this is the maximum in your case) it can slow down
>> significantly as well.
>
> Do you have any documented requirements for container resource management?
> Is there a minimum list of features and nice to have features for
> containers as far as resource management is concerned?
>
>>>> Is it possible to use the scheduler_tick() function to take a look at
>>>> all deactivated tasks (as efficiently as possible) and activate them
>>>> when it's time to activate them or just fold the functionality by
>>>> defining a time quantum after which everyone is woken up. This time
>>>> quantum could be the same as the time over which limits are honoured.
>>
>> agree with it.
>
> Thinking a bit more along these lines, it would probably break O(1). But
> I guess a good algorithm can amortize the cost.

It's also unlikely to be less overhead than using timers.  In fact, my
gut feeling is that you'd actually be doing something very similar to
how timers work only cruder.  I.e. reinventing the wheel.
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 13:10 ` Kirill Korotaev
  2006-05-31 15:59 ` Balbir Singh
@ 2006-05-31 23:28 ` Peter Williams
  2006-06-01  7:44 ` Kirill Korotaev
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-31 23:28 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
      Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>
>> The more runnable tasks there are the less likely it is that any of
>> them is exceeding its hard cap due to normal competition for the
>> CPUs.  So I think that it's unlikely that there will ever be a very
>> large number of tasks in the sinbin at the same time.
>
> for containers this can be untrue...

Why will this be untrue for containers?

> :( actually even for 1000 tasks (I suppose this is the maximum in your
> case) it can slow down significantly as well.
>
>>> Is it possible to use the scheduler_tick() function to take a look at
>>> all deactivated tasks (as efficiently as possible) and activate them
>>> when it's time to activate them or just fold the functionality by
>>> defining a time quantum after which everyone is woken up. This time
>>> quantum could be the same as the time over which limits are honoured.
>
> agree with it.

If there are a lot of RUNNABLE (i.e. on a run queue) tasks then normal
competition will mean that their CPU usage rates are small and therefore
unlikely to be greater than their cap.  The sinbin is only used for
tasks that are EXCEEDING their cap.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 23:28 ` Peter Williams
@ 2006-06-01  7:44 ` Kirill Korotaev
  2006-06-01 23:21 ` Peter Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill Korotaev @ 2006-06-01 7:44 UTC (permalink / raw)
  To: Peter Williams
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
      Kingsley Cheung, Ingo Molnar, Rene Herman

>>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>
>>> The more runnable tasks there are the less likely it is that any of
>>> them is exceeding its hard cap due to normal competition for the
>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue...
>
> Why will this be untrue for containers?

if one container having 100 running tasks inside exceeded its credit,
it should be delayed. i.e. 100 tasks should be placed in sinbin if I
understand your algo correctly. the second container having 100 tasks
as well will do the same.

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  7:44 ` Kirill Korotaev
@ 2006-06-01 23:21 ` Peter Williams
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-06-01 23:21 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
      Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>>
>>>> The more runnable tasks there are the less likely it is that any of
>>>> them is exceeding its hard cap due to normal competition for the
>>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>>> large number of tasks in the sinbin at the same time.
>>>
>>> for containers this can be untrue...
>>
>> Why will this be untrue for containers?
>
> if one container having 100 running tasks inside exceeded its credit,
> it should be delayed. i.e. 100 tasks should be placed in sinbin if I
> understand your algo correctly. the second container having 100 tasks
> as well will do the same.

1. Caps are set on a per task basis not on a group basis.

2. Sinbinning is the last resort and only used for hard caps.  The soft
capping mechanism is also applied to hard capped tasks and natural
competition also tends to reduce usage rates.  In general, sinbinning
will only kick in on lightly loaded systems where there is no
competition for CPU resources.

Further, there is a natural ceiling of 999 per CPU on the number of
tasks that will ever be in the sinbin at the same time.  To achieve this
maximum some very unusual circumstances have to prevail:

1. these 999 tasks must be the only runnable tasks on the system,

2. they all must have a cap of 1/1000, and

3. the distribution of CPU among them must be perfectly fair so that
they all have the expected average usage rate of 1/999.

If you add one more task to this mix the average usage would be 1/1000
and if they all had that none would be exceeding their cap and there
would be no sinbinning at all.

Of course, in reality, half would be slightly above the average and half
slightly below and about 500 would be sinbinned.  But this reality check
also applies to the 999 and somewhat less than 999 would actually be
sinbinned.

As the number of runnable tasks increases beyond 1000 then the number
that have a usage rate greater than their cap will decrease and quickly
reach zero.

So the conclusion is that the maximum number of sinbinned tasks per CPU
is given by:

min(1000 / min_cpu_rate_cap - 1, nr_running)

As you can see, if a minimum cpu cap of 1 causes problems we can just
increase that minimum.

And once again I ask what's so special about containers that changes this?

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2006-06-06 10:47 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-01 21:03 [RFC 3/5] sched: Add CPU rate hard caps Al Boldi
2006-06-02  1:33 ` Peter Williams
2006-06-02 11:23 ` Matt Helsley
2006-06-02 13:16 ` Peter Williams
2006-06-06 10:47 ` Srivatsa Vaddagiri
  -- strict thread matches above, loose matches on Subject: below --
2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
2006-05-26  6:58 ` Kari Hurtta
2006-05-27  1:00 ` Peter Williams
2006-05-26 11:00 ` Con Kolivas
2006-05-26 13:59 ` Peter Williams
2006-05-26 14:12 ` Con Kolivas
2006-05-26 14:23 ` Mike Galbraith
2006-05-27  0:16 ` Peter Williams
2006-05-27  9:28 ` Mike Galbraith
2006-05-28  2:09 ` Peter Williams
2006-05-27  6:48 ` Balbir Singh
2006-05-27  8:44 ` Peter Williams
2006-05-31 13:10 ` Kirill Korotaev
2006-05-31 15:59 ` Balbir Singh
2006-05-31 18:09 ` Mike Galbraith
2006-06-01  7:41 ` Kirill Korotaev
2006-06-01  8:34 ` Balbir Singh
2006-06-01 23:47 ` Sam Vilain
2006-06-01 23:43 ` Peter Williams
2006-05-31 23:28 ` Peter Williams
2006-06-01  7:44 ` Kirill Korotaev
2006-06-01 23:21 ` Peter Williams