* Re: [RFC 3/5] sched: Add CPU rate hard caps
@ 2006-06-01 21:03 Al Boldi
2006-06-02 1:33 ` Peter Williams
0 siblings, 1 reply; 27+ messages in thread
From: Al Boldi @ 2006-06-01 21:03 UTC (permalink / raw)
To: linux-kernel
Chandra Seetharaman wrote:
> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> > Kirill Korotaev wrote:
> > >> Do you have any documented requirements for container resource
> > >> management?
> > >> Is there a minimum list of features and nice to have features for
> > >> containers
> > >> as far as resource management is concerned?
> > >
> > > Sure! You can check OpenVZ project (http://openvz.org) for example of
> > > required resource management. BTW, I must agree with other people here
> > > who noticed that per-process resource management is really useless and
> > > hard to use :(
>
> I totally agree.
>
> > I'll take a look at the references. I agree with you that it will be
> > useful to have resource management for a group of tasks.
For Resource Management to be useful it must depend on Resource Control.
Resource Control depends on per-process accounting. Per-process accounting,
when abstracted sufficiently, may enable higher level routines, preferably
in userland, to extend functionality at will. All efforts should really go
into the successful abstraction of per-process accounting.
Thanks!
--
Al
^ permalink raw reply [flat|nested] 27+ messages in thread

* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-01 21:03 [RFC 3/5] sched: Add CPU rate hard caps Al Boldi
@ 2006-06-02 1:33 ` Peter Williams
2006-06-02 11:23 ` Matt Helsley
0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-06-02 1:33 UTC (permalink / raw)
To: Al Boldi; +Cc: linux-kernel
Al Boldi wrote:
> Chandra Seetharaman wrote:
>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>> Kirill Korotaev wrote:
>>>>> Do you have any documented requirements for container resource
>>>>> management?
>>>>> Is there a minimum list of features and nice to have features for
>>>>> containers
>>>>> as far as resource management is concerned?
>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of
>>>> required resource management. BTW, I must agree with other people here
>>>> who noticed that per-process resource management is really useless and
>>>> hard to use :(
>> I totally agree.
>>
>>> I'll take a look at the references. I agree with you that it will be
>>> useful to have resource management for a group of tasks.
>
> For Resource Management to be useful it must depend on Resource Control.
> Resource Control depends on per-process accounting. Per-process accounting,
> when abstracted sufficiently, may enable higher level routines, preferably
> in userland, to extend functionality at will. All efforts should really go
> into the successful abstraction of per-process accounting.

I couldn't agree more. All that's needed in the kernel is low level per
task control and statistics gathering. The rest can be done in user space.

Peter

PS I'm a big fan of the various efforts to improve the quality of the
performance statistics that are exported from the kernel and my only wish
is that they get together to create one comprehensive solution.
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-02 1:33 ` Peter Williams
@ 2006-06-02 11:23 ` Matt Helsley
2006-06-02 13:16 ` Peter Williams
2006-06-06 10:47 ` Srivatsa Vaddagiri
0 siblings, 2 replies; 27+ messages in thread
From: Matt Helsley @ 2006-06-02 11:23 UTC (permalink / raw)
To: Peter Williams
Cc: LKML, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir, Balbir Singh,
    Mike Galbraith, Peter Williams, Con Kolivas, Sam Vilain,
    Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
    Chandra S. Seetharaman
On Fri, 2006-06-02 at 11:33 +1000, Peter Williams wrote:
> Al Boldi wrote:
> > Chandra Seetharaman wrote:
> >> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
> >>> Kirill Korotaev wrote:
> >>>>> Do you have any documented requirements for container resource
> >>>>> management?
> >>>>> Is there a minimum list of features and nice to have features for
> >>>>> containers
> >>>>> as far as resource management is concerned?
> >>>> Sure! You can check OpenVZ project (http://openvz.org) for example of
> >>>> required resource management. BTW, I must agree with other people here
> >>>> who noticed that per-process resource management is really useless and
> >>>> hard to use :(
> >> I totally agree.
> >>
> >>> I'll take a look at the references. I agree with you that it will be
> >>> useful to have resource management for a group of tasks.
> >
> > For Resource Management to be useful it must depend on Resource Control.
> > Resource Control depends on per-process accounting. Per-process accounting,
> > when abstracted sufficiently, may enable higher level routines, preferably
> > in userland, to extend functionality at will. All efforts should really go
> > into the successful abstraction of per-process accounting.
>
> I couldn't agree more. All that's needed in the kernel is low level per
> task control and statistics gathering. The rest can be done in user space.

<snip>

I'm assuming by "The rest can be done in user space" you mean that
tasks can be grouped, accounting information updated (% CPU), and
various knobs (nice) can be turned to keep task resource (CPU) usage
under control.

If I seem to be describing your suggestion then I don't think it will
work. Below you'll find the reasons I've come to this conclusion. Am I
oversimplifying or misunderstanding something critical?

Groups are needed to prevent processes from consuming unlimited
resources using clone/fork. However, since our accounting sources and
control knobs are per-task we must adjust per-task knobs within a group
every time accounting indicates a change in resource usage.

Let us suppose we have a UP system with 3 tasks -- group X: X1, X2; and
Z. By adjusting the nice values of X1 and X2, Z is responsible for
ensuring that group X does not exceed its limit of 50% CPU. Further
suppose that X1 and X2 are each using 25% of the CPU. In order to
prevent X1 + X2 from exceeding 50% each must be limited to 25% by an
appropriate nice value. [Note the hand wave: I'm assuming nice can be
mapped to a predictable percentage of CPU on a UP system.]

When accounting data indicates X2 has dropped to 15% of the CPU, Z may
raise X1's limit (to 35% at most) and it must lower X2's limit (down to
as little as 15%). Z must raise X1's limit by some amount (delta)
otherwise X1 could never increase its CPU usage. Z must decrease X2 to
25 - delta, otherwise the sum could exceed 50%. [Aside: In fact, if we
have N tasks in group X then it seems Z ought to adjust N nice values by
a total of delta. How delta gets distributed limits the rate at which
CPU usage may increase and would ideally depend on future changes in
usage.]

There are two problems as I see it:

1) If X1 grows to use 35% then X2's usage can't grow back from 15% until
X1 relents. This seems unpleasantly like cooperative scheduling within
group X because if we take this to its limit X2 gets 0% and X1 gets 50%
-- effectively starving X2. What little I know about nice suggests this
wouldn't really happen. However I think it may highlight one case where
fiddling with nice can't effectively control CPU usage.

2) Suppose we add group Y with tasks Y1-YM, Y's CPU usage is limited to
49%, each task of Y uses its limit of (49/M)% CPU, and the remaining 1%
is left for Z (i.e. the single CPU is being used heavily). Z must use
this 1% to read accounting information and adjust nice values as
described above. If X1 spawns X3 we're likely in trouble -- Z might not
get to run for a while but X3 has inherited X1's nice value. If we
return to our initial assumption that X1 and X2 are each using their
limit of 25% then X3 will get limited to 25% too. The sum of Xi can now
exceed 50% until Z is scheduled next. This only gets worse if there is
an imbalance between X1 and X2 as described earlier. In that case group
X could use 100% CPU until Z is scheduled! It also probably gets worse
as load increases and the number of scheduling opportunities for Z
decrease.

I don't see how task Z could solve the second problem. As with UP, in
SMP I think it depends on when Z (or one Z fixed to each CPU) is
scheduled.

I think these are simple scenarios that demonstrate the problem with
splitting resource management into accounting and control with userspace
in between.

Cheers,
	-Matt Helsley

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-02 11:23 ` Matt Helsley
@ 2006-06-02 13:16 ` Peter Williams
2006-06-06 10:47 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-06-02 13:16 UTC (permalink / raw)
To: Matt Helsley
Cc: LKML, Andrew Morton, dev, Srivatsa, ckrm-tech, balbir, Balbir Singh,
    Mike Galbraith, Con Kolivas, Sam Vilain, Kingsley Cheung,
    Eric W. Biederman, Ingo Molnar, Rene Herman, Chandra S. Seetharaman
Matt Helsley wrote:
> On Fri, 2006-06-02 at 11:33 +1000, Peter Williams wrote:
>> Al Boldi wrote:
>>> Chandra Seetharaman wrote:
>>>> On Thu, 2006-06-01 at 14:04 +0530, Balbir Singh wrote:
>>>>> Kirill Korotaev wrote:
>>>>>>> Do you have any documented requirements for container resource
>>>>>>> management?
>>>>>>> Is there a minimum list of features and nice to have features for
>>>>>>> containers
>>>>>>> as far as resource management is concerned?
>>>>>> Sure! You can check OpenVZ project (http://openvz.org) for example of
>>>>>> required resource management. BTW, I must agree with other people here
>>>>>> who noticed that per-process resource management is really useless and
>>>>>> hard to use :(
>>>> I totally agree.
>>>>
>>>>> I'll take a look at the references. I agree with you that it will be
>>>>> useful to have resource management for a group of tasks.
>>> For Resource Management to be useful it must depend on Resource Control.
>>> Resource Control depends on per-process accounting. Per-process accounting,
>>> when abstracted sufficiently, may enable higher level routines, preferably
>>> in userland, to extend functionality at will. All efforts should really go
>>> into the successful abstraction of per-process accounting.
>> I couldn't agree more. All that's needed in the kernel is low level per
>> task control and statistics gathering. The rest can be done in user space.
>
> <snip>
>
> I'm assuming by "The rest can be done in user space" you mean that
> tasks can be grouped, accounting information updated (% CPU), and
> various knobs (nice) can be turned to keep task resource (CPU) usage
> under control.
>
> If I seem to be describing your suggestion then I don't think it will
> work. Below you'll find the reasons I've come to this conclusion. Am I
> oversimplifying or misunderstanding something critical?
>
> Groups are needed to prevent processes from consuming unlimited
> resources using clone/fork. However, since our accounting sources and
> control knobs are per-task we must adjust per-task knobs within a group
> every time accounting indicates a change in resource usage.
>
> Let us suppose we have a UP system with 3 tasks -- group X: X1, X2; and
> Z. By adjusting the nice values of X1 and X2, Z is responsible for
> ensuring that group X does not exceed its limit of 50% CPU. Further
> suppose that X1 and X2 are each using 25% of the CPU. In order to
> prevent X1 + X2 from exceeding 50% each must be limited to 25% by an
> appropriate nice value. [Note the hand wave: I'm assuming nice can be
> mapped to a predictable percentage of CPU on a UP system.]
>
> When accounting data indicates X2 has dropped to 15% of the CPU, Z may
> raise X1's limit (to 35% at most) and it must lower X2's limit (down to
> as little as 15%). Z must raise X1's limit by some amount (delta)
> otherwise X1 could never increase its CPU usage. Z must decrease X2 to
> 25 - delta, otherwise the sum could exceed 50%. [Aside: In fact, if we
> have N tasks in group X then it seems Z ought to adjust N nice values by
> a total of delta. How delta gets distributed limits the rate at which
> CPU usage may increase and would ideally depend on future changes in
> usage.]
>
> There are two problems as I see it:
>
> 1) If X1 grows to use 35% then X2's usage can't grow back from 15% until
> X1 relents. This seems unpleasantly like cooperative scheduling within
> group X because if we take this to its limit X2 gets 0% and X1 gets 50%
> -- effectively starving X2. What little I know about nice suggests this
> wouldn't really happen. However I think it may highlight one case where
> fiddling with nice can't effectively control CPU usage.
>
> 2) Suppose we add group Y with tasks Y1-YM, Y's CPU usage is limited to
> 49%, each task of Y uses its limit of (49/M)% CPU, and the remaining 1%
> is left for Z (i.e. the single CPU is being used heavily). Z must use
> this 1% to read accounting information and adjust nice values as
> described above. If X1 spawns X3 we're likely in trouble -- Z might not
> get to run for a while but X3 has inherited X1's nice value. If we
> return to our initial assumption that X1 and X2 are each using their
> limit of 25% then X3 will get limited to 25% too. The sum of Xi can now
> exceed 50% until Z is scheduled next. This only gets worse if there is
> an imbalance between X1 and X2 as described earlier. In that case group
> X could use 100% CPU until Z is scheduled! It also probably gets worse
> as load increases and the number of scheduling opportunities for Z
> decrease.
>
> I don't see how task Z could solve the second problem. As with UP, in
> SMP I think it depends on when Z (or one Z fixed to each CPU) is
> scheduled.
>
> I think these are simple scenarios that demonstrate the problem with
> splitting resource management into accounting and control with userspace
> in between.

You're trying to do it all with nice. I said it could be done with nice
plus the CPU capping functionality my patch provides. Plus the stats of
course.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-06-02 11:23 ` Matt Helsley
2006-06-02 13:16 ` Peter Williams
@ 2006-06-06 10:47 ` Srivatsa Vaddagiri
1 sibling, 0 replies; 27+ messages in thread
From: Srivatsa Vaddagiri @ 2006-06-06 10:47 UTC (permalink / raw)
To: Matt Helsley
Cc: Peter Williams, LKML, Andrew Morton, dev, ckrm-tech, balbir,
    Balbir Singh, Mike Galbraith, Con Kolivas, Sam Vilain,
    Kingsley Cheung, Eric W. Biederman, Ingo Molnar, Rene Herman,
    Chandra S. Seetharaman
On Fri, Jun 02, 2006 at 04:23:04AM -0700, Matt Helsley wrote:
> There are two problems as I see it:
>
> 1) If X1 grows to use 35% then X2's usage can't grow back from 15% until
> X1 relents. This seems unpleasantly like cooperative scheduling within
> group X because if we take this to its limit X2 gets 0% and X1 gets 50%
> -- effectively starving X2. What little I know about nice suggests this
> wouldn't really happen. However I think it may highlight one case where
> fiddling with nice can't effectively control CPU usage.

I would expect task Z to adjust the limits of X1, X2 again when it
notices that X2 is "hungry". Until Z gets around to doing that, the
situation you describe will be true. If Z is configured to run quite
frequently (every 5 seconds?) to monitor/adjust limits, then this
starvation (of X2) may be avoided for longer periods?

> 2) Suppose we add group Y with tasks Y1-YM, Y's CPU usage is limited to
> 49%, each task of Y uses its limit of (49/M)% CPU, and the remaining 1%
> is left for Z (i.e. the single CPU is being used heavily). Z must use
> this 1% to read accounting information and adjust nice values as
> described above. If X1 spawns X3 we're likely in trouble -- Z might not
> get to run for a while but X3 has inherited X1's nice value. If we
> return to our initial assumption that X1 and X2 are each using their
> limit of 25% then X3 will get limited to 25% too. The sum of Xi can now
> exceed 50% until Z is scheduled next. This only gets worse if there is
> an imbalance between X1 and X2 as described earlier. In that case group
> X could use 100% CPU until Z is scheduled! It also probably gets worse
> as load increases and the number of scheduling opportunities for Z
> decrease.
>
> I don't see how task Z could solve the second problem. As with UP, in
> SMP I think it depends on when Z (or one Z fixed to each CPU) is
> scheduled.

Wouldn't it help if Z is made to run with nice -20 (or with RT prio
maybe), so that when Z wants to run (every 5 or 10 seconds) it is run
immediately? This is assuming that Z can do its job of adjusting limits
for all tasks "quickly" (maybe 100-200 ms?).

> I think these are simple scenarios that demonstrate the problem with
> splitting resource management into accounting and control with userspace
> in between.
>
> Cheers,
> -Matt Helsley

--
Regards,
vatsa

^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 0/5] sched: Add CPU rate caps
@ 2006-05-26 4:20 Peter Williams
2006-05-26 4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-26 4:20 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
    Ingo Molnar, Rene Herman

These patches implement CPU usage rate limits for tasks.

Although the rlimit mechanism already has a CPU usage limit (RLIMIT_CPU)
it is a total usage limit and therefore (to my mind) not very useful.
These patches provide an alternative whereby the (recent) average CPU
usage rate of a task can be limited to a (per task) specified proportion
of a single CPU's capacity. The limits are specified in parts per
thousand and come in two varieties -- hard and soft. The difference
between the two is that the system tries to enforce hard caps regardless
of the other demand for CPU resources but allows soft caps to be
exceeded if there are spare CPU resources available. By default, tasks
will have both caps set to 1000 (i.e. no limit) but newly forked tasks
will inherit any caps that have been imposed on their parent. The
minimum soft cap allowed is 0 (which effectively puts the task in the
background) and the minimum hard cap allowed is 1.

Care has been taken to minimize the overhead inflicted on tasks that
have no caps and my tests using kernbench indicate that it is hidden in
the noise.

Note: The first patch in this series fixes some problems with priority
inheritance that are present in 2.6.17-rc4-mm3 but will be fixed in the
next -mm kernel.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
@ 2006-05-26 4:20 ` Peter Williams
2006-05-26 6:58 ` Kari Hurtta
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-26 4:20 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Peter Williams, Linux Kernel, Kingsley Cheung,
    Ingo Molnar, Rene Herman

This patch implements hard CPU rate caps per task as a proportion of a
single CPU's capacity expressed in parts per thousand.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>

 include/linux/sched.h |    8 ++
 kernel/Kconfig.caps   |   14 +++-
 kernel/sched.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 168 insertions(+), 8 deletions(-)

Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
===================================================================
--- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:46:35.000000000 +1000
+++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 11:00:07.000000000 +1000
@@ -796,6 +796,10 @@ struct task_struct {
 #ifdef CONFIG_CPU_RATE_CAPS
 	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
 	unsigned int cpu_rate_cap;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	unsigned int cpu_rate_hard_cap;
+	struct timer_list sinbin_timer;
+#endif
 #endif
 
 	enum sleep_type sleep_type;
@@ -994,6 +998,10 @@ struct task_struct {
 #ifdef CONFIG_CPU_RATE_CAPS
 unsigned int get_cpu_rate_cap(const struct task_struct *);
 int set_cpu_rate_cap(struct task_struct *, unsigned int);
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
+int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
+#endif
 #endif
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps	2006-05-26 10:45:26.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps	2006-05-26 11:00:07.000000000 +1000
@@ -3,11 +3,21 @@
 #
 
 config CPU_RATE_CAPS
-	bool "Support (soft) CPU rate caps"
+	bool "Support CPU rate caps"
 	default n
 	---help---
-	  Say y here if you wish to be able to put a (soft) upper limit on
+	  Say y here if you wish to be able to put a soft upper limit on
 	  the rate of CPU usage by individual tasks.  A task which has been
 	  allocated a soft CPU rate cap will be limited to that rate of CPU
 	  usage unless there is spare CPU resources available after the needs
 	  of uncapped tasks are met.
+
+config CPU_RATE_HARD_CAPS
+	bool "Support CPU rate hard caps"
+	depends on CPU_RATE_CAPS
+	default n
+	---help---
+	  Say y here if you wish to be able to put a hard upper limit on
+	  the rate of CPU usage by individual tasks.  A task which has been
+	  allocated a hard CPU rate cap will be limited to that rate of CPU
+	  usage regardless of whether there is spare CPU resources available.
Index: MM-2.6.17-rc4-mm3/kernel/sched.c
===================================================================
--- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 11:00:02.000000000 +1000
+++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 13:50:11.000000000 +1000
@@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
 
 #ifdef CONFIG_CPU_RATE_CAPS
 #define CAP_STATS_OFFSET 8
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+static void sinbin_release_fn(unsigned long arg);
+#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
+#else
+#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
+#endif
 #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
 /* this assumes that p is not a real time task */
 #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
 #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
-#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
+#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
 
 static void init_cpu_rate_caps(task_t *p)
 {
 	p->cpu_rate_cap = 1000;
 	p->flags &= ~PF_HAS_CAP;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	p->cpu_rate_hard_cap = 1000;
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.function = sinbin_release_fn;
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void set_cap_flag(task_t *p)
 {
-	if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
+	if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
 		p->flags |= PF_HAS_CAP;
 	else
 		p->flags &= ~PF_HAS_CAP;
@@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
 
 static inline int task_exceeding_cap(const task_t *p)
 {
-	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
+	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const
 	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
 	lhs = (p->avg_cpu_per_cycle + delta) * 1000;
-	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
+	rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
 
 	return lhs > rhs;
 }
@@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
 {
 	p->avg_cpu_per_cycle = 0;
 	p->avg_cycle_length = 0;
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+	init_timer(&p->sinbin_timer);
+	p->sinbin_timer.data = (unsigned long) p;
+#endif
 }
 
 static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
@@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
 	p->array = NULL;
 }
 
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
+
+/*
+ * Release a task from the sinbin
+ */
+static void sinbin_release_fn(unsigned long arg)
+{
+	unsigned long flags;
+	struct task_struct *p = (struct task_struct*)arg;
+	struct runqueue *rq = task_rq_lock(p, &flags);
+
+	p->prio = effective_prio(p);
+
+	__activate_task(p, rq);
+
+	task_rq_unlock(rq, &flags);
+}
+
+static unsigned long reqd_sinbin_ticks(const task_t *p)
+{
+	unsigned long long res;
+
+	res = p->avg_cpu_per_cycle * 1000;
+
+	if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
+		(void)do_div(res, p->cpu_rate_hard_cap);
+		res -= p->avg_cpu_per_cycle;
+		/*
+		 * IF it was available we'd also subtract
+		 * the average sleep per cycle here
+		 */
+		res >>= CAP_STATS_OFFSET;
+		(void)do_div(res, (1000000000 / HZ));
+
+		return res ? : 1;
+	}
+
+	return 0;
+}
+
+static void sinbin_task(task_t *p, unsigned long durn)
+{
+	if (durn == 0)
+		return;
+	deactivate_task(p, task_rq(p));
+	p->sinbin_timer.expires = jiffies + durn;
+	add_timer(&p->sinbin_timer);
+}
+#else
+#define task_has_hard_cap(p) 0
+#define reqd_sinbin_ticks(p) 0
+
+static inline void sinbin_task(task_t *p, unsigned long durn)
+{
+}
+#endif
+
 /*
  * resched_task - mark a task 'to be rescheduled now'.
  *
@@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
 		}
 	}
 
-	/* do this now so that stats are correct for SMT code */
-	if (task_has_cap(prev))
+	if (task_has_cap(prev)) {
 		inc_cap_stats_both(prev, now);
+		if (task_has_hard_cap(prev) && !prev->state &&
+		    !rt_task(prev) && !signal_pending(prev)) {
+			unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
+
+			if (sinbin_ticks)
+				sinbin_task(prev, sinbin_ticks);
+		}
+	}
 
 	cpu = smp_processor_id();
 	if (unlikely(!rq->nr_running)) {
@@ -4539,6 +4620,67 @@ out:
 }
 
 EXPORT_SYMBOL(set_cpu_rate_cap);
+
+#ifdef CONFIG_CPU_RATE_HARD_CAPS
+unsigned int get_cpu_rate_hard_cap(const struct task_struct *p)
+{
+	return p->cpu_rate_hard_cap;
+}
+
+EXPORT_SYMBOL(get_cpu_rate_hard_cap);
+
+/*
+ * Require: 1 <= new_cap <= 1000
+ */
+int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
+{
+	int is_allowed;
+	unsigned long flags;
+	struct runqueue *rq;
+	int delta;
+
+	if (new_cap > 1000 && new_cap > 0)
+		return -EINVAL;
+	is_allowed = capable(CAP_SYS_NICE);
+	/*
+	 * We have to be careful, if called from /proc code,
+	 * the task might be in the middle of scheduling on another CPU.
+	 */
+	rq = task_rq_lock(p, &flags);
+	delta = new_cap - p->cpu_rate_hard_cap;
+	if (!is_allowed) {
+		/*
+		 * Ordinary users can set/change caps on their own tasks
+		 * provided that the new setting is MORE constraining
+		 */
+		if (((current->euid != p->uid) && (current->uid != p->uid)) || (delta > 0)) {
+			task_rq_unlock(rq, &flags);
+			return -EPERM;
+		}
+	}
+	/*
+	 * The RT tasks don't have caps, but we still allow the caps to be
+	 * set - but as expected it wont have any effect on scheduling until
+	 * the task becomes SCHED_NORMAL/SCHED_BATCH:
+	 */
+	p->cpu_rate_hard_cap = new_cap;
+
+	if (has_rt_policy(p))
+		goto out;
+
+	if (p->array)
+		dec_raw_weighted_load(rq, p);
+	set_load_weight(p);
+	if (p->array)
+		inc_raw_weighted_load(rq, p);
+out:
+	task_rq_unlock(rq, &flags);
+
+	return 0;
+}
+
+EXPORT_SYMBOL(set_cpu_rate_hard_cap);
+#endif
 #endif
 
 long sched_setaffinity(pid_t pid, cpumask_t new_mask)
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
@ 2006-05-26 6:58 ` Kari Hurtta
2006-05-27 1:00 ` Peter Williams
2006-05-26 11:00 ` Con Kolivas
2006-05-27 6:48 ` Balbir Singh
2 siblings, 1 reply; 27+ messages in thread
From: Kari Hurtta @ 2006-05-26 6:58 UTC (permalink / raw)
To: linux-kernel

Peter Williams <pwil3058@bigpond.net.au> writes in gmane.linux.kernel:

> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

> + * Require: 1 <= new_cap <= 1000
> + */
> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
> +{
> +	int is_allowed;
> +	unsigned long flags;
> +	struct runqueue *rq;
> +	int delta;
> +
> +	if (new_cap > 1000 && new_cap > 0)
> +		return -EINVAL;

That condition looks wrong.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 6:58 ` Kari Hurtta
@ 2006-05-27 1:00 ` Peter Williams
0 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-27 1:00 UTC (permalink / raw)
To: Kari Hurtta; +Cc: linux-kernel

Kari Hurtta wrote:
> Peter Williams <pwil3058@bigpond.net.au> writes in gmane.linux.kernel:
>
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
>
>> + * Require: 1 <= new_cap <= 1000
>> + */
>> +int set_cpu_rate_hard_cap(struct task_struct *p, unsigned int new_cap)
>> +{
>> +	int is_allowed;
>> +	unsigned long flags;
>> +	struct runqueue *rq;
>> +	int delta;
>> +
>> +	if (new_cap > 1000 && new_cap > 0)
>> +		return -EINVAL;
>
> That condition looks wrong.

It certainly does.

Thanks
Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
2006-05-26 6:58 ` Kari Hurtta
@ 2006-05-26 11:00 ` Con Kolivas
2006-05-26 13:59 ` Peter Williams
2006-05-27 6:48 ` Balbir Singh
2 siblings, 1 reply; 27+ messages in thread
From: Con Kolivas @ 2006-05-26 11:00 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar,
    Rene Herman

On Friday 26 May 2006 14:20, Peter Williams wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.

A hard cap of 1/1000 could lead to interesting starvation scenarios where a
mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
a lesser extent for a 0 soft cap.

Here is how I handle idleprio tasks in current -ck:

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
tags tasks that are holding a mutex

http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
is the idleprio policy for staircase.

What it does is runs idleprio tasks as normal tasks when they hold a mutex or
are waking up after calling down() (ie holding a semaphore). These two in
combination have shown resistance to any priority inversion problems in
widespread testing. An attempt was made to track semaphores held via a
down_interruptible() but unfortunately the lack of strict rules about who
could release the semaphore meant accounting of this scenario was impossible.
In practice, though, there were no test cases that showed it to be an issue,
and the recent conversion en-masse of semaphores to mutexes in the kernel
means it has pretty much covered most possibilities.

--
-ck

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
2006-05-26 11:00 ` Con Kolivas
@ 2006-05-26 13:59 ` Peter Williams
2006-05-26 14:12 ` Con Kolivas
2006-05-26 14:23 ` Mike Galbraith
0 siblings, 2 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-26 13:59 UTC (permalink / raw)
To: Con Kolivas
Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar,
    Rene Herman

Con Kolivas wrote:
> On Friday 26 May 2006 14:20, Peter Williams wrote:
>> This patch implements hard CPU rate caps per task as a proportion of a
>> single CPU's capacity expressed in parts per thousand.
>
> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> a lesser extent for a 0 soft cap.
>
> Here is how I handle idleprio tasks in current -ck:
>
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> tags tasks that are holding a mutex
>
> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> is the idleprio policy for staircase.
>
> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> are waking up after calling down() (ie holding a semaphore).

I wasn't aware that you could detect those conditions. They could be
very useful.

> These two in
> combination have shown resistance to any priority inversion problems in
> widespread testing. An attempt was made to track semaphores held via a
> down_interruptible() but unfortunately the lack of strict rules about who
> could release the semaphore meant accounting of this scenario was impossible.
> In practice, though, there were no test cases that showed it to be an issue,
> and the recent conversion en-masse of semaphores to mutexes in the kernel
> means it has pretty much covered most possibilities.

Thanks,
Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 13:59 ` Peter Williams
@ 2006-05-26 14:12 ` Con Kolivas
  2006-05-26 14:23 ` Mike Galbraith
  1 sibling, 0 replies; 27+ messages in thread
From: Con Kolivas @ 2006-05-26 14:12 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Friday 26 May 2006 23:59, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where
> > a mutex or semaphore was held by a task that hardly ever got cpu. Same
> > goes to a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/track_mutexes-1.patch tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/
> >patches/sched-idleprio-1.7.patch is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a
> > mutex or are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions.  They could be
> very useful.

Ingo's mutex infrastructure made it possible to accurately track mutexes
held, and basically anything entering uninterruptible sleep has called
down(). Mainline, as you know, already flags the latter for interactivity
purposes.

--
-ck

^ permalink raw reply	[flat|nested] 27+ messages in thread
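[Editor's note: the rule Con describes — an idleprio task is treated as a normal task while it holds a mutex or is waking after down() — can be sketched in isolation as follows. This is a hypothetical simplification for illustration; the struct fields, the macro value, and the function name are invented here and are not staircase's actual code.]

```c
/* Minimal model of the idleprio demotion rule described above.
 * mutexes_held would be maintained by Ingo's mutex lock/unlock
 * infrastructure; woken_from_down would be flagged on wakeup from
 * uninterruptible sleep (i.e. after down()). */
struct task {
	int policy;          /* SCHED_IDLEPRIO or a normal policy */
	int mutexes_held;    /* count of mutexes currently held */
	int woken_from_down; /* just woke after holding a semaphore */
};

#define SCHED_IDLEPRIO 5 /* illustrative value only */

/* Returns 1 if the task should actually be scheduled at idle priority. */
static int runs_as_idleprio(const struct task *p)
{
	if (p->policy != SCHED_IDLEPRIO)
		return 0;
	/* Holding a lock while starved would invert priorities for
	 * whoever waits on that lock, so run the task normally. */
	if (p->mutexes_held || p->woken_from_down)
		return 0;
	return 1;
}
```

The point of the rule is visible in the second test: the moment a lock is held, the task temporarily stops being "idle only", which is what gives the scheme its resistance to priority inversion.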
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 13:59 ` Peter Williams
  2006-05-26 14:12 ` Con Kolivas
@ 2006-05-26 14:23 ` Mike Galbraith
  2006-05-27  0:16 ` Peter Williams
  1 sibling, 1 reply; 27+ messages in thread
From: Mike Galbraith @ 2006-05-26 14:23 UTC (permalink / raw)
To: Peter Williams
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> Con Kolivas wrote:
> > On Friday 26 May 2006 14:20, Peter Williams wrote:
> >> This patch implements hard CPU rate caps per task as a proportion of a
> >> single CPU's capacity expressed in parts per thousand.
> >
> > A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> > mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> > a lesser extent to a 0 soft cap.
> >
> > Here is how I handle idleprio tasks in current -ck:
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> > tags tasks that are holding a mutex
> >
> > http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> > is the idleprio policy for staircase.
> >
> > What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> > are waking up after calling down() (ie holding a semaphore).
>
> I wasn't aware that you could detect those conditions.  They could be
> very useful.

Isn't this exactly what the PI code is there to handle?  Is something
more than PI needed?

	-Mike

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26 14:23 ` Mike Galbraith
@ 2006-05-27  0:16 ` Peter Williams
  2006-05-27  9:28 ` Mike Galbraith
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-27 0:16 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>> Con Kolivas wrote:
>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>> single CPU's capacity expressed in parts per thousand.
>>>
>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
>>> a lesser extent to a 0 soft cap.
>>>
>>> Here is how I handle idleprio tasks in current -ck:
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>> tags tasks that are holding a mutex
>>>
>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>> is the idleprio policy for staircase.
>>>
>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
>>> are waking up after calling down() (ie holding a semaphore).
>>
>> I wasn't aware that you could detect those conditions.  They could be
>> very useful.
>
> Isn't this exactly what the PI code is there to handle?  Is something
> more than PI needed?

AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
be extended.  It could be argued that extending PI so that it can be
used by non RT tasks is a worthwhile endeavour in its own right.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  0:16 ` Peter Williams
@ 2006-05-27  9:28 ` Mike Galbraith
  2006-05-28  2:09 ` Peter Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Galbraith @ 2006-05-27 9:28 UTC (permalink / raw)
To: Peter Williams
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
> >> Con Kolivas wrote:
> >>> On Friday 26 May 2006 14:20, Peter Williams wrote:
> >>>> This patch implements hard CPU rate caps per task as a proportion of a
> >>>> single CPU's capacity expressed in parts per thousand.
> >>>
> >>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
> >>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
> >>> a lesser extent to a 0 soft cap.
> >>>
> >>> Here is how I handle idleprio tasks in current -ck:
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
> >>> tags tasks that are holding a mutex
> >>>
> >>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
> >>> is the idleprio policy for staircase.
> >>>
> >>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
> >>> are waking up after calling down() (ie holding a semaphore).
> >>
> >> I wasn't aware that you could detect those conditions.  They could be
> >> very useful.
> >
> > Isn't this exactly what the PI code is there to handle?  Is something
> > more than PI needed?
>
> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
> be extended.  It could be argued that extending PI so that it can be
> used by non RT tasks is a worthwhile endeavour in its own right.

Hm.  Looking around a bit, it appears to me that we're one itty bitty
redefine away from PI being global.  No idea if/when that will happen
though.

	-Mike

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  9:28 ` Mike Galbraith
@ 2006-05-28  2:09 ` Peter Williams
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-05-28 2:09 UTC (permalink / raw)
To: Mike Galbraith
Cc: Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Mike Galbraith wrote:
> On Sat, 2006-05-27 at 10:16 +1000, Peter Williams wrote:
>> Mike Galbraith wrote:
>>> On Fri, 2006-05-26 at 23:59 +1000, Peter Williams wrote:
>>>> Con Kolivas wrote:
>>>>> On Friday 26 May 2006 14:20, Peter Williams wrote:
>>>>>> This patch implements hard CPU rate caps per task as a proportion of a
>>>>>> single CPU's capacity expressed in parts per thousand.
>>>>>
>>>>> A hard cap of 1/1000 could lead to interesting starvation scenarios where a
>>>>> mutex or semaphore was held by a task that hardly ever got cpu. Same goes to
>>>>> a lesser extent to a 0 soft cap.
>>>>>
>>>>> Here is how I handle idleprio tasks in current -ck:
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/track_mutexes-1.patch
>>>>> tags tasks that are holding a mutex
>>>>>
>>>>> http://ck.kolivas.org/patches/2.6/pre-releases/2.6.17-rc5/2.6.17-rc5-ck1/patches/sched-idleprio-1.7.patch
>>>>> is the idleprio policy for staircase.
>>>>>
>>>>> What it does is runs idleprio tasks as normal tasks when they hold a mutex or
>>>>> are waking up after calling down() (ie holding a semaphore).
>>>>
>>>> I wasn't aware that you could detect those conditions.  They could be
>>>> very useful.
>>>
>>> Isn't this exactly what the PI code is there to handle?  Is something
>>> more than PI needed?
>>
>> AFAIK (but I may be wrong) PI is only used by RT tasks and would need to
>> be extended.  It could be argued that extending PI so that it can be
>> used by non RT tasks is a worthwhile endeavour in its own right.
>
> Hm.  Looking around a bit, it appears to me that we're one itty bitty
> redefine away from PI being global.  No idea if/when that will happen
> though.

It needs slightly more than that.  It's currently relying on the way
tasks with prio less than MAX_RT_PRIO are treated to prevent the
priority of tasks that are inheriting a priority from having that
priority reset to their normal priority at various places in sched.c.
So something would need to be done in that regard, but it shouldn't be
too difficult.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
  2006-05-26  6:58 ` Kari Hurtta
  2006-05-26 11:00 ` Con Kolivas
@ 2006-05-27  6:48 ` Balbir Singh
  2006-05-27  8:44 ` Peter Williams
  2 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2006-05-27 6:48 UTC (permalink / raw)
To: Peter Williams
Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On 5/26/06, Peter Williams <pwil3058@bigpond.net.au> wrote:
> This patch implements hard CPU rate caps per task as a proportion of a
> single CPU's capacity expressed in parts per thousand.
>
> Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
>
>  include/linux/sched.h |    8 ++
>  kernel/Kconfig.caps   |   14 +++-
>  kernel/sched.c        |  154 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 168 insertions(+), 8 deletions(-)
>
> Index: MM-2.6.17-rc4-mm3/include/linux/sched.h
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/include/linux/sched.h	2006-05-26 10:46:35.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/include/linux/sched.h	2006-05-26 11:00:07.000000000 +1000
> @@ -796,6 +796,10 @@ struct task_struct {
>  #ifdef CONFIG_CPU_RATE_CAPS
>  	unsigned long long avg_cpu_per_cycle, avg_cycle_length;
>  	unsigned int cpu_rate_cap;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +	unsigned int cpu_rate_hard_cap;
> +	struct timer_list sinbin_timer;

Using a timer for releasing tasks from their sinbin sounds like a bit
of an overhead, given that there could be 10s of thousands of tasks.
Is it possible to use the scheduler_tick() function to take a look at
all deactivated tasks (as efficiently as possible) and activate them
when it's time to activate them, or just fold the functionality in by
defining a time quantum after which everyone is woken up?  This time
quantum could be the same as the time over which limits are honoured.

> +#endif
>  #endif
>  	enum sleep_type sleep_type;
>
> @@ -994,6 +998,10 @@ struct task_struct {
>  #ifdef CONFIG_CPU_RATE_CAPS
>  unsigned int get_cpu_rate_cap(const struct task_struct *);
>  int set_cpu_rate_cap(struct task_struct *, unsigned int);
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +unsigned int get_cpu_rate_hard_cap(const struct task_struct *);
> +int set_cpu_rate_hard_cap(struct task_struct *, unsigned int);
> +#endif
>  #endif
>
>  static inline pid_t process_group(struct task_struct *tsk)
> Index: MM-2.6.17-rc4-mm3/kernel/Kconfig.caps
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/Kconfig.caps	2006-05-26 10:45:26.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/Kconfig.caps	2006-05-26 11:00:07.000000000 +1000
> @@ -3,11 +3,21 @@
>  #
>
>  config CPU_RATE_CAPS
> -	bool "Support (soft) CPU rate caps"
> +	bool "Support CPU rate caps"
>  	default n
>  	---help---
> -	  Say y here if you wish to be able to put a (soft) upper limit on
> +	  Say y here if you wish to be able to put a soft upper limit on
>  	  the rate of CPU usage by individual tasks.  A task which has been
>  	  allocated a soft CPU rate cap will be limited to that rate of CPU
>  	  usage unless there is spare CPU resources available after the needs
>  	  of uncapped tasks are met.
> +
> +config CPU_RATE_HARD_CAPS
> +	bool "Support CPU rate hard caps"
> +	depends on CPU_RATE_CAPS
> +	default n
> +	---help---
> +	  Say y here if you wish to be able to put a hard upper limit on
> +	  the rate of CPU usage by individual tasks.  A task which has been
> +	  allocated a hard CPU rate cap will be limited to that rate of CPU
> +	  usage regardless of whether there is spare CPU resources available.
> Index: MM-2.6.17-rc4-mm3/kernel/sched.c
> ===================================================================
> --- MM-2.6.17-rc4-mm3.orig/kernel/sched.c	2006-05-26 11:00:02.000000000 +1000
> +++ MM-2.6.17-rc4-mm3/kernel/sched.c	2006-05-26 13:50:11.000000000 +1000
> @@ -201,21 +201,33 @@ static inline unsigned int task_timeslic
>
>  #ifdef CONFIG_CPU_RATE_CAPS
>  #define CAP_STATS_OFFSET 8
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +static void sinbin_release_fn(unsigned long arg);
> +#define min_cpu_rate_cap(p) min((p)->cpu_rate_cap, (p)->cpu_rate_hard_cap)
> +#else
> +#define min_cpu_rate_cap(p) (p)->cpu_rate_cap
> +#endif
>  #define task_has_cap(p) unlikely((p)->flags & PF_HAS_CAP)
>  /* this assumes that p is not a real time task */
>  #define task_is_background(p) unlikely((p)->cpu_rate_cap == 0)
>  #define task_being_capped(p) unlikely((p)->prio >= CAPPED_PRIO)
> -#define cap_load_weight(p) (((p)->cpu_rate_cap * SCHED_LOAD_SCALE) / 1000)
> +#define cap_load_weight(p) ((min_cpu_rate_cap(p) * SCHED_LOAD_SCALE) / 1000)
>
>  static void init_cpu_rate_caps(task_t *p)
>  {
>  	p->cpu_rate_cap = 1000;
>  	p->flags &= ~PF_HAS_CAP;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +	p->cpu_rate_hard_cap = 1000;
> +	init_timer(&p->sinbin_timer);
> +	p->sinbin_timer.function = sinbin_release_fn;
> +	p->sinbin_timer.data = (unsigned long) p;
> +#endif
>  }
>
>  static inline void set_cap_flag(task_t *p)
>  {
> -	if (p->cpu_rate_cap < 1000 && !has_rt_policy(p))
> +	if (min_cpu_rate_cap(p) < 1000 && !has_rt_policy(p))
>  		p->flags |= PF_HAS_CAP;
>  	else
>  		p->flags &= ~PF_HAS_CAP;
> @@ -223,7 +235,7 @@ static inline void set_cap_flag(task_t *
>
>  static inline int task_exceeding_cap(const task_t *p)
>  {
> -	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * p->cpu_rate_cap);
> +	return (p->avg_cpu_per_cycle * 1000) > (p->avg_cycle_length * min_cpu_rate_cap(p));
>  }
>
>  #ifdef CONFIG_SCHED_SMT
> @@ -257,7 +269,7 @@ static int task_exceeding_cap_now(const
>
>  	delta = (now > p->timestamp) ? (now - p->timestamp) : 0;
>  	lhs = (p->avg_cpu_per_cycle + delta) * 1000;
> -	rhs = (p->avg_cycle_length + delta) * p->cpu_rate_cap;
> +	rhs = (p->avg_cycle_length + delta) * min_cpu_rate_cap(p);
>
>  	return lhs > rhs;
>  }
> @@ -266,6 +278,10 @@ static inline void init_cap_stats(task_t
>  {
>  	p->avg_cpu_per_cycle = 0;
>  	p->avg_cycle_length = 0;
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +	init_timer(&p->sinbin_timer);
> +	p->sinbin_timer.data = (unsigned long) p;
> +#endif
>  }
>
>  static inline void inc_cap_stats_cycle(task_t *p, unsigned long long now)
> @@ -1213,6 +1229,64 @@ static void deactivate_task(struct task_
>  	p->array = NULL;
>  }
>
> +#ifdef CONFIG_CPU_RATE_HARD_CAPS
> +#define task_has_hard_cap(p) unlikely((p)->cpu_rate_hard_cap < 1000)
> +
> +/*
> + * Release a task from the sinbin
> + */
> +static void sinbin_release_fn(unsigned long arg)
> +{
> +	unsigned long flags;
> +	struct task_struct *p = (struct task_struct*)arg;
> +	struct runqueue *rq = task_rq_lock(p, &flags);
> +
> +	p->prio = effective_prio(p);
> +
> +	__activate_task(p, rq);
> +
> +	task_rq_unlock(rq, &flags);
> +}
> +
> +static unsigned long reqd_sinbin_ticks(const task_t *p)
> +{
> +	unsigned long long res;
> +
> +	res = p->avg_cpu_per_cycle * 1000;
> +
> +	if (res > p->avg_cycle_length * p->cpu_rate_hard_cap) {
> +		(void)do_div(res, p->cpu_rate_hard_cap);
> +		res -= p->avg_cpu_per_cycle;
> +		/*
> +		 * IF it was available we'd also subtract
> +		 * the average sleep per cycle here
> +		 */
> +		res >>= CAP_STATS_OFFSET;
> +		(void)do_div(res, (1000000000 / HZ));

Please use NSEC_PER_SEC if that is what 10^9 stands for in the above
calculation.

> +
> +		return res ? : 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static void sinbin_task(task_t *p, unsigned long durn)
> +{
> +	if (durn == 0)
> +		return;
> +	deactivate_task(p, task_rq(p));
> +	p->sinbin_timer.expires = jiffies + durn;
> +	add_timer(&p->sinbin_timer);
> +}
> +#else
> +#define task_has_hard_cap(p) 0
> +#define reqd_sinbin_ticks(p) 0
> +
> +static inline void sinbin_task(task_t *p, unsigned long durn)
> +{
> +}
> +#endif
> +
>  /*
>   * resched_task - mark a task 'to be rescheduled now'.
>   *
> @@ -3508,9 +3582,16 @@ need_resched_nonpreemptible:
>  		}
>  	}
>
> -	/* do this now so that stats are correct for SMT code */
> -	if (task_has_cap(prev))
> +	if (task_has_cap(prev)) {
>  		inc_cap_stats_both(prev, now);
> +		if (task_has_hard_cap(prev) && !prev->state &&
> +		    !rt_task(prev) && !signal_pending(prev)) {
> +			unsigned long sinbin_ticks = reqd_sinbin_ticks(prev);
> +
> +			if (sinbin_ticks)
> +				sinbin_task(prev, sinbin_ticks);
> +		}
> +	}
>
>  	cpu = smp_processor_id();
>  	if (unlikely(!rq->nr_running)) {
> @@ -4539,6 +4620,67 @@ out:
>  }
>
<snip>

Balbir
Linux Technology Center
IBM Software Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread
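[Editor's note: both of Balbir's review comments concern the arithmetic in reqd_sinbin_ticks(), so a standalone rendering may help. The following is a hypothetical userspace sketch of that calculation, not the patch's exact code: the NSEC_PER_SEC and HZ values are illustrative, plain division replaces do_div(), and the CAP_STATS_OFFSET fixed-point shift is omitted by treating the averages as plain nanosecond values.]

```c
#define NSEC_PER_SEC 1000000000ULL
#define HZ 1000UL /* illustrative tick rate */

/* How many timer ticks must a task sit in the sinbin so that its
 * smoothed CPU usage falls back under its hard cap?  All times are in
 * nanoseconds; hard_cap_ppt is the cap in parts per thousand. */
static unsigned long sinbin_ticks(unsigned long long avg_cpu_per_cycle,
				  unsigned long long avg_cycle_length,
				  unsigned int hard_cap_ppt)
{
	unsigned long long res = avg_cpu_per_cycle * 1000;

	/* only sinbin if usage/cycle currently exceeds cap/1000 */
	if (res > avg_cycle_length * hard_cap_ppt) {
		/* cycle length at which the current usage equals the cap */
		res /= hard_cap_ppt;
		/* subtract the CPU time already consumed; the remainder
		 * is how long the task must be kept off the runqueue
		 * (prior sleep in the cycle is deliberately ignored, as
		 * the patch's own comment notes) */
		res -= avg_cpu_per_cycle;
		/* convert nanoseconds to ticks, rounding up to at least 1 */
		res /= (NSEC_PER_SEC / HZ);
		return res ? res : 1;
	}

	return 0;
}
```

For example, a task that consumed 50 ms of CPU over a 100 ms cycle under a 100/1000 (10%) hard cap needs a 500 ms cycle for that usage to be legal, so it is sinbinned for the remaining 450 ms, i.e. 450 ticks at HZ=1000.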
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  6:48 ` Balbir Singh
@ 2006-05-27  8:44 ` Peter Williams
  2006-05-31 13:10 ` Kirill Korotaev
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-27 8:44 UTC (permalink / raw)
To: Balbir Singh
Cc: Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Balbir Singh wrote:
>
> Using a timer for releasing tasks from their sinbin sounds like a bit
> of an overhead.  Given that there could be 10s of thousands of tasks.

The more runnable tasks there are the less likely it is that any of them
is exceeding its hard cap due to normal competition for the CPUs.  So I
think that it's unlikely that there will ever be a very large number of
tasks in the sinbin at the same time.

> Is it possible to use the scheduler_tick() function take a look at all
> deactivated tasks (as efficiently as possible) and activate them when
> its time to activate them or just fold the functionality by defining a
> time quantum after which everyone is worken up.  This time quantum
> could be the same as the time over which limits are honoured.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-27  8:44 ` Peter Williams
@ 2006-05-31 13:10 ` Kirill Korotaev
  2006-05-31 15:59 ` Balbir Singh
  2006-05-31 23:28 ` Peter Williams
  0 siblings, 2 replies; 27+ messages in thread
From: Kirill Korotaev @ 2006-05-31 13:10 UTC (permalink / raw)
To: Peter Williams
Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

>> Using a timer for releasing tasks from their sinbin sounds like a bit
>> of an overhead.  Given that there could be 10s of thousands of tasks.
>
> The more runnable tasks there are the less likely it is that any of them
> is exceeding its hard cap due to normal competition for the CPUs.  So I
> think that it's unlikely that there will ever be a very large number of
> tasks in the sinbin at the same time.

for containers this can be untrue... :( actually even for 1000 tasks
(I suppose this is the maximum in your case) it can slow down
significantly as well.

>> Is it possible to use the scheduler_tick() function take a look at all
>> deactivated tasks (as efficiently as possible) and activate them when
>> its time to activate them or just fold the functionality by defining a
>> time quantum after which everyone is worken up.  This time quantum
>> could be the same as the time over which limits are honoured.

agree with it.

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 13:10 ` Kirill Korotaev
@ 2006-05-31 15:59 ` Balbir Singh
  2006-05-31 18:09 ` Mike Galbraith
  ` (2 more replies)
  3 siblings, 3 replies; 27+ messages in thread
From: Balbir Singh @ 2006-05-31 15:59 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>> of an overhead.  Given that there could be 10s of thousands of tasks.
>>
>> The more runnable tasks there are the less likely it is that any of
>> them is exceeding its hard cap due to normal competition for the
>> CPUs.  So I think that it's unlikely that there will ever be a very
>> large number of tasks in the sinbin at the same time.
>
> for containers this can be untrue... :( actually even for 1000 tasks (I
> suppose this is the maximum in your case) it can slowdown significantly
> as well.

Do you have any documented requirements for container resource management?
Is there a minimum list of features and nice to have features for
containers as far as resource management is concerned?

>>> Is it possible to use the scheduler_tick() function take a look at all
>>> deactivated tasks (as efficiently as possible) and activate them when
>>> its time to activate them or just fold the functionality by defining a
>>> time quantum after which everyone is worken up.  This time quantum
>>> could be the same as the time over which limits are honoured.
>
> agree with it.

Thinking a bit more along these lines, it would probably break O(1).
But I guess a good algorithm can amortize the cost.

> Kirill

--
Balbir Singh,
Linux Technology Center,
IBM Software Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59 ` Balbir Singh
@ 2006-05-31 18:09 ` Mike Galbraith
  2006-06-01  7:41 ` Kirill Korotaev
  2006-06-01 23:43 ` Peter Williams
  2 siblings, 0 replies; 27+ messages in thread
From: Mike Galbraith @ 2006-05-31 18:09 UTC (permalink / raw)
To: balbir
Cc: Kirill Korotaev, Peter Williams, Balbir Singh, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

On Wed, 2006-05-31 at 21:29 +0530, Balbir Singh wrote:
> Do you have any documented requirements for container resource management?

(?? where would that come from?)

Containers, I can imagine ~working (albeit I don't see why the
num_tasks dilution problem shouldn't apply to num_containers... it's
the same thing, stale info)

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59 ` Balbir Singh
  2006-05-31 18:09 ` Mike Galbraith
@ 2006-06-01  7:41 ` Kirill Korotaev
  2006-06-01  8:34 ` Balbir Singh
  2006-06-01 23:43 ` Peter Williams
  2 siblings, 1 reply; 27+ messages in thread
From: Kirill Korotaev @ 2006-06-01 7:41 UTC (permalink / raw)
To: balbir
Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman, Sam Vilain, Andrew Morton, Eric W. Biederman

>>> The more runnable tasks there are the less likely it is that any of
>>> them is exceeding its hard cap due to normal competition for the
>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue... :( actually even for 1000 tasks
>> (I suppose this is the maximum in your case) it can slowdown
>> significantly as well.
>
> Do you have any documented requirements for container resource management?
> Is there a minimum list of features and nice to have features for
> containers
> as far as resource management is concerned?

Sure! You can check OpenVZ project (http://openvz.org) for an example of
the required resource management. BTW, I must agree with other people here
who noticed that per-process resource management is really useless and
hard to use :(

Briefly about required resource management:

1) CPU:
- fairness (i.e. prioritization of containers). For this we use an SFQ-like
fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses a
token bucket algorithm. I can provide more details on this if you are
interested.
- cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For
this we account the time in cycles, and after some credit is used we
delay container execution. We use cycles because our experiments show that
statistical algorithms work poorly on some patterns :(
- cpu guarantees. I'm not sure any of the solutions provide this yet.

2) disk:
- overall disk quota for container
- per-user/group quotas inside container

in OpenVZ we wrote a 2-level disk quota which works on disk subtrees.
vserver imho uses a 1 partition per container approach.

- disk I/O bandwidth:
we started to use CFQv2, but it is quite poor in this regard. First, it
doesn't prioritize writes and async disk operations :( And even for
sync reads we found some problems we are working on now...

3) memory and other resources.
- memory
- files
- signals and so on and so on.
For example, in OpenVZ we have user resource beancounters (original
author is Alan Cox), which account the following set of parameters:
kernel memory (vmas, page tables, different structures etc.), dcache
pinned size, different user pages (locked, physical, private, shared),
number of files, sockets, ptys, signals, network buffers, netfilter
rules etc.

4. network bandwidth
traffic shaping is already ok here.

>>>> Is it possible to use the scheduler_tick() function take a look at all
>>>> deactivated tasks (as efficiently as possible) and activate them when
>>>> its time to activate them or just fold the functionality by defining a
>>>> time quantum after which everyone is worken up.  This time quantum
>>>> could be the same as the time over which limits are honoured.
>>
>> agree with it.
>
> Thinking a bit more along these lines, it would probably break O(1).
> But I guess a good algorithm can amortize the cost.

this is the price to pay. but it happens quite rarely, as was noticed
already...

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  7:41 ` Kirill Korotaev
@ 2006-06-01  8:34 ` Balbir Singh
  2006-06-01 23:47 ` Sam Vilain
  0 siblings, 1 reply; 27+ messages in thread
From: Balbir Singh @ 2006-06-01 8:34 UTC (permalink / raw)
To: Kirill Korotaev
Cc: Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman, Sam Vilain, Andrew Morton, Eric W. Biederman, Srivatsa, ckrm-tech

Hi, Kirill,

Kirill Korotaev wrote:
>> Do you have any documented requirements for container resource
>> management?
>> Is there a minimum list of features and nice to have features for
>> containers
>> as far as resource management is concerned?
>
> Sure! You can check OpenVZ project (http://openvz.org) for example of
> required resource management. BTW, I must agree with other people here
> who noticed that per-process resource management is really useless and
> hard to use :(

I'll take a look at the references. I agree with you that it will be
useful to have resource management for a group of tasks.

> Briefly about required resource management:
> 1) CPU:
> - fairness (i.e. prioritization of containers). For this we use SFQ like
> fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses
> token bucket algorithm. I can provide more details on this if you are
> interested.

Yes, any information or pointers to them will be very useful.

> - cpu limits (soft, hard). OpenVZ provides only hard cpu limits. For
> this we account the time in cycles. And after some credit is used do
> delay of container execution. We use cycles as our experiments show that
> statistical algorithms work poorly on some patterns :(
> - cpu guarantees. I'm not sure any of solutions provide this yet.

ckrm has a solution to provide cpu guarantees. I think as far as CPU
resource management is concerned (limits or guarantees), there are
common problems to be solved, for example

1. Tracking when a limit or a guarantee is not met
2. Taking a decision to cap the group
3. Selecting the next task to execute (keeping O(1) in mind)

For the existing resource controller in OpenVZ I would be interested
in information on the kinds of patterns it does not perform well on
and the patterns it performs well on.

> 2) disk:
> - overall disk quota for container
> - per-user/group quotas inside container
>
> in OpenVZ we wrote a 2level disk quota which works on disk subtrees.
> vserver imho uses 1 partition per container approach.
>
> - disk I/O bandwidth:
> we started to use CFQv2, but it is quite poor in this regard. First, it
> doesn't prioritizes writes and async disk operations :( And even for
> sync reads we found some problems we work on now...
>
> 3) memory and other resources.
> - memory
> - files
> - signals and so on and so on.
> For example, in OpenVZ we have user resource beancounters (original
> author is Alan Cox), which account the following set of parameters:
> kernel memory (vmas, page tables, different structures etc.), dcache
> pinned size, different user pages (locked, physical, private, shared),
> number of files, sockets, ptys, signals, network buffers, netfilter
> rules etc.
>
> 4. network bandwidth
> traffic shaping is already ok here.

Traffic shaping is just for outgoing traffic, right? How about incoming
traffic (through the accept call)?

These are a great set of requirements. Thanks for putting them together.

>> Thinking a bit more along these lines, it would probably break O(1).
>> But I guess a good
>> algorithm can amortize the cost.
>
> this is the price to pay. but it happens quite rarelly as was noticed
> already...

Yes, agreed.

> Kirill

--
Balbir Singh,
Linux Technology Center,
IBM Software Labs

PS: I am also cc'ing ckrm-tech and srivatsa

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  8:34 ` Balbir Singh
@ 2006-06-01 23:47 ` Sam Vilain
  0 siblings, 0 replies; 27+ messages in thread
From: Sam Vilain @ 2006-06-01 23:47 UTC (permalink / raw)
To: balbir
Cc: Kirill Korotaev, Peter Williams, Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman, Andrew Morton, Eric W. Biederman, Srivatsa, ckrm-tech

Balbir Singh wrote:
>> 1) CPU:
>> - fairness (i.e. prioritization of containers). For this we use SFQ like
>> fair cpu scheduler with virtual cpus (runqueues). Linux-vserver uses
>> token bucket algorithm. I can provide more details on this if you are
>> interested.
>
> Yes, any information or pointers to them will be very useful.

A general description of the token bucket scheduler is on the Vserver
wiki at

  http://linux-vserver.org/Linux-VServer-Paper-06

I also just described it on a nearby thread -

  http://lkml.org/lkml/2006/5/28/122

Sam.

^ permalink raw reply	[flat|nested] 27+ messages in thread
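[Editor's note: for readers without the paper handy, the token-bucket CPU limiter referenced above can be sketched as a simple model. This is an illustrative simplification, not Linux-VServer's actual code; the struct, field names, and parameters are invented here. A context earns fill_rate tokens every interval ticks up to a bucket capacity, spends one token per tick of CPU it consumes, and is descheduled while the bucket is empty.]

```c
/* Simplified token-bucket CPU limiter, one bucket per context. */
struct tb_context {
	long tokens;     /* current bucket fill */
	long tokens_max; /* bucket capacity (burst allowance) */
	long fill_rate;  /* tokens added per interval */
	long interval;   /* ticks between refills */
};

/* Refill the bucket after `ticks` ticks have elapsed, clamping at
 * capacity so an idle context can only accumulate a bounded burst. */
static void tb_refill(struct tb_context *ctx, long ticks)
{
	ctx->tokens += (ticks / ctx->interval) * ctx->fill_rate;
	if (ctx->tokens > ctx->tokens_max)
		ctx->tokens = ctx->tokens_max;
}

/* Charge one tick of CPU time; returns 1 if the context may keep
 * running, 0 if it must be taken off the runqueue until refilled. */
static int tb_charge(struct tb_context *ctx)
{
	if (ctx->tokens <= 0)
		return 0;
	ctx->tokens--;
	return 1;
}
```

The long-run CPU share works out to fill_rate/interval, while tokens_max controls how large a burst a context may consume after idling, which is the trade-off the hard-cap discussion in this thread is circling around.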
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 15:59 ` Balbir Singh
  2006-05-31 18:09 ` Mike Galbraith
  2006-06-01  7:41 ` Kirill Korotaev
@ 2006-06-01 23:43 ` Peter Williams
  2 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-06-01 23:43 UTC (permalink / raw)
  To: balbir
  Cc: Kirill Korotaev, Balbir Singh, Mike Galbraith, Con Kolivas,
      Linux Kernel, Kingsley Cheung, Ingo Molnar, Rene Herman

Balbir Singh wrote:
> Kirill Korotaev wrote:
>>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>
>>> The more runnable tasks there are the less likely it is that any of
>>> them is exceeding its hard cap due to normal competition for the
>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue... :( actually even for 1000 tasks
>> (I suppose this is the maximum in your case) it can slow down
>> significantly as well.
>
> Do you have any documented requirements for container resource management?
> Is there a minimum list of features and nice to have features for
> containers as far as resource management is concerned?
>
>>>> Is it possible to use the scheduler_tick() function to take a look at
>>>> all deactivated tasks (as efficiently as possible) and activate them
>>>> when it's time to activate them or just fold the functionality by
>>>> defining a time quantum after which everyone is woken up. This time
>>>> quantum could be the same as the time over which limits are honoured.
>>
>> agree with it.
>
> Thinking a bit more along these lines, it would probably break O(1). But
> I guess a good algorithm can amortize the cost.

It's also unlikely to be less overhead than using timers.  In fact, my
gut feeling is that you'd actually be doing something very similar to
how timers work only cruder.  I.e. reinventing the wheel.
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 13:10 ` Kirill Korotaev
  2006-05-31 15:59 ` Balbir Singh
@ 2006-05-31 23:28 ` Peter Williams
  2006-06-01  7:44 ` Kirill Korotaev
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Williams @ 2006-05-31 23:28 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
      Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>
>> The more runnable tasks there are the less likely it is that any of
>> them is exceeding its hard cap due to normal competition for the
>> CPUs.  So I think that it's unlikely that there will ever be a very
>> large number of tasks in the sinbin at the same time.
>
> for containers this can be untrue...

Why will this be untrue for containers?

> :( actually even for 1000 tasks (I suppose this is the maximum in your
> case) it can slow down significantly as well.
>
>>> Is it possible to use the scheduler_tick() function to take a look at
>>> all deactivated tasks (as efficiently as possible) and activate them
>>> when it's time to activate them or just fold the functionality by
>>> defining a time quantum after which everyone is woken up. This time
>>> quantum could be the same as the time over which limits are honoured.
>
> agree with it.

If there are a lot of RUNNABLE (i.e. on a run queue) tasks then normal
competition will mean that their CPU usage rates are small and therefore
unlikely to be greater than their cap.  The sinbin is only used for
tasks that are EXCEEDING their cap.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-05-31 23:28 ` Peter Williams
@ 2006-06-01  7:44 ` Kirill Korotaev
  2006-06-01 23:21 ` Peter Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Kirill Korotaev @ 2006-06-01 7:44 UTC (permalink / raw)
  To: Peter Williams
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
      Kingsley Cheung, Ingo Molnar, Rene Herman

>>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>
>>> The more runnable tasks there are the less likely it is that any of
>>> them is exceeding its hard cap due to normal competition for the
>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>> large number of tasks in the sinbin at the same time.
>>
>> for containers this can be untrue...
>
> Why will this be untrue for containers?

if one container having 100 running tasks inside exceeded its credit,
it should be delayed. i.e. 100 tasks should be placed in sinbin if I
understand your algo correctly. the second container having 100 tasks
as well will do the same.

Kirill

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [RFC 3/5] sched: Add CPU rate hard caps
  2006-06-01  7:44 ` Kirill Korotaev
@ 2006-06-01 23:21 ` Peter Williams
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Williams @ 2006-06-01 23:21 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Balbir Singh, Mike Galbraith, Con Kolivas, Linux Kernel,
      Kingsley Cheung, Ingo Molnar, Rene Herman

Kirill Korotaev wrote:
>>>>> Using a timer for releasing tasks from their sinbin sounds like a bit
>>>>> of an overhead. Given that there could be 10s of thousands of tasks.
>>>>
>>>> The more runnable tasks there are the less likely it is that any of
>>>> them is exceeding its hard cap due to normal competition for the
>>>> CPUs.  So I think that it's unlikely that there will ever be a very
>>>> large number of tasks in the sinbin at the same time.
>>>
>>> for containers this can be untrue...
>>
>> Why will this be untrue for containers?
>
> if one container having 100 running tasks inside exceeded its credit,
> it should be delayed. i.e. 100 tasks should be placed in sinbin if I
> understand your algo correctly. the second container having 100 tasks
> as well will do the same.

1. Caps are set on a per task basis not on a group basis.

2. Sinbinning is the last resort and only used for hard caps.  The soft
capping mechanism is also applied to hard capped tasks and natural
competition also tends to reduce usage rates.  In general, sinbinning
will only kick in on lightly loaded systems where there is no
competition for CPU resources.

Further, there is a natural ceiling of 999 per CPU on the number of
tasks that will ever be in the sinbin at the same time.  To achieve this
maximum some very unusual circumstances have to prevail:

1. these 999 tasks must be the only runnable tasks on the system,

2. they all must have a cap of 1/1000, and

3. the distribution of CPU among them must be perfectly fair so that
they all have the expected average usage rate of 1/999.

If you add one more task to this mix the average usage would be 1/1000
and if they all had that none would be exceeding their cap and there
would be no sinbinning at all.

Of course, in reality, half would be slightly above the average and half
slightly below and about 500 would be sinbinned.  But this reality check
also applies to the 999 and somewhat less than 999 would actually be
sinbinned.

As the number of runnable tasks increases beyond 1000 then the number
that have a usage rate greater than their cap will decrease and quickly
reach zero.

So the conclusion is that the maximum number of sinbinned tasks per CPU
is given by:

min(1000 / min_cpu_rate_cap - 1, nr_running)

As you can see, if a minimum cpu cap of 1 causes problems we can just
increase that minimum.

And once again I ask what's so special about containers that changes this?

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2006-06-06 10:47 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-01 21:03 [RFC 3/5] sched: Add CPU rate hard caps Al Boldi
2006-06-02  1:33 ` Peter Williams
2006-06-02 11:23 ` Matt Helsley
2006-06-02 13:16 ` Peter Williams
2006-06-06 10:47 ` Srivatsa Vaddagiri
  -- strict thread matches above, loose matches on Subject: below --
2006-05-26  4:20 [RFC 0/5] sched: Add CPU rate caps Peter Williams
2006-05-26  4:20 ` [RFC 3/5] sched: Add CPU rate hard caps Peter Williams
2006-05-26  6:58 ` Kari Hurtta
2006-05-27  1:00 ` Peter Williams
2006-05-26 11:00 ` Con Kolivas
2006-05-26 13:59 ` Peter Williams
2006-05-26 14:12 ` Con Kolivas
2006-05-26 14:23 ` Mike Galbraith
2006-05-27  0:16 ` Peter Williams
2006-05-27  9:28 ` Mike Galbraith
2006-05-28  2:09 ` Peter Williams
2006-05-27  6:48 ` Balbir Singh
2006-05-27  8:44 ` Peter Williams
2006-05-31 13:10 ` Kirill Korotaev
2006-05-31 15:59 ` Balbir Singh
2006-05-31 18:09 ` Mike Galbraith
2006-06-01  7:41 ` Kirill Korotaev
2006-06-01  8:34 ` Balbir Singh
2006-06-01 23:47 ` Sam Vilain
2006-06-01 23:43 ` Peter Williams
2006-05-31 23:28 ` Peter Williams
2006-06-01  7:44 ` Kirill Korotaev
2006-06-01 23:21 ` Peter Williams