* [RFC] CPU controllers?
@ 2006-06-15 13:46 Srivatsa Vaddagiri
0 siblings, 3 replies; 36+ messages in thread
From: Srivatsa Vaddagiri @ 2006-06-15 13:46 UTC (permalink / raw)
To: Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Nick Piggin, Peter Williams, Andrew Morton, sekharan,
Balbir Singh
Cc: linux-kernel
Hello,
There have been several proposals so far on this subject and no
consensus seems to have been reached on what an acceptable CPU controller
for Linux needs to provide. I am hoping this mail will trigger some
discussions in that regard. In particular I am keen to know what the
various maintainers think about this subject.
The various approaches proposed so far are:
- CPU rate-cap (limit CPU execution rate per-task)
http://lkml.org/lkml/2006/5/26/7
- f-series CKRM controller (CPU usage guarantee for a task-group)
http://lkml.org/lkml/2006/4/27/399
- e-series CKRM controller (CPU usage guarantee/limit for a task-group)
http://prdownloads.sourceforge.net/ckrm/cpu.ckrm-e18.v10.patch.gz?download
- OpenVZ controller (CPU usage guarantee/hard-limit for a task-group)
http://openvz.org/
- vserver controller (CPU usage guarantee(?)/limit for a task-group)
http://linux-vserver.org/
(I apologize if I have missed any other significant proposal for Linux)
Their salient features and limitations/drawbacks, as far as I could gather, are
summarized later below. Note that each controller varies in degree of
complexity and addresses its own set of requirements.
In working toward an acceptable controller in mainline, it would help, IMHO,
if we put together the set of requirements which a Linux CPU controller
should support. Some questions that arise in this regard are:
- Do we need mechanisms to control CPU usage of tasks, beyond what
already exists (like nice)? IMO yes.
- What are the requirements of such a CPU controller? Some of them to
consider are:
- Should it operate on a per-task basis or on a per-task-group
basis?
- Should it support more than one level of task-groups?
- If we want to operate on a per-task-group basis, which mechanism
do we use for grouping tasks (Resource Groups, PAGG,
uid/session id, ...)?
- Should it support both limits and guarantees? In the case of
limits, should it support both soft and hard limits?
- What interface do we choose for the user to specify a
limit/guarantee: system call or filesystem based (e.g. /proc or
Resource Groups' rcfs)?
- Over what interval should guarantee/limit be monitored and
controlled?
- With what accuracy should we allow the limit/guarantee to be
expressed?
- Co-existence with CPUset - should guarantee/limit be
enforced only on the set of CPUs attached to the cpuset?
- Should real-time tasks be outside the purview of this control?
- Should load balancing be made aware of the guarantee/limit of
tasks (or task-groups)? Of course yes!
One possibility is to add a basic controller that addresses some minimal
requirements to begin with, and progressively enhance its capabilities. From
this pov, both the f-series resource group controller and cpu rate-cap seem
to be good candidates for a minimal controller to begin with.
Thoughts?
The salient features of the various CPU controllers proposed so far are
summarized below. I have not captured the OpenVZ and VServer controller
aspects well - I request the maintainers to fill in!
1. CPU Rate Cap (by Peter Williams)
Features:
* Limit CPU execution rate on a per-task basis.
* Limit specified in terms of parts-per-thousand. Limit is set through
the /proc interface.
* Supports hard limit and soft limit
* Introduces new task priorities where tasks that have exceeded their
soft limit can be "parked" until the O(1) scheduler picks them for
execution
* Load balancing on SMP systems made aware of tasks whose execution
rate is limited by this feature
* Patch is simple
Limitations:
* Does not support guarantee
Drawbacks:
* Limiting the CPU execution rate of a group of tasks has to be tackled
from an external module (user or kernel space), which may make this
approach somewhat inconvenient to implement for task-groups.
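For illustration, the accounting test at the heart of such a per-task cap can be sketched as follows. This is not code from the rate-cap patch itself; the struct and function names are hypothetical, showing only the parts-per-thousand comparison a scheduler would make when deciding whether to "park" a task that has exceeded its soft limit:

```c
#include <assert.h>

/* Hypothetical per-task cap state: limits are expressed in
 * parts-per-thousand of elapsed time, as in the rate-cap proposal. */
struct task_cap {
	unsigned long cpu_used;   /* CPU time consumed (ticks) */
	unsigned long wall_time;  /* wall-clock ticks since tracking began */
	unsigned int  soft_cap;   /* parts-per-thousand, 0..1000 */
};

/* Return nonzero if the task has exceeded its soft cap and should be
 * "parked" at a low priority until it falls back under the limit. */
int over_soft_cap(const struct task_cap *tc)
{
	if (tc->wall_time == 0)
		return 0;
	/* used/wall > cap/1000, rearranged to avoid division */
	return tc->cpu_used * 1000 >
	       (unsigned long)tc->soft_cap * tc->wall_time;
}
```

A hard cap would use the same comparison against a second threshold, with exceeding tasks removed from the runqueue entirely rather than merely demoted.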
2. Timeslice scaling (Maeda Naoaki and Kurosawa Takahiro)
Features:
* Provides a guaranteed CPU execution rate on a per-task-group basis.
The guarantee is provided over an interval of 5 seconds.
* Hooked into the Resource Group infrastructure currently, and hence
the guarantee/limit is set through Resource Groups' RCFS interface.
* Achieves guaranteed execution by scaling down the timeslice of tasks
that are above their guaranteed execution rate. The timeslice can be
scaled down only to a minimum of 1 slice.
* Does not scale down timeslice of interactive tasks (even if their
CPU usage is beyond what is guaranteed) and does not avoid requeue
of interactive tasks.
* Patch is quite simple
Limitations:
* Does not support limiting task-group CPU execution rate
Drawbacks:
(Some of the drawbacks listed are probably being addressed currently
with a redesign - which we are yet to see)
* Interactive tasks (and their requeuing) can come in the way of
providing guaranteed execution rate to other tasks
* SMP load balancing does not take into account guarantee provided to
task groups.
* It may not be possible to restrict the CPU usage of a task group to
only its guaranteed usage if the task-group has a large number of tasks
(each task runs for a minimum of 1 timeslice)
* May not handle bursty loads
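The timeslice-scaling idea can be sketched as below. This is not the actual Maeda/Kurosawa patch - the function and parameter names are invented for illustration - but it shows the core rule: a group over its guarantee has its tasks' timeslices shrunk in proportion to the overrun, with a floor of one tick (which is exactly why a group with very many tasks cannot be held to its guarantee):

```c
#include <assert.h>

/* Hypothetical helper: `base_slice` is the task's normal timeslice,
 * `used` of `total` recent ticks were consumed by the task's group,
 * and `guar` is the group's guarantee in parts-per-thousand. */
unsigned int scaled_timeslice(unsigned int base_slice,
			      unsigned long used,
			      unsigned long total,
			      unsigned int guar /* 0..1000 */)
{
	unsigned long entitled = (unsigned long)guar * total / 1000;
	unsigned long slice;

	if (total == 0 || used <= entitled)
		return base_slice;       /* within guarantee: untouched */

	/* over guarantee: shrink in proportion entitled/used */
	slice = (unsigned long)base_slice * entitled / used;
	return slice ? (unsigned int)slice : 1;  /* floor of 1 tick */
}
```

Note how the floor guarantees forward progress for every runnable task but caps how much a large group can be throttled, matching the drawback listed above.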
3. Resource Group e-series CPU controller
Features:
* Provides both guarantee and limit for CPU execution rate of task
groups (classes)
* Two-level scheduling. Pick a task-group (class) to execute first and
then a task within the task-group. Both are of O(1) complexity.
* Classes are given priorities based on their guaranteed CPU usage,
accumulated CPU execution, and the highest-priority task present
within the group. The class with the highest priority is picked
for execution next.
* Guarantee/Limit specified in terms of shares
Drawbacks:
* Complexity
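The two-level pick could be sketched roughly as follows. This illustrates only the general shape - the real e-series controller uses O(1) priority arrays rather than a scan, and its priority formula is more involved; all names here are hypothetical. A class that has consumed little CPU relative to its guaranteed shares gets a better (lower) key, and the best runnable task's priority breaks ties:

```c
#include <assert.h>

struct cpu_class {
	int shares;          /* guaranteed shares */
	long cpu_consumed;   /* accumulated CPU time */
	int best_task_prio;  /* best runnable task's prio (lower = better) */
};

/* Lower key = more deserving of the CPU next. */
long class_key(const struct cpu_class *c)
{
	long served = c->shares ? c->cpu_consumed / c->shares
				: c->cpu_consumed;
	return served * 100 + c->best_task_prio;
}

/* Level 1: pick a class; level 2 (not shown) picks a task within it
 * using the normal per-class runqueue. */
int pick_class(const struct cpu_class *cls, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)
		if (class_key(&cls[i]) < class_key(&cls[best]))
			best = i;
	return best;
}
```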
3. OpenVZ CPU controller
Features:
- Provides both guarantee [1] and (hard) limit for CPU execution rate
of task group (containers)
- Multi-level scheduler (Pick a task-group to run first, then pick a
virtual-cpu and then a task)
- Virtual cpu concept makes group-aware SMP load balancing easy
- Uses cycles (rather than ticks) consumed for accounting (?)
[1] - http://download.openvz.org/doc/OpenVZ-Users-Guide.pdf
Limitations:
- ?
Drawbacks:
- ?
4. VServer CPU controller
Features:
- Token-bucket based
Drawbacks:
- ?
Limitations:
- ?
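For readers unfamiliar with the mechanism, a generic token bucket of the kind the VServer scheduler is based on can be sketched as follows (the field names are illustrative, not VServer's actual ones): each group earns tokens at a fixed rate up to a ceiling and spends one token per tick its tasks run; an empty bucket means the group's tasks are held back until it refills:

```c
#include <assert.h>

struct tbucket {
	int tokens;       /* current tokens */
	int bucket_size;  /* maximum tokens the bucket can hold */
	int fill_rate;    /* tokens added per interval */
	int interval;     /* ticks per refill */
};

/* Credit the bucket for elapsed time, clamped to the bucket size. */
void tb_refill(struct tbucket *tb, int ticks_elapsed)
{
	tb->tokens += (ticks_elapsed / tb->interval) * tb->fill_rate;
	if (tb->tokens > tb->bucket_size)
		tb->tokens = tb->bucket_size;
}

/* Called for each tick a task of this group runs; returns nonzero
 * if the group may keep running. */
int tb_consume(struct tbucket *tb)
{
	if (tb->tokens <= 0)
		return 0;
	tb->tokens--;
	return 1;
}
```

The ratio fill_rate/interval gives the group's long-term CPU share, while bucket_size bounds how large a burst it can accumulate - which is why a plain token bucket yields limits and soft fairness rather than hard guarantees.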
--
Regards,
vatsa
* Re: [RFC] CPU controllers?
From: Sam Vilain @ 2006-06-15 21:52 UTC (permalink / raw)
To: vatsa
Cc: Kirill Korotaev, Mike Galbraith, Ingo Molnar, Nick Piggin,
Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel

Srivatsa Vaddagiri wrote:
> One possibility is to add a basic controller that addresses some minimal
> requirements to begin with, and progressively enhance its capabilities.
> From this pov, both the f-series resource group controller and cpu
> rate-cap seem to be good candidates for a minimal controller to begin
> with.
>
> Thoughts?

Sounds like you're on the right track, but I don't know whether we can
truly be happy making the performance/guarantee trade-off decision for
the user.

You could grossly put the solutions into several camps:

1. solutions which have very low impact and provide soft assurances only
2. solutions which provide hard limits
3. solutions which provide guarantees

I think it's almost invariant that the latter solutions have more of a
performance impact, and that it's quite important that normal system
throughput does not suffer from the "scheduling namespace" solution that
we come up with.

> [...]
> 2. Timeslice scaling (Maeda Naoaki and Kurosawa Takahiro)
> [...]
> 4. VServer CPU controller
>
> Features:
> - Token-bucket based

The VServer scheduler is also timeslice scaling - it just uses the token
bucket to know how much to scale the timeslices. It doesn't care about
interactive bonuses, although it does lessen the interactivity bonus a
notch or two (to -5..+5). This means that it's performance neutral in
the general case.

> Drawbacks:
> - ?

It fits into category 1 (or, using Herbert Poetzl's enhancements, 2), so
it does not provide guarantees.

> Limitations:
> - ?

It doesn't deal with huge numbers of processes; but with task group
ulimits that problem goes away in practice.

Sam.
* Re: [RFC] CPU controllers?
From: Peter Williams @ 2006-06-15 23:30 UTC (permalink / raw)
To: vatsa
Cc: Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Nick Piggin, Andrew Morton, sekharan, Balbir Singh, linux-kernel

Srivatsa Vaddagiri wrote:
> [...]
> One possibility is to add a basic controller that addresses some minimal
> requirements to begin with, and progressively enhance its capabilities.

I would amend this to say "provide the basic controllers and let more
complex management mechanisms use them (from outside the scheduler) to
provide higher level control". An essential part of this will be the
provision of statistics for these external controllers to use.

> From this pov, both the f-series resource group controller and cpu
> rate-cap seem to be good candidates for a minimal controller to begin
> with.
>
> Thoughts?
>
> [...]
> 1. CPU Rate Cap (by Peter Williams)
>
> Features:
>
> * Limit CPU execution rate on a per-task basis.
> * Limit specified in terms of parts-per-thousand. Limit is set through
> the /proc interface.

The /proc interface is not an essential part of this patch; the reason
it was implemented is that it was simple, easy and useful for testing.
The patch "proper" provides four functions for setting/getting the
soft/hard caps and exports these so that they can be used from modules.
I.e. it would be very easy to replace the /proc interface with another
one (or more), or to keep it and make another interface as well. All the
essential testing/processing required for setting the caps properly is
inside the functions, NOT the /proc interface.

> * Supports hard limit and soft limit
> [...]
>
> Limitations:
> * Does not support guarantee

Why would a capping mechanism support guarantees? The two mechanisms can
be implemented separately. The only interaction between them that is
required is a statement about which has precedence, i.e. if a cap is
less than a guarantee, is it enforced? I would opine that it should be.

BTW if "nice" works properly, guarantees can be implemented by suitable
fiddling of task "nice" values.

> Drawbacks:
> * Limiting CPU execution rate of a group of tasks has to be tackled
> from an external module (user or kernel space) which may make this
> approach somewhat inconvenient to implement for task-groups.

Nevertheless it can be done, and it has the advantage that the cost is
only borne by those who wish to use such high level controls. The caps
provided by this (simple) patch provide functionality that ordinary
users can find useful. In particular, the use of a soft cap of zero to
effectively put a task (and all of its children) in the background is
very useful for doing software builds on a work station. Con Kolivas's
SCHED_IDLE scheduling class in his staircase scheduler provides the same
functionality and is (from all reports) very popular. The key difference
between soft caps and the SCHED_IDLE mechanism is that soft caps are
more general, in that limits other than zero can be specified. This
provides more flexibility.

Peter
--
Peter Williams pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [RFC] CPU controllers?
From: Matt Helsley @ 2006-06-16 0:42 UTC (permalink / raw)
To: Peter Williams
Cc: vatsa, Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Nick Piggin, Andrew Morton, Chandra S. Seetharaman, Balbir Singh, LKML

On Fri, 2006-06-16 at 09:30 +1000, Peter Williams wrote:
> Srivatsa Vaddagiri wrote:
> <snip>
> > Limitations:
> > * Does not support guarantee
>
> Why would a capping mechanism support guarantees? The two mechanisms
> can be implemented separately. The only interaction between them that
> is required is a statement about which has precedence. I.e. if a cap is
> less than a guarantee is it enforced? I would opine that it should be.

When this combination occurs userspace is crazy/uncoordinated/dumb and
can't be "satisfied". Perhaps the better approach is to ignore both
guarantee and limit (cap) in this case -- treat it as if userspace
hasn't specified either. Alternatively the kernel can refuse to allow
configuring such a combination in the first place. This is one reason
tying guarantees and limits (caps) into the same framework would be
useful.

<snip>

Cheers,
-Matt Helsley
* Re: [RFC] CPU controllers?
From: Nick Piggin @ 2006-06-17 8:48 UTC (permalink / raw)
To: vatsa
Cc: Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel

Srivatsa Vaddagiri wrote:
> [...]
> Some questions that arise in this regard are:
>
> - Do we need mechanisms to control CPU usage of tasks, further to what
> already exists (like nice)? IMO yes.

Can we get back to the question of need? And from there, work out what
features are wanted.

IMHO, having containers try to virtualise all resources (memory,
pagecache, slab cache, CPU, disk/network IO...) seems insane: we may
just as well use virtualisation.

So, from my POV, I would like to be convinced of the need for this
first. I would really love to be able to keep the core kernel simple and
fast even if it means edge cases might need to use a slightly different
solution.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
* Re: [RFC] CPU controllers?
From: Balbir Singh @ 2006-06-17 15:55 UTC (permalink / raw)
To: Nick Piggin
Cc: vatsa, Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Peter Williams, Andrew Morton, sekharan, linux-kernel

On 6/17/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> [...]
> Can we get back to the question of need? And from there, work out what
> features are wanted.
>
> IMHO, having containers try to virtualise all resources (memory,
> pagecache, slab cache, CPU, disk/network IO...) seems insane: we may
> just as well use virtualisation.
>
> So, from my POV, I would like to be convinced of the need for this
> first. I would really love to be able to keep core kernel simple and
> fast even if it means edge cases might need to use a slightly
> different solution.

The simplest example that comes to my mind to explain the need is
through quality of service. Consider a single system running two
instances of an application (let's say a web portal or a database
server). If one of the instances is production and the other is
development, and the development instance is being stress tested - how
do I provide reliable quality of service to the users of the production
instance?

I am sure other people will probably have better examples.

Warm Regards,
Balbir
Linux Technology Center
IBM, ISL
* Re: [RFC] CPU controllers?
From: Srivatsa Vaddagiri @ 2006-06-17 16:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel,
maeda.naoaki, kurosawa

On Sat, Jun 17, 2006 at 06:48:17PM +1000, Nick Piggin wrote:
> Srivatsa Vaddagiri wrote:
> > - Do we need mechanisms to control CPU usage of tasks, further to
> > what already exists (like nice)? IMO yes.
>
> Can we get back to the question of need? And from there, work out what
> features are wanted.
>
> IMHO, having containers try to virtualise all resources (memory,
> pagecache, slab cache, CPU, disk/network IO...) seems insane: we may
> just as well use virtualisation.
>
> So, from my POV, I would like to be convinced of the need for this
> first. I would really love to be able to keep core kernel simple and
> fast even if it means edge cases might need to use a slightly
> different solution.

I think a proportional-share scheduler (which is what a CPU controller
may provide) has non-container uses also. Do you think nice (or sched
policy) is enough to, say, provide guaranteed CPU usage for applications
or to limit their CPU usage? Moreover, it is more flexible if a
guarantee/limit can be specified for a group of tasks rather than for
individual tasks, even in non-container scenarios (like limiting the CPU
usage of all web-server tasks together, or limiting the CPU usage of a
make -j command).

--
Regards,
vatsa
* Re: [RFC] CPU controllers?
From: Nick Piggin @ 2006-06-18 5:06 UTC (permalink / raw)
To: vatsa
Cc: Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar,
Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel,
maeda.naoaki, kurosawa

Srivatsa Vaddagiri wrote:
> On Sat, Jun 17, 2006 at 06:48:17PM +1000, Nick Piggin wrote:
>> [...]
>> So, from my POV, I would like to be convinced of the need for this
>> first. I would really love to be able to keep core kernel simple and
>> fast even if it means edge cases might need to use a slightly
>> different solution.
>
> I think a proportional-share scheduler (which is what a CPU controller
> may provide) has non-container uses also. Do you think nice (or sched
> policy) is enough to, say, provide guaranteed CPU usage for
> applications or limit their CPU usage? Moreover it is more flexible
> if guarantee/limit can be specified for a group of tasks, rather than
> individual tasks even in non-container scenarios (like limiting CPU
> usage of all web-server tasks together or for limiting CPU usage of
> make -j command).

Oh, I'm sure there are lots of things we *could* do that we currently
can't.

What I want to establish first is: what exact functionality is required,
why, and by whom. Only then can we sanely discuss the fitness of
solutions and propose alternatives, and decide whether to merge.

--
SUSE Labs, Novell Inc.
* Re: [RFC] CPU controllers?
From: Sam Vilain @ 2006-06-18 5:53 UTC (permalink / raw)
To: Nick Piggin
Cc: vatsa, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams,
Andrew Morton, sekharan, Balbir Singh, linux-kernel, maeda.naoaki,
kurosawa

Nick Piggin wrote:
>> I think a proportional-share scheduler (which is what a CPU controller
>> may provide) has non-container uses also. [...]
>
> Oh, I'm sure there are lots of things we *could* do that we currently
> can't.
>
> What I want to establish first is: what exact functionality is
> required, why, and by whom.

You make it sound like users should feel sorry for wanting features
already commonly available on other high performance unix kernels.

The answer is quite simple: people who are consolidating systems and
working with fewer, larger systems want to mark processes, groups of
processes or entire containers into CPU scheduling classes, then either
fair balance between them, limit them or reserve them a portion of the
CPU - depending on the user and what their requirements are. What is
unclear about that?

Yes, this does get somewhat simpler if you strap yourself into a
complete virtualisation straightjacket, but the current thread is not
about that approach - and the continual suggestions that we are all just
being stupid and going about it the wrong way are locally off-topic.

Bear in mind that we have on the table at least one group of scheduling
solutions (timeslice scaling based ones, such as the VServer one) which
has virtually no overhead and could potentially provide the "jumpers"
necessary for implementing more complex scheduling policies.

Sam.

> Only then can we sanely discuss the fitness of solutions and propose
> alternatives, and decide whether to merge.
* Re: [RFC] CPU controllers?
From: Nick Piggin @ 2006-06-18 6:11 UTC (permalink / raw)
To: Sam Vilain
Cc: vatsa, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams,
Andrew Morton, sekharan, Balbir Singh, linux-kernel, maeda.naoaki,
kurosawa

Sam Vilain wrote:
> Nick Piggin wrote:
>> What I want to establish first is: what exact functionality is
>> required, why, and by whom.
>
> You make it sound like users should feel sorry for wanting features
> already commonly available on other high performance unix kernels.

If telling me what exact functionality they want is going to cause them
so much pain, I suppose they should feel sorry for themselves.

And I don't care about any other kernels, unix or not. I care about what
Linux users want.

> The answer is quite simple: people who are consolidating systems and
> working with fewer, larger systems want to mark processes, groups of
> processes or entire containers into CPU scheduling classes, then
> either fair balance between them, limit them or reserve them a portion
> of the CPU - depending on the user and what their requirements are.
> What is unclear about that?

It is unclear whether we should have hard limits, or just nice-like
priority levels. Whether virtualisation (+/- containers) could be a good
solution, etc.

If you want to *completely* isolate N groups of users, surely you have
to use virtualisation, unless you are willing to isolate memory
management, pagecache, slab caches, network and disk IO, etc.

> Yes, this does get somewhat simpler if you strap yourself into a
> complete virtualisation straightjacket, but the current thread is not
> about that approach - and the continual suggestions that we are all
> just being stupid and going about it the wrong way are locally
> off-topic.

I'm sorry you cannot come up with a statement of the functionality you
require without badmouthing "complete" virtualisation or implying that
I'm saying you're stupid.

I think the containers people might also recognise that it may not be
the best solution to make containers the be-all and end-all of
consolidating systems, and virtualisation is a very relevant topic when
discussing pros and cons and alternate solutions. But at this point I'm
yet to be shown what the *problem* is. I'm not trying to deny that one
might exist.

> Bear in mind that we have on the table at least one group of
> scheduling solutions (timeslice scaling based ones, such as the
> VServer one) which has virtually no overhead and could potentially
> provide the "jumpers" necessary for implementing more complex
> scheduling policies.

Again, I don't care about the solutions at this stage. I want to know
what the problem is. Please?

--
SUSE Labs, Novell Inc.
* Re: [RFC] CPU controllers? 2006-06-18 6:11 ` Nick Piggin @ 2006-06-18 6:40 ` Sam Vilain 2006-06-18 7:17 ` Nick Piggin 2006-06-18 6:42 ` Andrew Morton 1 sibling, 1 reply; 36+ messages in thread From: Sam Vilain @ 2006-06-18 6:40 UTC (permalink / raw) To: Nick Piggin Cc: vatsa, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, maeda.naoaki, kurosawa Nick Piggin wrote: >> The answer is quite simple, people who are consolidating systems and >> working with fewer, larger systems, want to mark processes, groups of >> processes or entire containers into CPU scheduling classes, then >> either fair balance between them, limit them or reserve them a >> portion of the CPU - depending on the user and what their >> requirements are. What is unclear about that? >> > > It is unclear whether we should have hard limits, or just nice like > priority levels. Whether virtualisation (+/- containers) could be a > good solution, etc. Look, that was actually answered in the paragraph you're responding to. Once again, give me a set of possible requirements and I'll find you a set of users that have them. I am finding this sub-thread quite redundant. > If you want to *completely* isolate N groups of users, surely you > have to use virtualisation, unless you are willing to isolate memory > management, pagecache, slab caches, network and disk IO, etc. No, you have to use separate hardware. Try to claim otherwise and you're glossing over the corner cases. Sam. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 6:40 ` Sam Vilain @ 2006-06-18 7:17 ` Nick Piggin 0 siblings, 0 replies; 36+ messages in thread From: Nick Piggin @ 2006-06-18 7:17 UTC (permalink / raw) To: Sam Vilain Cc: vatsa, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, maeda.naoaki, kurosawa Sam Vilain wrote: > Nick Piggin wrote: > >>> The answer is quite simple, people who are consolidating systems and >>> working with fewer, larger systems, want to mark processes, groups of >>> processes or entire containers into CPU scheduling classes, then >>> either fair balance between them, limit them or reserve them a >>> portion of the CPU - depending on the user and what their >>> requirements are. What is unclear about that? >>> >> >> It is unclear whether we should have hard limits, or just nice like >> priority levels. Whether virtualisation (+/- containers) could be a >> good solution, etc. > > > Look, that was actually answered in the paragraph you're responding to. > Once again, give me a set of possible requirements and I'll find you a > set of users that have them. I am finding this sub-thread quite redundant. Clearly we can't stuff everything into the kernel. What I'm asking is what the important functionality is that people want to cover. I don't know how you could possibly interpret it as anything else. > >> If you want to *completely* isolate N groups of users, surely you >> have to use virtualisation, unless you are willing to isolate memory >> management, pagecache, slab caches, network and disk IO, etc. > > > No, you have to use separate hardware. Try to claim otherwise and you're > glossing over the corner cases. Well, virtualisation seems like it would get you a lot further than containers for the same amount of work. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 6:11 ` Nick Piggin 2006-06-18 6:40 ` Sam Vilain @ 2006-06-18 6:42 ` Andrew Morton 2006-06-18 7:28 ` Nick Piggin 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith 1 sibling, 2 replies; 36+ messages in thread From: Andrew Morton @ 2006-06-18 6:42 UTC (permalink / raw) To: Nick Piggin Cc: sam, vatsa, dev, efault, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa On Sun, 18 Jun 2006 16:11:18 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > If you want to *completely* isolate N groups of users, surely you > have to use virtualisation, I'd view this as a kludge. If one group of tasks is trashing the performance of another group of tasks the user is forced to use hardware virtualisation to work around it. I mean, is this our answer to the updatedb problem? Instantiate a separate copy of the kernel just to run updatedb? > unless you are willing to isolate memory > management, pagecache, slab caches, network and disk IO, etc. Well yes. Ideally and ultimately. People have done this, and it's in production. We need to see (and work upon) the patches before we can judge whether we want to do this, and how far we want to go. > Again, I don't care about the solutions at this stage. I want to know > what the problem is. Please? Isolation. To prevent one group of processes from damaging the performance of other groups, by providing manageability of the resource consumption of each group. There are plenty of applications of this, not just server-consolidation-via-server-virtualisation. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 6:42 ` Andrew Morton @ 2006-06-18 7:28 ` Nick Piggin 2006-06-19 19:03 ` Resource Management Requirements (was "[RFC] CPU controllers?") Chandra Seetharaman 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith 1 sibling, 1 reply; 36+ messages in thread From: Nick Piggin @ 2006-06-18 7:28 UTC (permalink / raw) To: Andrew Morton Cc: sam, vatsa, dev, efault, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa Andrew Morton wrote: > On Sun, 18 Jun 2006 16:11:18 +1000 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>If you want to *completely* isolate N groups of users, surely you >>have to use virtualisation, > > > I'd view this as a kludge. If one group of tasks is trashing the > performance of another group of tasks the user is forced to use hardware > virtualisation to work around it. > > I mean, is this our answer to the updatedb problem? Instantiate a separate > copy of the kernel just to run updatedb? Well even before that, I'd view the fact that working around the VM's poor behaviour by putting updatedb into a container or memory control as a kludge anyway. CPU and IO control (ie. nice & ioprio) is reasonable. updatedb is pretty simple and the VM should easily be able to recognise its use-once nature. However I don't doubt that people would like to be able to manage memory better. Whether that is best served by having resource control heirarchies or virtualisation or something else completely is still on the table IMO. > > >>unless you are willing to isolate memory >>management, pagecache, slab caches, network and disk IO, etc. > > > Well yes. Ideally and ultimately. People have done this, and it's in > production. We need to see (and work upon) the patches before we can judge > whether we want to do this, and how far we want to go. Definitely. > > >>Again, I don't care about the solutions at this stage. I want to know >>what the problem is. Please? > > > Isolation. 
To prevent one group of processes from damaging the performance > of other groups, by providing manageability of the resource consumption of > each group. There are plenty of applications of this, not just > server-consolidation-via-server-virtualisation. OK... let me put it more clearly. What are the requirements? I don't like that apparently virtualisation can't be discussed in a general thread about resource control. Nothing is going to be a 100% solution for everybody. If, for a *specific* application, virtualisation can be discounted... then great, that is the kind of discussion I would like to see. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Resource Management Requirements (was "[RFC] CPU controllers?") 2006-06-18 7:28 ` Nick Piggin @ 2006-06-19 19:03 ` Chandra Seetharaman 2006-06-20 5:40 ` Srivatsa Vaddagiri 0 siblings, 1 reply; 36+ messages in thread From: Chandra Seetharaman @ 2006-06-19 19:03 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, sam, vatsa, dev, efault, mingo, pwil3058, balbir, linux-kernel, maeda.naoaki, kurosawa, ckrm-tech

On Sun, 2006-06-18 at 17:28 +1000, Nick Piggin wrote:

> OK... let me put it more clearly. What are the requirements?

Nick,

Here are some requirements we (Resource Groups, aka CKRM) are working towards (note that this is not limited to CPU alone):

In an enterprise environment:
- Ability to group applications by their importance levels and assign an appropriate amount of resources to each.
- In case of server consolidation, ability to allocate and control resources for a specific group of applications, and to account/charge according to their usage.
- Ability to manage multiple departments in a single OS instance, allocating and controlling resources department-wise (similar to the above requirement :)
- Ability to guarantee "time to complete" for a specific user request (by controlling resource usage all the way from the web server to the database server).
- In case of ISPs and ASPs, ability to guarantee/limit usage for independent clients (in a single OS instance).
- Ability to keep runaway processes from bringing down system response (DoS attacks, fork bombs, etc.)

In a university environment (can be treated as a subset of the enterprise requirements above):
- Ability to limit resource consumption at the individual user level.
- Ability to control runaway processes.
- Ability for a user to manage the resources allocated to them (as explained in the desktop environment below).

In a desktop environment:
- Ability to control the resource usage of a set of applications (ex: the infamous updatedb issue).
- Ability to run different loads and get the expected result (like checking email or browsing the Internet while a compilation is in progress).

Generic:
Provide these resource management capabilities with low overhead on overall system performance.

regards,

chandra
--
----------------------------------------------------------------------
Chandra Seetharaman   | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Resource Management Requirements (was "[RFC] CPU controllers?") 2006-06-19 19:03 ` Resource Management Requirements (was "[RFC] CPU controllers?") Chandra Seetharaman @ 2006-06-20 5:40 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 36+ messages in thread From: Srivatsa Vaddagiri @ 2006-06-20 5:40 UTC (permalink / raw) To: Chandra Seetharaman Cc: Nick Piggin, Andrew Morton, sam, dev, efault, mingo, pwil3058, balbir, linux-kernel, maeda.naoaki, kurosawa, ckrm-tech

On Mon, Jun 19, 2006 at 12:03:23PM -0700, Chandra Seetharaman wrote:
> On Sun, 2006-06-18 at 17:28 +1000, Nick Piggin wrote:
>
> > OK... let me put it more clearly. What are the requirements?

At a broad level, all the requirements Chandra lists below boil down to providing guaranteed CPU usage for one group of tasks and the ability to limit (hard or soft) the CPU usage of other groups of tasks. At a finer level, this broad requirement could be interpreted and implemented in a number of ways (ex: by having the kernel support only task-level limits and implementing group-level control in user-space), and that's what this RFC was about - to discuss what minimal kernel support would be needed to meet the above broad requirement!

> Nick,
>
> Here are some requirements we (Resource Groups, aka CKRM) are working towards (note that this is not limited to CPU alone):
>
> In an enterprise environment:
> - Ability to group applications by their importance levels and assign an appropriate amount of resources to each.
> - In case of server consolidation, ability to allocate and control resources for a specific group of applications, and to account/charge according to their usage.
> - Ability to manage multiple departments in a single OS instance, allocating and controlling resources department-wise (similar to the above requirement :)
> - Ability to guarantee "time to complete" for a specific user request (by controlling resource usage all the way from the web server to the database server).
> - In case of ISPs and ASPs, ability to guarantee/limit usage for independent clients (in a single OS instance).
> - Ability to keep runaway processes from bringing down system response (DoS attacks, fork bombs, etc.)
>
> In a university environment (can be treated as a subset of the enterprise requirements above):
> - Ability to limit resource consumption at the individual user level.
> - Ability to control runaway processes.
> - Ability for a user to manage the resources allocated to them (as explained in the desktop environment below).
>
> In a desktop environment:
> - Ability to control the resource usage of a set of applications (ex: the infamous updatedb issue).
> - Ability to run different loads and get the expected result (like checking email or browsing the Internet while a compilation is in progress).
>
> Generic:
> Provide these resource management capabilities with low overhead on overall system performance.

--
Regards,
vatsa
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 6:42 ` Andrew Morton @ 2006-06-18 7:36 ` Mike Galbraith 2006-06-18 7:49 ` Nick Piggin ` (3 more replies) 1 sibling, 4 replies; 36+ messages in thread From: Mike Galbraith @ 2006-06-18 7:36 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa On Sat, 2006-06-17 at 23:42 -0700, Andrew Morton wrote: > On Sun, 18 Jun 2006 16:11:18 +1000 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Again, I don't care about the solutions at this stage. I want to know > > what the problem is. Please? > > Isolation. To prevent one group of processes from damaging the performance > of other groups, by providing manageability of the resource consumption of > each group. There are plenty of applications of this, not just > server-consolidation-via-server-virtualisation. Scheduling contexts do sound useful. They're easily defeated though, as evolution mail demonstrates to me every time its GUI hangs and I see that a nice 19 find is running, eating very little CPU, but effectively DoSing evolution nonetheless (journal). I wonder how often people who tried to distribute CPU would likewise be stymied by other resources. -Mike ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith @ 2006-06-18 7:49 ` Nick Piggin 2006-06-18 7:49 ` Nick Piggin ` (2 subsequent siblings) 3 siblings, 0 replies; 36+ messages in thread From: Nick Piggin @ 2006-06-18 7:49 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa Mike Galbraith wrote: > On Sat, 2006-06-17 at 23:42 -0700, Andrew Morton wrote: > >>On Sun, 18 Jun 2006 16:11:18 +1000 >>Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>>Again, I don't care about the solutions at this stage. I want to know >>>what the problem is. Please? >> >>Isolation. To prevent one group of processes from damaging the performance >>of other groups, by providing manageability of the resource consumption of >>each group. There are plenty of applications of this, not just >>server-consolidation-via-server-virtualisation. > > > Scheduling contexts do sound useful. They're easily defeated though, as > evolution mail demonstrates to me every time it's GUI hangs and I see > that a nice 19 find is running, eating very little CPU, but effectively > DoSing evolution nonetheless (journal). I wonder how often people who > tried to distribute CPU would likewise be stymied by other resources. Not entirely infrequently. Which is why it really doesn't seem like it could be useful from a security point of view without a *huge* amount of work and complexity... and even from a guaranteed-service point of view, it still seems (to me) like a pretty big and complex problem. As a check box for marketing it sounds pretty cool though, I admit ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith 2006-06-18 7:49 ` Nick Piggin @ 2006-06-18 7:49 ` Nick Piggin 2006-06-18 9:09 ` Andrew Morton 2006-06-19 18:21 ` Chris Friesen 3 siblings, 0 replies; 36+ messages in thread From: Nick Piggin @ 2006-06-18 7:49 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa Mike Galbraith wrote: > On Sat, 2006-06-17 at 23:42 -0700, Andrew Morton wrote: > >>On Sun, 18 Jun 2006 16:11:18 +1000 >>Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>>Again, I don't care about the solutions at this stage. I want to know >>>what the problem is. Please? >> >>Isolation. To prevent one group of processes from damaging the performance >>of other groups, by providing manageability of the resource consumption of >>each group. There are plenty of applications of this, not just >>server-consolidation-via-server-virtualisation. > > > Scheduling contexts do sound useful. They're easily defeated though, as > evolution mail demonstrates to me every time it's GUI hangs and I see > that a nice 19 find is running, eating very little CPU, but effectively > DoSing evolution nonetheless (journal). I wonder how often people who > tried to distribute CPU would likewise be stymied by other resources. Not entirely infrequently. Which is why it really doesn't seem like it could be useful from a security point of view without a *huge* amount of work and complexity... and even from a guaranteed-service point of view, it still seems (to me) like a pretty big and complex problem. As a check box for marketing it sounds pretty cool though, I admit ;) -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith 2006-06-18 7:49 ` Nick Piggin 2006-06-18 7:49 ` Nick Piggin @ 2006-06-18 9:09 ` Andrew Morton 2006-06-18 9:49 ` Mike Galbraith 2006-06-19 18:21 ` Chris Friesen 3 siblings, 1 reply; 36+ messages in thread From: Andrew Morton @ 2006-06-18 9:09 UTC (permalink / raw) To: Mike Galbraith Cc: nickpiggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa On Sun, 18 Jun 2006 09:36:16 +0200 Mike Galbraith <efault@gmx.de> wrote: > as > evolution mail demonstrates to me every time it's GUI hangs and I see > that a nice 19 find is running, eating very little CPU, but effectively > DoSing evolution nonetheless (journal). eh? That would be an io scheduler bug, wouldn't it? Tell us more. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 9:09 ` Andrew Morton @ 2006-06-18 9:49 ` Mike Galbraith 2006-06-19 6:28 ` Mike Galbraith 0 siblings, 1 reply; 36+ messages in thread From: Mike Galbraith @ 2006-06-18 9:49 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa

On Sun, 2006-06-18 at 02:09 -0700, Andrew Morton wrote:
> On Sun, 18 Jun 2006 09:36:16 +0200
> Mike Galbraith <efault@gmx.de> wrote:
>
> > as
> > evolution mail demonstrates to me every time its GUI hangs and I see
> > that a nice 19 find is running, eating very little CPU, but effectively
> > DoSing evolution nonetheless (journal).
>
> eh? That would be an io scheduler bug, wouldn't it?
>
> Tell us more.

The trace below was done with a nice -n 19 bonnie -s 2047 running, but the same happens with the find that SuSE starts at annoying times. Scheduler is cfq, but changing schedulers doesn't help. Place a shell window over the evolution window, start io, then click on the evolution window, and see how long it takes to be able to read mail. Here, it's a couple forevers.

evolution D 00000001 0 9324 6938 9333 7851 (NOTLB)
ef322dec 00000000 00000000 00000001 00000003 93a3f580 000f44c2 ef322000
ef322000 ef314030 93a3f580 000f44c2 ef322000 001fb058 ef24d980 ef322000
ef322e50 b10bcb57 00000000 b1399998 ef322e3c 00000001 ef24d9c0 ef24d9d0
Call Trace:
[<b10bcb57>] log_wait_commit+0x139/0x1f1
[<b10b6000>] journal_stop+0x239/0x350
[<b10b6dc8>] journal_force_commit+0x1d/0x1f
[<b10ae32a>] ext3_force_commit+0x24/0x26
[<b10a83a0>] ext3_write_inode+0x34/0x7b
[<b107fa79>] __writeback_single_inode+0x2e8/0x3c9
[<b10803f1>] sync_inode+0x15/0x2f
[<b10a426b>] ext3_sync_file+0xc3/0xc8
[<b10600fc>] do_fsync+0x68/0xb3
[<b1060167>] __do_fsync+0x20/0x2f
[<b1060195>] sys_fsync+0xd/0xf
[<b1002e1b>] syscall_call+0x7/0xb
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 9:49 ` Mike Galbraith @ 2006-06-19 6:28 ` Mike Galbraith 2006-06-19 6:35 ` Andrew Morton 0 siblings, 1 reply; 36+ messages in thread From: Mike Galbraith @ 2006-06-19 6:28 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa

This is kinda OT for this thread, but here's another example of where the IO can easily foil CPU distribution plans. I wonder how many folks get nailed by /proc being mounted without noatime,nodiratime like I just apparently did.

top D E29B4928 0 10174 8510 (NOTLB)
d2f63c4c 00100100 00200200 e29b4928 ea07f3c0 f1510c40 000f6e66 d2f63000
d2f63000 ed88c550 f1510c40 000f6e66 d2f63000 d2f63000 ed062220 ed88c550
d2f63c70 b139a97b ed062224 efef8df8 ed062224 ed88c550 d2f63000 0000385a
Call Trace:
[<b139a97b>] __mutex_lock_slowpath+0x59/0xb0
[<b139a9d7>] .text.lock.mutex+0x5/0x14
[<b10bb24f>] __log_wait_for_space+0x53/0xb4
[<b10b67b4>] start_this_handle+0x100/0x617
[<b10b6d86>] journal_start+0xbb/0xe0
[<b10ae10e>] ext3_journal_start_sb+0x29/0x4a
[<b10a8d9f>] ext3_dirty_inode+0x2a/0xaf
[<b1080171>] __mark_inode_dirty+0x2a/0x19e
[<b107784a>] touch_atime+0x79/0x9f
[<b103fda5>] do_generic_mapping_read+0x370/0x480
[<b1040747>] __generic_file_aio_read+0xf0/0x205
[<b1040896>] generic_file_aio_read+0x3a/0x46
[<b105d919>] do_sync_read+0xbb/0xf1
[<b105e2c1>] vfs_read+0xa4/0x166
[<b105e6c1>] sys_read+0x3d/0x64
[<b1002e1b>] syscall_call+0x7/0xb

netdaemon D EC84ED04 0 7696 1 7711 7695 (NOTLB)
efef8dec 00000000 efef8000 ec84ed04 efef8e00 402cecc0 000f6e68 efef8000
efef8000 ed617030 402cecc0 000f6e68 efef8000 efef8000 ed062220 ed617030
efef8e10 b139a97b ed062224 ed062224 d2f63c58 ed617030 efef8000 0000385a
Call Trace:
[<b139a97b>] __mutex_lock_slowpath+0x59/0xb0
[<b139a9d7>] .text.lock.mutex+0x5/0x14
[<b10bb24f>] __log_wait_for_space+0x53/0xb4
[<b10b67b4>] start_this_handle+0x100/0x617
[<b10b6d86>] journal_start+0xbb/0xe0
[<b10ae10e>] ext3_journal_start_sb+0x29/0x4a
[<b10a8d9f>] ext3_dirty_inode+0x2a/0xaf
[<b1080171>] __mark_inode_dirty+0x2a/0x19e
[<b107784a>] touch_atime+0x79/0x9f
[<b106fc08>] vfs_readdir+0x91/0x93
[<b106fc6a>] sys_getdents64+0x60/0xa7
[<b1002e1b>] syscall_call+0x7/0xb
^ permalink raw reply [flat|nested] 36+ messages in thread
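[Editorial note: the touch_atime() frames in both traces above show every read dirtying an inode and dragging in the ext3 journal; the noatime,nodiratime mount options Mike mentions are the usual mitigation. A sketch, with a hypothetical device and mount point:

```shell
# Illustrative /etc/fstab entry (device and mount point are placeholders):
# suppress access-time updates so pure readers (find, updatedb, bonnie)
# stop dirtying inodes and generating journal traffic on every read.
#
#   /dev/hda2  /home  ext3  defaults,noatime,nodiratime  0  2

# The same options can be applied to an already-mounted filesystem:
#   mount -o remount,noatime,nodiratime /home
```

This trades POSIX-accurate access times for quiet reads; software that relies on atime (some mail readers, tmpwatch-style cleaners) will misbehave under it.]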
* Re: [RFC] CPU controllers? 2006-06-19 6:28 ` Mike Galbraith @ 2006-06-19 6:35 ` Andrew Morton 2006-06-19 6:46 ` Mike Galbraith 0 siblings, 1 reply; 36+ messages in thread From: Andrew Morton @ 2006-06-19 6:35 UTC (permalink / raw) To: Mike Galbraith Cc: nickpiggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa On Mon, 19 Jun 2006 08:28:45 +0200 Mike Galbraith <efault@gmx.de> wrote: > This is kinda OT for this thread, but here's another example of where > the IO can easily foil CPU distribution plans. I wonder how many folks > get nailed by /proc being mounted without noatime,nodiratime like I just > apparently did. > > top D E29B4928 0 10174 8510 (NOTLB) > d2f63c4c 00100100 00200200 e29b4928 ea07f3c0 f1510c40 000f6e66 d2f63000 > d2f63000 ed88c550 f1510c40 000f6e66 d2f63000 d2f63000 ed062220 ed88c550 > d2f63c70 b139a97b ed062224 efef8df8 ed062224 ed88c550 d2f63000 0000385a > Call Trace: > [<b139a97b>] __mutex_lock_slowpath+0x59/0xb0 > [<b139a9d7>] .text.lock.mutex+0x5/0x14 > [<b10bb24f>] __log_wait_for_space+0x53/0xb4 > [<b10b67b4>] start_this_handle+0x100/0x617 > [<b10b6d86>] journal_start+0xbb/0xe0 > [<b10ae10e>] ext3_journal_start_sb+0x29/0x4a > [<b10a8d9f>] ext3_dirty_inode+0x2a/0xaf > [<b1080171>] __mark_inode_dirty+0x2a/0x19e > [<b107784a>] touch_atime+0x79/0x9f > [<b103fda5>] do_generic_mapping_read+0x370/0x480 > [<b1040747>] __generic_file_aio_read+0xf0/0x205 > [<b1040896>] generic_file_aio_read+0x3a/0x46 > [<b105d919>] do_sync_read+0xbb/0xf1 > [<b105e2c1>] vfs_read+0xa4/0x166 > [<b105e6c1>] sys_read+0x3d/0x64 > [<b1002e1b>] syscall_call+0x7/0xb Confused. What has this to do with /proc? ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 6:35 ` Andrew Morton @ 2006-06-19 6:46 ` Mike Galbraith 0 siblings, 0 replies; 36+ messages in thread From: Mike Galbraith @ 2006-06-19 6:46 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa On Sun, 2006-06-18 at 23:35 -0700, Andrew Morton wrote: > On Mon, 19 Jun 2006 08:28:45 +0200 > Mike Galbraith <efault@gmx.de> wrote: > > > This is kinda OT for this thread, but here's another example of where > > the IO can easily foil CPU distribution plans. I wonder how many folks > > get nailed by /proc being mounted without noatime,nodiratime like I just > > apparently did. > > > > top D E29B4928 0 10174 8510 (NOTLB) > > d2f63c4c 00100100 00200200 e29b4928 ea07f3c0 f1510c40 000f6e66 d2f63000 > > d2f63000 ed88c550 f1510c40 000f6e66 d2f63000 d2f63000 ed062220 ed88c550 > > d2f63c70 b139a97b ed062224 efef8df8 ed062224 ed88c550 d2f63000 0000385a > > Call Trace: > > [<b139a97b>] __mutex_lock_slowpath+0x59/0xb0 > > [<b139a9d7>] .text.lock.mutex+0x5/0x14 > > [<b10bb24f>] __log_wait_for_space+0x53/0xb4 > > [<b10b67b4>] start_this_handle+0x100/0x617 > > [<b10b6d86>] journal_start+0xbb/0xe0 > > [<b10ae10e>] ext3_journal_start_sb+0x29/0x4a > > [<b10a8d9f>] ext3_dirty_inode+0x2a/0xaf > > [<b1080171>] __mark_inode_dirty+0x2a/0x19e > > [<b107784a>] touch_atime+0x79/0x9f > > [<b103fda5>] do_generic_mapping_read+0x370/0x480 > > [<b1040747>] __generic_file_aio_read+0xf0/0x205 > > [<b1040896>] generic_file_aio_read+0x3a/0x46 > > [<b105d919>] do_sync_read+0xbb/0xf1 > > [<b105e2c1>] vfs_read+0xa4/0x166 > > [<b105e6c1>] sys_read+0x3d/0x64 > > [<b1002e1b>] syscall_call+0x7/0xb > > Confused. What has this to do with /proc? /me assumed... with usual result. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith ` (2 preceding siblings ...) 2006-06-18 9:09 ` Andrew Morton @ 2006-06-19 18:21 ` Chris Friesen 2006-06-20 6:20 ` Mike Galbraith 3 siblings, 1 reply; 36+ messages in thread From: Chris Friesen @ 2006-06-19 18:21 UTC (permalink / raw) To: Mike Galbraith Cc: Andrew Morton, Nick Piggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa Mike Galbraith wrote: > Scheduling contexts do sound useful. They're easily defeated though, as > evolution mail demonstrates to me every time it's GUI hangs and I see > that a nice 19 find is running, eating very little CPU, but effectively > DoSing evolution nonetheless (journal). I wonder how often people who > tried to distribute CPU would likewise be stymied by other resources. We do a lot with diskless blades. Basically cpu(s), memory, and network ports. For this case, cpu, memory, and network controllers are sufficient. Even just cpu gets you a long way, since mostly we're not IO-intensive and we generally have a pretty good idea of memory consumption. Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 18:21 ` Chris Friesen @ 2006-06-20 6:20 ` Mike Galbraith 0 siblings, 0 replies; 36+ messages in thread From: Mike Galbraith @ 2006-06-20 6:20 UTC (permalink / raw) To: Chris Friesen Cc: Andrew Morton, Nick Piggin, sam, vatsa, dev, mingo, pwil3058, sekharan, balbir, linux-kernel, maeda.naoaki, kurosawa On Mon, 2006-06-19 at 12:21 -0600, Chris Friesen wrote: > Mike Galbraith wrote: > > > Scheduling contexts do sound useful. They're easily defeated though, as > > evolution mail demonstrates to me every time it's GUI hangs and I see > > that a nice 19 find is running, eating very little CPU, but effectively > > DoSing evolution nonetheless (journal). I wonder how often people who > > tried to distribute CPU would likewise be stymied by other resources. > > We do a lot with diskless blades. Basically cpu(s), memory, and network > ports. > > For this case, cpu, memory, and network controllers are sufficient. > Even just cpu gets you a long way, since mostly we're not IO-intensive > and we generally have a pretty good idea of memory consumption. Sure. Some conflicts can be avoided with foreknowledge, and those conflicts that do occur don't necessarily make limits worthless or unmanageable. Nonetheless, I can imagine them becoming problematic. -Mike ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 5:53 ` Sam Vilain 2006-06-18 6:11 ` Nick Piggin @ 2006-06-18 7:18 ` Srivatsa Vaddagiri 2006-06-19 2:07 ` Sam Vilain 1 sibling, 1 reply; 36+ messages in thread From: Srivatsa Vaddagiri @ 2006-06-18 7:18 UTC (permalink / raw) To: Sam Vilain Cc: Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, maeda.naoaki, kurosawa On Sun, Jun 18, 2006 at 05:53:42PM +1200, Sam Vilain wrote: > Bear in mind that we have on the table at least one group of scheduling > solutions (timeslice scaling based ones, such as the VServer one) which > is virtually no overhead and could potentially provide the "jumpers" > necessary for implementing more complex scheduling policies. Sam, Do you have any plans to post the vserver CPU control implementation hooked against maybe Resource Groups (for grouping tasks)? Seeing several different implementation against current kernel may perhaps help maintainers decide what they like and what they don't? -- Regards, vatsa ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-18 7:18 ` Srivatsa Vaddagiri @ 2006-06-19 2:07 ` Sam Vilain 2006-06-19 7:04 ` MAEDA Naoaki 0 siblings, 1 reply; 36+ messages in thread From: Sam Vilain @ 2006-06-19 2:07 UTC (permalink / raw) To: vatsa Cc: Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, maeda.naoaki, kurosawa, ckrm-tech Srivatsa Vaddagiri wrote: > On Sun, Jun 18, 2006 at 05:53:42PM +1200, Sam Vilain wrote: > >> Bear in mind that we have on the table at least one group of scheduling >> solutions (timeslice scaling based ones, such as the VServer one) which >> is virtually no overhead and could potentially provide the "jumpers" >> necessary for implementing more complex scheduling policies. >> > Do you have any plans to post the vserver CPU control > implementation hooked up against, say, Resource Groups (for grouping > tasks)? Seeing several different implementations against the current > kernel may help maintainers decide what they like and what they > don't? That sounds like a good idea, I like the Resource Groups concept in general and it would be good to be able to fit this into a more generic and comprehensive framework. I'll try it against Chandra and Maeda's Apr 27 submission (a shame I missed it the first time around), and see how far I get. [goes away a bit] ok, so basically the bit in cpu_rc_load() where for_each_cpu_mask() is called, in Maeda Naoaki's patch "CPU controller - Add class load estimation support", is where O(N) creeps in that could be remedied with a token bucket algorithm. You don't want this because if you have 10,000 processes on a system in two resource groups, the aggregate performance will suffer due to the large number of cacheline misses during the 5,000 size loop that runs every resched. 
To apply the token bucket here, you would first change the per-CPU struct cpu_rc to have the TBF fields; minimally:

    int tokens;              /* current number of CPU tokens */

    int fill_rate[2];        /* Fill rate: add X tokens... */
    int interval[2];         /* Divisor: per Y jiffies */
    int tokens_max;          /* Limit: no more than N tokens */

    unsigned long last_time; /* last time accounted */

(note: the VServer implementation has several other fields for various reasons; the above are the important ones). Then, in cpu_rc_record_allocation(), you'd take the length of the slice out of the bucket (subtract from tokens). In cpu_rc_account(), you would then "refund" unused CPU tokens back. The approach in Linux-VServer is to remove tokens every scheduler_tick(), but perhaps there are advantages to doing it the way you are in the CPU controller Resource Groups patch. That part should obviate the need for cpu_rc_load() altogether. Then, in cpu_rc_scale_timeslice(), you would make it add a bonus depending on (tokens / tokens_max); I found a quadratic back-off, scaling 0% full to a +15 penalty, 100% full to a -5 bonus and 50% full to no bonus, worked well - in my simple purely CPU-bound process tests using tight-loop processes. Note that when the bucket reaches 0, there is a choice to keep allocating short timeslices anyway, under the presumption that the system has CPU to burn (sched_soft), or to put all processes in that RC on hold (sched_hard). This could potentially be controlled by flags on the bucket - as well as the size of the boost. 
Hence, the "jumpers" I refer to are the bucket parameters - for instance, if you set the tokens_max to ~HZ, and have a suitably high priority/RT task monitoring the buckets, then that process should be able to:

- get a complete record of how many tokens were used by a RC since it last checked,
- influence subsequent scheduling priority of the RC, by adjusting the fill rate, current tokens value, the size of the boost, or the "sched_hard" flag

...and it could probably do that with very occasional timeslices, such as one slice per N*HZ (where N ~ the number of resource groups). So that makes it a candidate for moving to userland. The current VServer implementation fails to schedule fairly when the CPU allocations do not add up correctly; if you only allocated 25% of CPU to one vserver, then 40% to another, and they are both busy, they might end up both with empty buckets and an equal +15 penalty - effectively using 50/50 CPU and allocating very short timeslices, yielding poor batch performance. So, with (possibly userland) policy monitoring for this sort of condition and adjusting bucket sizes and levels appropriately, that old "problem" that leads people to conclude that the VServer scheduler does not work could be solved - all without incurring major overhead even on very busy systems. I think that the characteristics of these two approaches are subtly different. Both scale timeslices, but in a different way - instead of estimating the load and scaling back timeslices up front, busy Resource Groups are relied on to deplete their tokens in a timely manner, and get shorter slices allocated because of that. No doubt from 10,000 feet they both look the same. There is probably enough information here for an implementation, but I'll wait for feedback on this post before going any further with it. Sam. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 2:07 ` Sam Vilain @ 2006-06-19 7:04 ` MAEDA Naoaki 2006-06-19 8:19 ` Sam Vilain 0 siblings, 1 reply; 36+ messages in thread From: MAEDA Naoaki @ 2006-06-19 7:04 UTC (permalink / raw) To: Sam Vilain Cc: vatsa, Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, kurosawa, ckrm-tech, MAEDA Naoaki Sam Vilain wrote: > Srivatsa Vaddagiri wrote: >> On Sun, Jun 18, 2006 at 05:53:42PM +1200, Sam Vilain wrote: >> >>> Bear in mind that we have on the table at least one group of scheduling >>> solutions (timeslice scaling based ones, such as the VServer one) which >>> is virtually no overhead and could potentially provide the "jumpers" >>> necessary for implementing more complex scheduling policies. >>> >> Do you have any plans to post the vserver CPU control >> implementation hooked up against, say, Resource Groups (for grouping >> tasks)? Seeing several different implementations against the current >> kernel may help maintainers decide what they like and what they >> don't? > > That sounds like a good idea, I like the Resource Groups concept in > general and it would be good to be able to fit this into a more generic > and comprehensive framework. That sounds nice. > I'll try it against Chandra and Maeda's Apr 27 submission (a shame I > missed it the first time around), and see how far I get. > > [goes away a bit] > > ok, so basically the bit in cpu_rc_load() where for_each_cpu_mask() is > called, in Maeda Naoaki's patch "CPU controller - Add class load > estimation support", is where O(N) creeps in that could be remedied with > a token bucket algorithm. You don't want this because if you have 10,000 > processes on a system in two resource groups, the aggregate performance > will suffer due to the large number of cacheline misses during the 5,000 > size loop that runs every resched. Thank you for looking at the code. 
cpu_rc_load() is never called unless sysadm tries to access the load information via configfs from userland. In addition, it sums up per-CPU group stats, so the size of the loop is the number of CPUs, not the number of processes in the group. However, there is a similar loop in cpu_rc_recalc_tsfactor(), which runs every CPU_RC_RECALC_INTERVAL (defined as HZ). I don't think it will cause a big performance penalty.

> To apply the token bucket here, you would first change the per-CPU > struct cpu_rc to have the TBF fields; minimally:
>
>     int tokens;              /* current number of CPU tokens */
>
>     int fill_rate[2];        /* Fill rate: add X tokens... */
>     int interval[2];         /* Divisor: per Y jiffies */
>     int tokens_max;          /* Limit: no more than N tokens */
>
>     unsigned long last_time; /* last time accounted */
>
> (note: the VServer implementation has several other fields for various > reasons; the above are the important ones). > > Then, in cpu_rc_record_allocation(), you'd take the length of the slice > out of the bucket (subtract from tokens). In cpu_rc_account(), you would > then "refund" unused CPU tokens back. The approach in Linux-VServer is > to remove tokens every scheduler_tick(), but perhaps there are > advantages to doing it the way you are in the CPU controller Resource > Groups patch. > > That part should obviate the need for cpu_rc_load() altogether. > > Then, in cpu_rc_scale_timeslice(), you would make it add a bonus > depending on (tokens / tokens_max); I found a quadratic back-off, > scaling 0% full to a +15 penalty, 100% full to a -5 bonus and 50% full > to no bonus, worked well - in my simple purely CPU-bound process tests > using tight-loop processes. > > Note that when the bucket reaches 0, there is a choice to keep > allocating short timeslices anyway, under the presumption that the > system has CPU to burn (sched_soft), or to put all processes in that RC > on hold (sched_hard). This could potentially be controlled by flags on > the bucket - as well as the size of the boost. 
> > Hence, the "jumpers" I refer to are the bucket parameters - for > instance, if you set the tokens_max to ~HZ, and have a suitably high > priority/RT task monitoring the buckets, then that process should be > able to: > > - get a complete record of how many tokens were used by a RC since it > last checked, > - influence subsequent scheduling priority of the RC, by adjusting the > fill rate, current tokens value, the size of the boost, or the > "sched_hard" flag > > ...and it could probably do that with very occasional timeslices, such > as one slice per N*HZ (where N ~ the number of resource groups). So that > makes it a candidate for moving to userland. > > The current VServer implementation fails to schedule fairly when the CPU > allocations do not add up correctly; if you only allocated 25% of CPU to > one vserver, then 40% to another, and they are both busy, they might end > up both with empty buckets and an equal +15 penalty - effectively using > 50/50 CPU and allocating very short timeslices, yielding poor batch > performance. > > So, with (possibly userland) policy monitoring for this sort of > condition and adjusting bucket sizes and levels appropriately, that old > "problem" that leads people to conclude that the VServer scheduler does > not work could be solved - all without incurring major overhead even on > very busy systems. > > I think that the characteristics of these two approaches are subtly > different. Both scale timeslices, but in a different way - instead of > estimating the load and scaling back timeslices up front, busy Resource > Groups are relied on to deplete their tokens in a timely manner, and get > shorter slices allocated because of that. No doubt from 10,000 feet they > both look the same. The current O(1) scheduler gives an extra bonus to interactive tasks by requeuing them to the active array for a while. It would break the controller's efforts. So, I'm planning to stop the interactive task requeuing if the target share isn't met. 
Is there a similar issue on the vserver scheduler? > There is probably enough information here for an implementation, but > I'll wait for feedback on this post before going any further with it. > > Sam. Thanks, MAEDA Naoaki ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 7:04 ` MAEDA Naoaki @ 2006-06-19 8:19 ` Sam Vilain 2006-06-19 8:41 ` MAEDA Naoaki 0 siblings, 1 reply; 36+ messages in thread From: Sam Vilain @ 2006-06-19 8:19 UTC (permalink / raw) To: MAEDA Naoaki Cc: vatsa, Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, kurosawa, ckrm-tech MAEDA Naoaki wrote: >> ok, so basically the bit in cpu_rc_load() where for_each_cpu_mask() is >> called, in Maeda Naoaki's patch "CPU controller - Add class load >> estimation support", is where O(N) creeps in that could be remedied with >> a token bucket algorithm. You don't want this because if you have 10,000 >> processes on a system in two resource groups, the aggregate performance >> will suffer due to the large number of cacheline misses during the 5,000 >> size loop that runs every resched. >> > > Thank you for looking at the code. > > cpu_rc_load() is never called unless sysadm tries to access the load > information via configfs from userland. In addition, it sums up per-CPU > group stats, so the size of the loop is the number of CPUs, not the > number of processes in the group. > > However, there is a similar loop in cpu_rc_recalc_tsfactor(), which runs > every CPU_RC_RECALC_INTERVAL (defined as HZ). I don't think it > will cause a big performance penalty. > Ok, so that's not as bad as it looked. So, while it is still O(N), the fact that it is O(N/HZ) makes this not a problem until you get to possibly impractical levels of runqueue length. I'm thinking it's probably worth doing anyway, just so that it can be performance-tested to see if this performance guesstimate is accurate. >> To apply the token bucket here, you would first change the per-CPU >> struct cpu_rc to have the TBF fields; minimally: >> >> [...] >> I think that the characteristics of these two approaches are subtly >> different. 
Both scale timeslices, but in a different way - instead of >> estimating the load and scaling back timeslices up front, busy Resource >> Groups are relied on to deplete their tokens in a timely manner, and get >> shorter slices allocated because of that. No doubt from 10,000 feet they >> both look the same. >> > > The current O(1) scheduler gives an extra bonus to interactive tasks by > requeuing them to the active array for a while. It would break > the controller's efforts. So, I'm planning to stop the interactive > task requeuing if the target share isn't met. > > Is there a similar issue on the vserver scheduler? > Not an issue - those extra requeued timeslices are accounted for normally. Sam. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 8:19 ` Sam Vilain @ 2006-06-19 8:41 ` MAEDA Naoaki 2006-06-19 8:53 ` Sam Vilain 0 siblings, 1 reply; 36+ messages in thread From: MAEDA Naoaki @ 2006-06-19 8:41 UTC (permalink / raw) To: Sam Vilain Cc: vatsa, Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, kurosawa, ckrm-tech, MAEDA Naoaki Sam Vilain wrote: > MAEDA Naoaki wrote: >>> ok, so basically the bit in cpu_rc_load() where for_each_cpu_mask() is >>> called, in Maeda Naoaki's patch "CPU controller - Add class load >>> estimation support", is where O(N) creeps in that could be remedied with >>> a token bucket algorithm. You don't want this because if you have 10,000 >>> processes on a system in two resource groups, the aggregate performance >>> will suffer due to the large number of cacheline misses during the 5,000 >>> size loop that runs every resched. >>> >> Thank you for looking at the code. >> >> cpu_rc_load() is never called unless sysadm tries to access the load >> information via configfs from userland. In addition, it sums up per-CPU >> group stats, so the size of the loop is the number of CPUs, not the >> number of processes in the group. >> >> However, there is a similar loop in cpu_rc_recalc_tsfactor(), which runs >> every CPU_RC_RECALC_INTERVAL (defined as HZ). I don't think it >> will cause a big performance penalty. >> > > Ok, so that's not as bad as it looked. So, while it is still O(N), the > fact that it is O(N/HZ) makes this not a problem until you get to > possibly impractical levels of runqueue length. Do you mean N is the size of the loop? for_each_cpu_mask() loops once per CPU. It is not directly related to runqueue length. > I'm thinking it's probably worth doing anyway, just so that it can be > performance-tested to see if this performance guesstimate is accurate. 
> >>> To apply the token bucket here, you would first change the per-CPU >>> struct cpu_rc to have the TBF fields; minimally: >>> >>> [...] >>> I think that the characteristics of these two approaches are subtly >>> different. Both scale timeslices, but in a different way - instead of >>> estimating the load and scaling back timeslices up front, busy Resource >>> Groups are relied on to deplete their tokens in a timely manner, and get >>> shorter slices allocated because of that. No doubt from 10,000 feet they >>> both look the same. >>> >> The current O(1) scheduler gives an extra bonus to interactive tasks by >> requeuing them to the active array for a while. It would break >> the controller's efforts. So, I'm planning to stop the interactive >> task requeuing if the target share isn't met. >> >> Is there a similar issue on the vserver scheduler? >> > > Not an issue - those extra requeued timeslices are accounted for normally. It's great. Thanks, MAEDA Naoaki ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 8:41 ` MAEDA Naoaki @ 2006-06-19 8:53 ` Sam Vilain 2006-06-19 21:44 ` MAEDA Naoaki 0 siblings, 1 reply; 36+ messages in thread From: Sam Vilain @ 2006-06-19 8:53 UTC (permalink / raw) To: MAEDA Naoaki Cc: vatsa, Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, kurosawa, ckrm-tech MAEDA Naoaki wrote: >> Ok, so that's not as bad as it looked. So, while it is still O(N), the >> fact that it is O(N/HZ) makes this not a problem until you get to >> possibly impractical levels of runqueue length. >> > > Do you mean N is the size of the loop? for_each_cpu_mask() loops > once per CPU. It is not directly related to runqueue length. > Ok, I mistook it for a per-task loop. Well, let me know if you think it's worth trying it out anyway. Sam. ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 8:53 ` Sam Vilain @ 2006-06-19 21:44 ` MAEDA Naoaki 0 siblings, 0 replies; 36+ messages in thread From: MAEDA Naoaki @ 2006-06-19 21:44 UTC (permalink / raw) To: Sam Vilain Cc: vatsa, Nick Piggin, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel, kurosawa, ckrm-tech, MAEDA Naoaki Sam Vilain wrote: > MAEDA Naoaki wrote: >>> Ok, so that's not as bad as it looked. So, while it is still O(N), the >>> fact that it is O(N/HZ) makes this not a problem until you get to >>> possibly impractical levels of runqueue length. >>> >> Do you mean N is the size of the loop? for_each_cpu_mask() loops >> once per CPU. It is not directly related to runqueue length. >> > > Ok, I mistook it for a per-task loop. > > Well, let me know if you think it's worth trying it out anyway. I don't think this loop would be a bottleneck, but testing by a third person is always valuable. Thanks, MAEDA Naoaki ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-17 8:48 ` Nick Piggin 2006-06-17 15:55 ` Balbir Singh 2006-06-17 16:48 ` Srivatsa Vaddagiri @ 2006-06-19 18:14 ` Chris Friesen 2006-06-19 19:11 ` Chandra Seetharaman 2 siblings, 1 reply; 36+ messages in thread From: Chris Friesen @ 2006-06-19 18:14 UTC (permalink / raw) To: Nick Piggin Cc: vatsa, Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, sekharan, Balbir Singh, linux-kernel Nick Piggin wrote: > So, from my POV, I would like to be convinced of the need for this first. > I would really love to be able to keep core kernel simple and fast even if > it means edge cases might need to use a slightly different solution. We currently use a heavily modified CKRM version "e". The "resource groups" (formerly known as CKRM) cpu controls express what we'd like to do, but they aren't nearly accurate enough. We don't make use of the limits, but we do use per-cpu guarantees, along with the hierarchy concept. Our engineering guys need to be able to make cpu guarantees for the various types of processes. "main server app gets 90%, these fault handling guys normally get 2% but should be able to burst to 100% for up to 100ms, that other group gets 5% in total, but a subset of them should get priority over the others, and this little guy here should only be guaranteed .5% but it should take priority over everything else on the system as long as it hasn't used all its allocation". Ideally they'd really like sub-percentage (.1% would be nice, but .5% is probably more realistic) accuracy over the divisions. This should be expressed per-cpu, and tasks should be migrated as necessary to maintain fairness. (I.e., a task belonging to a group with 50% on each cpu should be able to run essentially continuously, bouncing back and forth between cpus.) In our case, predictability/fairness comes first, then performance. 
If a method is accepted into mainline, it would be nice to have NPTL support it as a thread attribute so that different threads can be in different groups. Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 18:14 ` Chris Friesen @ 2006-06-19 19:11 ` Chandra Seetharaman 2006-06-19 20:28 ` Chris Friesen 0 siblings, 1 reply; 36+ messages in thread From: Chandra Seetharaman @ 2006-06-19 19:11 UTC (permalink / raw) To: Chris Friesen Cc: Nick Piggin, vatsa, Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, Balbir Singh, linux-kernel On Mon, 2006-06-19 at 12:14 -0600, Chris Friesen wrote: > Nick Piggin wrote: > > > So, from my POV, I would like to be convinced of the need for this first. > > I would really love to be able to keep core kernel simple and fast even if > > it means edge cases might need to use a slightly different solution. > > We currently use a heavily modified CKRM version "e". > > The "resource groups" (formerly known as CKRM) cpu controls express what > we'd like to do, but they aren't nearly accurate enough. We don't make > use of the limits, but we do use per-cpu guarantees, along with the > hierarchy concept. > > Our engineering guys need to be able to make cpu guarantees for the > various types of processes. "main server app gets 90%, these fault > handling guys normally get 2% but should be able to burst to 100% for up > to 100ms, that other group gets 5% in total, but a subset of them should > get priority over the others, and this little guy here should only be > guaranteed .5% but it should take priority over everything else on the > system as long as it hasn't used all its allocation". > > Ideally they'd really like sub-percentage (.1% would be nice, but .5% is > probably more realistic) accuracy over the divisions. This should be > expressed per-cpu, and tasks should be migrated as necessary to maintain > fairness. (I.e., a task belonging to a group with 50% on each cpu should > be able to run essentially continuously, bouncing back and forth between > cpus.) In our case, predictability/fairness comes first, then performance. 
> > If a method is accepted into mainline, it would be nice to have NPTL > support it as a thread attribute so that different threads can be in > different groups. > Chris, Resource Groups (CKRM) does allow threads to be in different Resource Groups (and since Resource Group assignment is dynamic, a thread can move to a high-priority resource group for a specific operation and get back to its original resource group after the operation is complete). Just wondering if that is sufficient or you _would_ need support from NPTL. chandra > Chris -- ---------------------------------------------------------------------- Chandra Seetharaman | Be careful what you choose.... - sekharan@us.ibm.com | .......you may get it. ---------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [RFC] CPU controllers? 2006-06-19 19:11 ` Chandra Seetharaman @ 2006-06-19 20:28 ` Chris Friesen 0 siblings, 0 replies; 36+ messages in thread From: Chris Friesen @ 2006-06-19 20:28 UTC (permalink / raw) To: sekharan Cc: Nick Piggin, vatsa, Sam Vilain, Kirill Korotaev, Mike Galbraith, Ingo Molnar, Peter Williams, Andrew Morton, Balbir Singh, linux-kernel Chandra Seetharaman wrote: > Resource Groups (CKRM) does allow threads to be in different Resource > Groups (and since Resource Group assignment is dynamic, a thread can > move to a high-priority resource group for a specific operation and get > back to its original resource group after the operation is complete). > > Just wondering if that is sufficient or you _would_ need support from > NPTL. The main issue is that the mapping between pthread_t and PID is only known by NPTL. A thread can find its own PID (for purposes of resource groups) but it can't find the PID of other threads given their pthread_t. Essentially I'm looking for cpu-group equivalents to:

    pthread_setschedparam()
    pthread_getschedparam()
    pthread_attr_setschedpolicy()
    pthread_attr_getschedpolicy()

It's not absolutely critical, but we did add it to our current NPTL. Chris ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2006-06-20 6:17 UTC | newest] Thread overview: 36+ messages -- 2006-06-15 13:46 [RFC] CPU controllers? Srivatsa Vaddagiri 2006-06-15 21:52 ` Sam Vilain 2006-06-15 23:30 ` Peter Williams 2006-06-16 0:42 ` Matt Helsley 2006-06-17 8:48 ` Nick Piggin 2006-06-17 15:55 ` Balbir Singh 2006-06-17 16:48 ` Srivatsa Vaddagiri 2006-06-18 5:06 ` Nick Piggin 2006-06-18 5:53 ` Sam Vilain 2006-06-18 6:11 ` Nick Piggin 2006-06-18 6:40 ` Sam Vilain 2006-06-18 7:17 ` Nick Piggin 2006-06-18 6:42 ` Andrew Morton 2006-06-18 7:28 ` Nick Piggin 2006-06-19 19:03 ` Resource Management Requirements (was "[RFC] CPU controllers?") Chandra Seetharaman 2006-06-20 5:40 ` Srivatsa Vaddagiri 2006-06-18 7:36 ` [RFC] CPU controllers? Mike Galbraith 2006-06-18 7:49 ` Nick Piggin 2006-06-18 7:49 ` Nick Piggin 2006-06-18 9:09 ` Andrew Morton 2006-06-18 9:49 ` Mike Galbraith 2006-06-19 6:28 ` Mike Galbraith 2006-06-19 6:35 ` Andrew Morton 2006-06-19 6:46 ` Mike Galbraith 2006-06-19 18:21 ` Chris Friesen 2006-06-20 6:20 ` Mike Galbraith 2006-06-18 7:18 ` Srivatsa Vaddagiri 2006-06-19 2:07 ` Sam Vilain 2006-06-19 7:04 ` MAEDA Naoaki 2006-06-19 8:19 ` Sam Vilain 2006-06-19 8:41 ` MAEDA Naoaki 2006-06-19 8:53 ` Sam Vilain 2006-06-19 21:44 ` MAEDA Naoaki 2006-06-19 18:14 ` Chris Friesen 2006-06-19 19:11 ` Chandra Seetharaman 2006-06-19 20:28 ` Chris Friesen