Re: [RFC] CPU controllers?

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
To: Sam Vilain <sam@vilain.net>
Cc: vatsa@in.ibm.com, Nick Piggin <nickpiggin@yahoo.com.au>,
	Kirill Korotaev <dev@openvz.org>, Mike Galbraith <efault@gmx.de>,
	Ingo Molnar <mingo@elte.hu>,
	Peter Williams <pwil3058@bigpond.net.au>,
	Andrew Morton <akpm@osdl.org>,
	sekharan@us.ibm.com, Balbir Singh <balbir@in.ibm.com>,
	linux-kernel@vger.kernel.org, kurosawa@valinux.co.jp,
	ckrm-tech@lists.sourceforge.net,
	MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Subject: Re: [RFC] CPU controllers?
Date: Mon, 19 Jun 2006 16:04:41 +0900	[thread overview]
Message-ID: <44964C89.6060003@jp.fujitsu.com> (raw)
In-Reply-To: <449606F5.6050909@vilain.net>

Sam Vilain wrote:
> Srivatsa Vaddagiri wrote:
>> On Sun, Jun 18, 2006 at 05:53:42PM +1200, Sam Vilain wrote:
>>   
>>> Bear in mind that we have on the table at least one group of scheduling 
>>> solutions (timeslice scaling based ones, such as the VServer one) which 
>>> is virtually no overhead and could potentially provide the "jumpers" 
>>> necessary for implementing more complex scheduling policies.
>>>     
>> 	Do you have any plans to post the vserver CPU control
>> implementation hooked against maybe Resource Groups (for grouping
>> tasks)? Seeing several different implementation against current
>> kernel may perhaps help maintainers decide what they like and what they
>> don't?
> 
> That sounds like a good idea, I like the Resource Groups concept in
> general and it would be good to be able to fit this into a more generic
> and comprehensive framework.

That sounds nice.

> I'll try it against Chandra and Maeda's Apr 27 submission (a shame I
> missed it the first time around), and see how far I get.
> 
> [goes away a bit]
> 
> ok, so basically the bit in cpu_rc_load() where for_each_cpu_mask() is
> called, in Maeda Naoaki's patch "CPU controller - Add class load
> estimation support", is where O(N) creeps in that could be remedied with
> a token bucket algorithm. You don't want this because if you have 10,000
> processes on a system in two resource groups, the aggregate performance
> will suffer due to the large number of cacheline misses during the 5,000
> size loop that runs every resched.

Thank you for looking the code.

cpu_rc_load() is never called unless sysadm tries to access the load
information via configfs from userland. In addition, it sums up per-CPU
group stats, so the size of loop is the number of CPU, not process in
the group.

However, there is a similer loop in cpu_rc_recalc_tsfactor(), which runs
every CPU_RC_RECALC_INTERVAL that is defined as HZ. I don't think it
will cause a big performance penalty.

> To apply the token bucket here, you would first change the per-CPU
> struct cpu_rc to have the TBF fields; minimally:
> 
> int tokens; /* current number of CPU tokens */
> 
> int fill_rate[2]; /* Fill rate: add X tokens... */
> int interval[2]; /* Divisor: per Y jiffies */
> int tokens_max; /* Limit: no more than N tokens */
> 
> unsigned long last_time; /* last time accounted */
> 
> (note: the VServer implementation has several other fields for various
> reasons; the above are the important ones).
> 
> Then, in cpu_rc_record_allocation(), you'd take the length of the slice
> out of the bucket (subtract from tokens). In cpu_rc_account(), you would
> then "refund" unused CPU tokens back. The approach in Linux-VServer is
> to remove tokens every scheduler_tick(), but perhaps there are
> advantages to doing it the way you are in the CPU controller Resource
> Groups patch.
> 
> That part should obviate the need for cpu_rc_load() altogether.
> 
> Then, in cpu_rc_scale_timeslice(), you would make it add a bonus
> depending on (tokens / tokens_max); I found a quadratic back-off,
> scaling 0% full to a +15 penalty, 100% full to a -5 bonus and 50% full
> to no bonus, worked well - in my simple purely CPU bound process tests
> using tight loop processes.
> 
> Note that when the bucket reaches 0, there is a choice to keep
> allocating short timeslices anyway, under the presumption that the
> system has CPU to burn (sched_soft), or to put all processes in that RC
> on hold (sched_hard). This could potentially be controlled by flags on
> the bucket - as well as the size of the boost.
> 
> Hence, the "jumpers" I refer to are the bucket parameters - for
> instance, if you set the tokens_max to ~HZ, and have a suitably high
> priority/RT task monitoring the buckets, then that process should be
> able to;
> 
> - get a complete record of how many tokens were used by a RC since it
> last checked,
> - influence subsequent scheduling priority of the RC, by adjusting the
> fill rate, current tokens value, the size of the boost, or the
> "sched_hard" flag
> 
> ...and it could probably do that with very occasional timeslices, such
> as one slice per N*HZ (where N ~ the number of resource groups). So that
> makes it a candidate for moving to userland.
> 
> The current VServer implementation fails to schedule fairly when the CPU
> allocations do not add up correctly; if you only allocated 25% of CPU to
> one vserver, then 40% to another, and they are both busy, they might end
> up both with empty buckets and an equal +15 penalty - effectively using
> 50/50 CPU and allocating very short timeslices, yielding poor batch
> performance.
> 
> So, with (possibly userland) policy monitoring for this sort of
> condition and adjusting bucket sizes and levels appropriately, that old
> "problem" that leads people to conclude that the VServer scheduler does
> not work could be solved - all without incurring major overhead even on
> very busy systems.
> 
> I think that the characteristics of these two approaches are subtly
> different. Both scale timeslices, but in a different way - instead of
> estimating the load and scaling back timeslices up front, busy Resource
> Groups are relied on to deplete their tokens in a timely manner, and get
> shorter slices allocated because of that. No doubt from 10,000 feet they
> both look the same.

Current 0(1) scheduler gives extra bonus for interactive tasks by
requeuing them to active array for a while. It would break
the controller's efforts. So, I'm planning to stop the interactive
task requeuing if the target share doesn't meet.

Are there a similar issue on the vserver scheduler?

> There is probably enough information here for an implementation, but
> I'll wait for feedback on this post before going any further with it.
> 
> Sam.

Thanks,
MAEDA Naoaki

next prev parent reply	other threads:[~2006-06-19  7:05 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-06-15 13:46 [RFC] CPU controllers? Srivatsa Vaddagiri
2006-06-15 21:52 ` Sam Vilain
2006-06-15 23:30 ` Peter Williams
2006-06-16  0:42   ` Matt Helsley
2006-06-17  8:48 ` Nick Piggin
2006-06-17 15:55   ` Balbir Singh
2006-06-17 16:48   ` Srivatsa Vaddagiri
2006-06-18  5:06     ` Nick Piggin
2006-06-18  5:53       ` Sam Vilain
2006-06-18  6:11         ` Nick Piggin
2006-06-18  6:40           ` Sam Vilain
2006-06-18  7:17             ` Nick Piggin
2006-06-18  6:42           ` Andrew Morton
2006-06-18  7:28             ` Nick Piggin
2006-06-19 19:03               ` Resource Management Requirements (was "[RFC] CPU controllers?") Chandra Seetharaman
2006-06-20  5:40                 ` Srivatsa Vaddagiri
2006-06-18  7:36             ` [RFC] CPU controllers? Mike Galbraith
2006-06-18  7:49               ` Nick Piggin
2006-06-18  7:49               ` Nick Piggin
2006-06-18  9:09               ` Andrew Morton
2006-06-18  9:49                 ` Mike Galbraith
2006-06-19  6:28                   ` Mike Galbraith
2006-06-19  6:35                     ` Andrew Morton
2006-06-19  6:46                       ` Mike Galbraith
2006-06-19 18:21               ` Chris Friesen
2006-06-20  6:20                 ` Mike Galbraith
2006-06-18  7:18         ` Srivatsa Vaddagiri
2006-06-19  2:07           ` Sam Vilain
2006-06-19  7:04             ` MAEDA Naoaki [this message]
2006-06-19  8:19               ` Sam Vilain
2006-06-19  8:41                 ` MAEDA Naoaki
2006-06-19  8:53                   ` Sam Vilain
2006-06-19 21:44                     ` MAEDA Naoaki
2006-06-19 18:14   ` Chris Friesen
2006-06-19 19:11     ` Chandra Seetharaman
2006-06-19 20:28       ` Chris Friesen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=44964C89.6060003@jp.fujitsu.com \
    --to=maeda.naoaki@jp.fujitsu.com \
    --cc=akpm@osdl.org \
    --cc=balbir@in.ibm.com \
    --cc=ckrm-tech@lists.sourceforge.net \
    --cc=dev@openvz.org \
    --cc=efault@gmx.de \
    --cc=kurosawa@valinux.co.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=pwil3058@bigpond.net.au \
    --cc=sam@vilain.net \
    --cc=sekharan@us.ibm.com \
    --cc=vatsa@in.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox