Re: [RFC] CPU controllers?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
To: Sam Vilain <sam@vilain.net>
Cc: vatsa@in.ibm.com, Nick Piggin <nickpiggin@yahoo.com.au>,
	Kirill Korotaev <dev@openvz.org>, Mike Galbraith <efault@gmx.de>,
	Ingo Molnar <mingo@elte.hu>,
	Peter Williams <pwil3058@bigpond.net.au>,
	Andrew Morton <akpm@osdl.org>,
	sekharan@us.ibm.com, Balbir Singh <balbir@in.ibm.com>,
	linux-kernel@vger.kernel.org, kurosawa@valinux.co.jp,
	ckrm-tech@lists.sourceforge.net,
	MAEDA Naoaki <maeda.naoaki@jp.fujitsu.com>
Subject: Re: [RFC] CPU controllers?
Date: Mon, 19 Jun 2006 16:04:41 +0900	[thread overview]
Message-ID: <44964C89.6060003@jp.fujitsu.com> (raw)
In-Reply-To: <449606F5.6050909@vilain.net>

Sam Vilain wrote:
> Srivatsa Vaddagiri wrote:
>> On Sun, Jun 18, 2006 at 05:53:42PM +1200, Sam Vilain wrote:
>>   
>>> Bear in mind that we have on the table at least one group of scheduling 
>>> solutions (timeslice scaling based ones, such as the VServer one) which 
>>> is virtually no overhead and could potentially provide the "jumpers" 
>>> necessary for implementing more complex scheduling policies.
>>>     
>> 	Do you have any plans to post the vserver CPU control
>> implementation hooked against maybe Resource Groups (for grouping
>> tasks)? Seeing several different implementation against current
>> kernel may perhaps help maintainers decide what they like and what they
>> don't?
> 
> That sounds like a good idea, I like the Resource Groups concept in
> general and it would be good to be able to fit this into a more generic
> and comprehensive framework.

That sounds nice.

> I'll try it against Chandra and Maeda's Apr 27 submission (a shame I
> missed it the first time around), and see how far I get.
> 
> [goes away a bit]
> 
> ok, so basically the bit in cpu_rc_load() where for_each_cpu_mask() is
> called, in Maeda Naoaki's patch "CPU controller - Add class load
> estimation support", is where O(N) creeps in that could be remedied with
> a token bucket algorithm. You don't want this because if you have 10,000
> processes on a system in two resource groups, the aggregate performance
> will suffer due to the large number of cacheline misses during the 5,000
> size loop that runs every resched.

Thank you for looking the code.

cpu_rc_load() is never called unless sysadm tries to access the load
information via configfs from userland. In addition, it sums up per-CPU
group stats, so the size of loop is the number of CPU, not process in
the group.

However, there is a similer loop in cpu_rc_recalc_tsfactor(), which runs
every CPU_RC_RECALC_INTERVAL that is defined as HZ. I don't think it
will cause a big performance penalty.

> To apply the token bucket here, you would first change the per-CPU
> struct cpu_rc to have the TBF fields; minimally:
> 
> int tokens; /* current number of CPU tokens */
> 
> int fill_rate[2]; /* Fill rate: add X tokens... */
> int interval[2]; /* Divisor: per Y jiffies */
> int tokens_max; /* Limit: no more than N tokens */
> 
> unsigned long last_time; /* last time accounted */
> 
> (note: the VServer implementation has several other fields for various
> reasons; the above are the important ones).
> 
> Then, in cpu_rc_record_allocation(), you'd take the length of the slice
> out of the bucket (subtract from tokens). In cpu_rc_account(), you would
> then "refund" unused CPU tokens back. The approach in Linux-VServer is
> to remove tokens every scheduler_tick(), but perhaps there are
> advantages to doing it the way you are in the CPU controller Resource
> Groups patch.
> 
> That part should obviate the need for cpu_rc_load() altogether.
> 
> Then, in cpu_rc_scale_timeslice(), you would make it add a bonus
> depending on (tokens / tokens_max); I found a quadratic back-off,
> scaling 0% full to a +15 penalty, 100% full to a -5 bonus and 50% full
> to no bonus, worked well - in my simple purely CPU bound process tests
> using tight loop processes.
> 
> Note that when the bucket reaches 0, there is a choice to keep
> allocating short timeslices anyway, under the presumption that the
> system has CPU to burn (sched_soft), or to put all processes in that RC
> on hold (sched_hard). This could potentially be controlled by flags on
> the bucket - as well as the size of the boost.
> 
> Hence, the "jumpers" I refer to are the bucket parameters - for
> instance, if you set the tokens_max to ~HZ, and have a suitably high
> priority/RT task monitoring the buckets, then that process should be
> able to;
> 
> - get a complete record of how many tokens were used by a RC since it
> last checked,
> - influence subsequent scheduling priority of the RC, by adjusting the
> fill rate, current tokens value, the size of the boost, or the
> "sched_hard" flag
> 
> ...and it could probably do that with very occasional timeslices, such
> as one slice per N*HZ (where N ~ the number of resource groups). So that
> makes it a candidate for moving to userland.
> 
> The current VServer implementation fails to schedule fairly when the CPU
> allocations do not add up correctly; if you only allocated 25% of CPU to
> one vserver, then 40% to another, and they are both busy, they might end
> up both with empty buckets and an equal +15 penalty - effectively using
> 50/50 CPU and allocating very short timeslices, yielding poor batch
> performance.
> 
> So, with (possibly userland) policy monitoring for this sort of
> condition and adjusting bucket sizes and levels appropriately, that old
> "problem" that leads people to conclude that the VServer scheduler does
> not work could be solved - all without incurring major overhead even on
> very busy systems.
> 
> I think that the characteristics of these two approaches are subtly
> different. Both scale timeslices, but in a different way - instead of
> estimating the load and scaling back timeslices up front, busy Resource
> Groups are relied on to deplete their tokens in a timely manner, and get
> shorter slices allocated because of that. No doubt from 10,000 feet they
> both look the same.

Current 0(1) scheduler gives extra bonus for interactive tasks by
requeuing them to active array for a while. It would break
the controller's efforts. So, I'm planning to stop the interactive
task requeuing if the target share doesn't meet.

Are there a similar issue on the vserver scheduler?

> There is probably enough information here for an implementation, but
> I'll wait for feedback on this post before going any further with it.
> 
> Sam.

Thanks,
MAEDA Naoaki

next prev parent reply	other threads:[~2006-06-19  7:05 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-06-15 13:46 [RFC] CPU controllers? Srivatsa Vaddagiri
2006-06-15 21:52 ` Sam Vilain
2006-06-15 23:30 ` Peter Williams
2006-06-16  0:42   ` Matt Helsley
2006-06-17  8:48 ` Nick Piggin
2006-06-17 15:55   ` Balbir Singh
2006-06-17 16:48   ` Srivatsa Vaddagiri
2006-06-18  5:06     ` Nick Piggin
2006-06-18  5:53       ` Sam Vilain
2006-06-18  6:11         ` Nick Piggin
2006-06-18  6:40           ` Sam Vilain
2006-06-18  7:17             ` Nick Piggin
2006-06-18  6:42           ` Andrew Morton
2006-06-18  7:28             ` Nick Piggin
2006-06-19 19:03               ` Resource Management Requirements (was "[RFC] CPU controllers?") Chandra Seetharaman
2006-06-20  5:40                 ` Srivatsa Vaddagiri
2006-06-18  7:36             ` [RFC] CPU controllers? Mike Galbraith
2006-06-18  7:49               ` Nick Piggin
2006-06-18  7:49               ` Nick Piggin
2006-06-18  9:09               ` Andrew Morton
2006-06-18  9:49                 ` Mike Galbraith
2006-06-19  6:28                   ` Mike Galbraith
2006-06-19  6:35                     ` Andrew Morton
2006-06-19  6:46                       ` Mike Galbraith
2006-06-19 18:21               ` Chris Friesen
2006-06-20  6:20                 ` Mike Galbraith
2006-06-18  7:18         ` Srivatsa Vaddagiri
2006-06-19  2:07           ` Sam Vilain
2006-06-19  7:04             ` MAEDA Naoaki [this message]
2006-06-19  8:19               ` Sam Vilain
2006-06-19  8:41                 ` MAEDA Naoaki
2006-06-19  8:53                   ` Sam Vilain
2006-06-19 21:44                     ` MAEDA Naoaki
2006-06-19 18:14   ` Chris Friesen
2006-06-19 19:11     ` Chandra Seetharaman
2006-06-19 20:28       ` Chris Friesen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=44964C89.6060003@jp.fujitsu.com \
    --to=maeda.naoaki@jp.fujitsu.com \
    --cc=akpm@osdl.org \
    --cc=balbir@in.ibm.com \
    --cc=ckrm-tech@lists.sourceforge.net \
    --cc=dev@openvz.org \
    --cc=efault@gmx.de \
    --cc=kurosawa@valinux.co.jp \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=pwil3058@bigpond.net.au \
    --cc=sam@vilain.net \
    --cc=sekharan@us.ibm.com \
    --cc=vatsa@in.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.