public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	pjt@google.com, linux-kernel@vger.kernel.org, devel@openvz.org
Subject: Re: [PATCH v2] sched: move h_load calculation to task_h_load
Date: Tue, 16 Jul 2013 17:50:40 +0200	[thread overview]
Message-ID: <20130716155040.GO23818@dyad.programming.kicks-ass.net> (raw)
In-Reply-To: <1373896159-1278-1-git-send-email-vdavydov@parallels.com>

On Mon, Jul 15, 2013 at 05:49:19PM +0400, Vladimir Davydov wrote:
> The bad thing about update_h_load(), which computes the hierarchical
> load factor for task groups, is that it is called for every task group
> in the system before each load balancer run, and since rebalancing can
> be triggered very often, this function can consume a lot of CPU time
> if there are many cpu cgroups in the system.
> 
> Although the situation was improved significantly by commit a35b646
> ('sched, cgroup: Reduce rq->lock hold times for large cgroup
> hierarchies'), the problem still can arise under some kinds of loads,
> e.g. when cpus are switching from idle to busy and back very frequently.
> 
> For instance, when I start 1000 processes that wake up every
> millisecond on my 8-cpu host, 'top' and 'perf top' show:
> 
> Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
> Events: 243K cycles
>   7.57%  [kernel]               [k] __schedule
>   7.08%  [kernel]               [k] timerqueue_add
>   6.13%  libc-2.12.so           [.] usleep
> 
> Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
> usage increases significantly although the 'wakers' are still executing
> in the root cpu cgroup:
> 
> Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
> Events: 230K cycles
>  24.56%  [kernel]            [k] tg_load_down
>   5.76%  [kernel]            [k] __schedule
> 
> This happens because this particular kind of load triggers 'new idle'
> rebalance very frequently, which requires calling update_h_load(),
> which, in turn, calls tg_load_down() for every *idle* cpu cgroup. This
> is completely wasted work, because idle cpu cgroups have no tasks to
> pull.
> 
> This patch tries to improve the situation by making the h_load
> calculation run only when h_load is actually needed. To achieve this,
> it replaces update_h_load() with update_cfs_rq_h_load(), which
> computes h_load only for a given cfs_rq and all its ancestors, and
> makes the load balancer call this function whenever it considers
> whether a task should be pulled, i.e. it moves the h_load calculation
> directly into task_h_load(). To avoid updating the h_load of the same
> cfs_rq multiple times (in case several tasks in the same cgroup are
> considered during the same balance run), the patch records the time of
> the last h_load update for each cfs_rq and stops the calculation as
> soon as it finds an h_load that is already up to date.
> 
> The benefit is that h_load is computed only for those cfs_rqs that
> really need it; in particular, all idle task groups are skipped.
> Although this does move the h_load calculation under the rq lock, it
> should not affect latency much, because the amount of work done under
> the rq lock while trying to pull tasks is limited by sched_nr_migrate.
> 
> With the patch applied, using the setup described above (1000 wakers
> in the root cgroup and 10000 idle cgroups), I get:
> 
> Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
> Events: 242K cycles
>   7.57%  [kernel]                  [k] __schedule
>   6.70%  [kernel]                  [k] timerqueue_add
>   5.93%  libc-2.12.so              [.] usleep
> 
> Changes in v2:
>  * use jiffies instead of rq->clock for last_h_load_update.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

Thanks!
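For readers following the patch description above, the lazy, timestamped per-cfs_rq update it describes can be sketched roughly as follows. This is a simplified, standalone illustration only: the struct layout, the recursive walk toward the root, the "child gets half the parent's load" rule, and the nr_computed counter are stand-ins for exposition, not the actual kernel code or the diff under discussion.

```c
/* Toy sketch of the lazy h_load scheme: each cfs_rq caches its
 * hierarchical load together with the jiffies timestamp of the last
 * update, and recomputes only when a task's h_load is actually
 * queried.  Idle groups that nobody queries are never visited. */
#include <assert.h>
#include <stddef.h>

static unsigned long jiffies;   /* stand-in for the kernel tick counter */

struct cfs_rq {
	struct cfs_rq *parent;          /* NULL for the root cfs_rq */
	unsigned long load;             /* this group's own load (root only here) */
	unsigned long h_load;           /* cached hierarchical load */
	unsigned long last_h_load_update; /* jiffies of the last recompute */
	unsigned long nr_computed;      /* illustration only: recompute count */
};

/* Walk up toward the root, stopping as soon as a value cached during
 * the current jiffy is found, so each cfs_rq on the path is updated at
 * most once per balance run. */
static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
	if (cfs_rq->last_h_load_update == jiffies)
		return;                 /* already up to date this jiffy */

	if (cfs_rq->parent) {
		update_cfs_rq_h_load(cfs_rq->parent);
		/* toy rule: a child carries half its parent's h_load */
		cfs_rq->h_load = cfs_rq->parent->h_load / 2;
	} else {
		cfs_rq->h_load = cfs_rq->load;
	}
	cfs_rq->last_h_load_update = jiffies;
	cfs_rq->nr_computed++;
}

/* h_load is now computed on demand, at the point where the balancer
 * asks for a task's hierarchical load. */
static unsigned long task_h_load(struct cfs_rq *cfs_rq)
{
	update_cfs_rq_h_load(cfs_rq);
	return cfs_rq->h_load;
}
```

The point of the pattern is visible in the timestamps: repeated task_h_load() calls within one balance run (one jiffy here) hit the cache, and a new run invalidates it implicitly just by jiffies having advanced, with no per-group work for groups that are never queried.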


Thread overview: 7+ messages
2013-07-13  8:47 [PATCH RFC] sched: move h_load calculation to task_h_load Vladimir Davydov
2013-07-15  8:28 ` Peter Zijlstra
2013-07-15 10:00   ` Vladimir Davydov
2013-07-15 10:59     ` Peter Zijlstra
2013-07-15 13:49       ` [PATCH v2] " Vladimir Davydov
2013-07-16 15:50         ` Peter Zijlstra [this message]
2013-07-24  3:56         ` [tip:perf/core] sched: Move h_load calculation to task_h_load() tip-bot for Vladimir Davydov
