From: Peter Zijlstra <peterz@infradead.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>,
pjt@google.com, linux-kernel@vger.kernel.org, devel@openvz.org
Subject: Re: [PATCH v2] sched: move h_load calculation to task_h_load
Date: Tue, 16 Jul 2013 17:50:40 +0200
Message-ID: <20130716155040.GO23818@dyad.programming.kicks-ass.net>
In-Reply-To: <1373896159-1278-1-git-send-email-vdavydov@parallels.com>
On Mon, Jul 15, 2013 at 05:49:19PM +0400, Vladimir Davydov wrote:
> The bad thing about update_h_load(), which computes hierarchical load
> factor for task groups, is that it is called for each task group in the
> system before every load balancer run, and since rebalance can be
> triggered very often, this function can consume a lot of CPU time if
> there are many cpu cgroups in the system.
>
> Although the situation was improved significantly by commit a35b646
> ('sched, cgroup: Reduce rq->lock hold times for large cgroup
> hierarchies'), the problem can still arise under some kinds of loads,
> e.g. when cpus are switching from idle to busy and back very frequently.
>
> For instance, when I start 1000 processes that wake up every
> millisecond on my 8-CPU host, 'top' and 'perf top' show:
>
> Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si
> Events: 243K cycles
> 7.57% [kernel] [k] __schedule
> 7.08% [kernel] [k] timerqueue_add
> 6.13% libc-2.12.so [.] usleep
>
> Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
> usage increases significantly although the 'wakers' are still executing
> in the root cpu cgroup:
>
> Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si
> Events: 230K cycles
> 24.56% [kernel] [k] tg_load_down
> 5.76% [kernel] [k] __schedule
>
> This happens because this particular kind of load triggers 'new idle'
> rebalance very frequently, which requires calling update_h_load(),
> which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
> though it is absolutely useless, because idle cpu cgroups have no tasks
> to pull.
>
> This patch tries to improve the situation by making h_load calculation
> proceed only when h_load is really necessary. To achieve this, it
> substitutes update_h_load() with update_cfs_rq_h_load(), which computes
> h_load only for a given cfs_rq and all its ancestors, and makes the
> load balancer call this function whenever it considers whether a task
> should be pulled, i.e. it moves h_load calculations directly to
> task_h_load(). So that h_load of the same cfs_rq is not updated multiple
> times (in case several tasks in the same cgroup are considered during
> the same balance run), the patch keeps the time of the last h_load
> update for each cfs_rq and stops the calculation when it finds h_load
> to be up to date.
>
> The benefit is that h_load is computed only for those cfs_rq's that
> really need it; in particular, all idle task groups are skipped.
> Although this, in fact, moves h_load calculation under rq lock, it
> should not affect latency much, because the amount of work done under rq
> lock while trying to pull tasks is limited by sched_nr_migrate.
>
> With the patch applied and the setup described above (1000 wakers in
> the root cgroup and 10000 idle cgroups), I get:
>
> Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si
> Events: 242K cycles
> 7.57% [kernel] [k] __schedule
> 6.70% [kernel] [k] timerqueue_add
> 5.93% libc-2.12.so [.] usleep
>
> Changes in v2:
> * use jiffies instead of rq->clock for last_h_load_update.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Thanks!
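
For readers of the archive, the lazy scheme described in the changelog
above can be modelled roughly as follows. This is a user-space sketch
only, not the code from the patch: struct toy_cfs_rq, its fields, and the
toy_* helpers are hypothetical names, the tick argument stands in for
jiffies, and recursion is used purely for brevity. The only ideas taken
from the changelog are that h_load is computed on demand for one cfs_rq
and its ancestors, and that a per-cfs_rq timestamp stops the walk once an
up-to-date value is found.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-cgroup runqueue. */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;   /* NULL for the root queue */
	unsigned long se_load;       /* load this group contributes to its parent */
	unsigned long runnable_load; /* total load queued on this queue */
	unsigned long h_load;        /* cached hierarchical load */
	unsigned long last_update;   /* tick of the last h_load update, 0 = never */
};

/* Counts how many queues were actually recomputed (for illustration). */
static unsigned long toy_updates;

/*
 * Compute h_load for rq and, if needed, its ancestors. Queues already
 * updated during tick 'now' are not recomputed. Callers must pass
 * now >= 1, because 0 marks "never updated" in this sketch.
 */
static void toy_update_h_load(struct toy_cfs_rq *rq, unsigned long now)
{
	assert(rq != NULL);

	if (rq->last_update == now)
		return; /* already up to date for this tick */

	if (!rq->parent) {
		/* The root's hierarchical load is just its own load. */
		rq->h_load = rq->runnable_load;
	} else {
		/* Scale the parent's h_load by this group's share of it. */
		toy_update_h_load(rq->parent, now);
		rq->h_load = rq->parent->h_load * rq->se_load /
			     (rq->parent->runnable_load + 1);
	}

	rq->last_update = now;
	toy_updates++;
}

/* Hierarchical load of one task with weight task_load queued on rq. */
static unsigned long toy_task_h_load(struct toy_cfs_rq *rq,
				     unsigned long task_load,
				     unsigned long now)
{
	toy_update_h_load(rq, now);
	return task_load * rq->h_load / (rq->runnable_load + 1);
}
```

In this model, a group whose tasks are never considered for migration is
never passed to toy_task_h_load(), so its h_load is never touched; that
is the property the changelog relies on to skip idle cgroups. Asking
about several tasks of the same group within one tick recomputes nothing
after the first call.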
Thread overview: 7+ messages
2013-07-13 8:47 [PATCH RFC] sched: move h_load calculation to task_h_load Vladimir Davydov
2013-07-15 8:28 ` Peter Zijlstra
2013-07-15 10:00 ` Vladimir Davydov
2013-07-15 10:59 ` Peter Zijlstra
2013-07-15 13:49 ` [PATCH v2] " Vladimir Davydov
2013-07-16 15:50 ` Peter Zijlstra [this message]
2013-07-24 3:56 ` [tip:perf/core] sched: Move h_load calculation to task_h_load() tip-bot for Vladimir Davydov