From: Peter Zijlstra <peterz@infradead.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ingo Molnar <mingo@redhat.com>,
pjt@google.com, linux-kernel@vger.kernel.org, devel@openvz.org
Subject: Re: [PATCH v2] sched: move h_load calculation to task_h_load
Date: Tue, 16 Jul 2013 17:50:40 +0200
Message-ID: <20130716155040.GO23818@dyad.programming.kicks-ass.net>
In-Reply-To: <1373896159-1278-1-git-send-email-vdavydov@parallels.com>
On Mon, Jul 15, 2013 at 05:49:19PM +0400, Vladimir Davydov wrote:
> The bad thing about update_h_load(), which computes the hierarchical
> load factor for task groups, is that it is called for each task group in
> the system before every load balancer run, and since rebalance can be
> triggered very often, this function can eat a lot of cpu time if there
> are many cpu cgroups in the system.
>
> Although the situation was improved significantly by commit a35b646
> ('sched, cgroup: Reduce rq->lock hold times for large cgroup
> hierarchies'), the problem can still arise under some kinds of loads,
> e.g. when cpus are switching from idle to busy and back very frequently.
>
> For instance, when I start 1000 processes that wake up every
> millisecond on my 8-cpu host, 'top' and 'perf top' show:
>
> Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si
> Events: 243K cycles
> 7.57% [kernel] [k] __schedule
> 7.08% [kernel] [k] timerqueue_add
> 6.13% libc-2.12.so [.] usleep
>
> Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
> usage increases significantly although the 'wakers' are still executing
> in the root cpu cgroup:
>
> Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si
> Events: 230K cycles
> 24.56% [kernel] [k] tg_load_down
> 5.76% [kernel] [k] __schedule
>
> This happens because this particular kind of load triggers 'new idle'
> rebalance very frequently, which requires calling update_h_load(),
> which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
> though it is absolutely useless, because idle cpu cgroups have no tasks
> to pull.
>
> This patch tries to improve the situation by making h_load calculation
> proceed only when h_load is really necessary. To achieve this, it
> substitutes update_h_load() with update_cfs_rq_h_load(), which computes
> h_load only for a given cfs_rq and all its ancestors, and makes the
> load balancer call this function whenever it considers whether a task
> should be pulled, i.e. it moves h_load calculations directly to
> task_h_load(). So that h_load of the same cfs_rq is not updated multiple
> times (in case several tasks in the same cgroup are considered during
> the same balance run), the patch keeps the time of the last h_load
> update for each cfs_rq and stops the calculation when it finds h_load to
> be up to date.
>
> The benefit is that h_load is computed only for those cfs_rq's that
> really need it; in particular, all idle task groups are skipped.
> Although this, in fact, moves h_load calculation under rq lock, it
> should not affect latency much, because the amount of work done under rq
> lock while trying to pull tasks is limited by sched_nr_migrate.
>
> With the patch applied and the setup described above (1000 wakers in
> the root cgroup and 10000 idle cgroups), I get:
>
> Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si
> Events: 242K cycles
> 7.57% [kernel] [k] __schedule
> 6.70% [kernel] [k] timerqueue_add
> 5.93% libc-2.12.so [.] usleep
>
> Changes in v2:
> * use jiffies instead of rq->clock for last_h_load_update.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Thanks!
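
For readers following the thread, the lazy scheme described in the
changelog can be sketched roughly as below. This is a userspace model
only, not the patch itself: struct model_cfs_rq, its fields,
model_update_h_load(), model_task_h_load() and the model_recomputed
counter are all hypothetical names made up for illustration, and "now"
stands in for whatever timestamp is used (jiffies in v2). The "+ 1" in
the divisions is just a divide-by-zero guard in this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-cpu cfs_rq; not the kernel struct. */
struct model_cfs_rq {
	struct model_cfs_rq *parent;  /* NULL for the root */
	struct model_cfs_rq *next;    /* child to visit on the way back down */
	unsigned long load;           /* total load queued on this cfs_rq */
	unsigned long contrib;        /* this group's load in its parent's queue */
	unsigned long h_load;         /* cached hierarchical load */
	unsigned long last_update;    /* 0 = never updated, so "now" must be >= 1 */
};

/* Counts how many cfs_rq's were actually recomputed (illustration only). */
static unsigned int model_recomputed;

/* Refresh h_load for cfs and its ancestors, skipping anything fresh. */
static void model_update_h_load(struct model_cfs_rq *cfs, unsigned long now)
{
	struct model_cfs_rq *child;

	assert(cfs != NULL);
	if (cfs->last_update == now)
		return;

	/* Walk up, remembering the path, until the root or a fresh ancestor. */
	cfs->next = NULL;
	while (cfs->parent) {
		cfs->parent->next = cfs;
		cfs = cfs->parent;
		if (cfs->last_update == now)
			break;
	}

	/* Only a stale root can get here without a fresh h_load. */
	if (cfs->last_update != now) {
		cfs->h_load = cfs->load;
		cfs->last_update = now;
		model_recomputed++;
	}

	/* Walk back down the recorded path, scaling the load at each level. */
	while ((child = cfs->next) != NULL) {
		child->h_load = cfs->h_load * child->contrib / (cfs->load + 1);
		child->last_update = now;
		model_recomputed++;
		cfs = child;
	}
}

/* Task's share of the hierarchical load; updates h_load on demand. */
static unsigned long model_task_h_load(struct model_cfs_rq *cfs,
				       unsigned long task_load,
				       unsigned long now)
{
	model_update_h_load(cfs, now);
	return task_load * cfs->h_load / (cfs->load + 1);
}
```

In this model a group that is never asked about (an idle cgroup) is
never touched, and a second query for the same cfs_rq within the same
"now" returns immediately.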
Thread overview:
2013-07-13 8:47 [PATCH RFC] sched: move h_load calculation to task_h_load Vladimir Davydov
2013-07-15 8:28 ` Peter Zijlstra
2013-07-15 10:00 ` Vladimir Davydov
2013-07-15 10:59 ` Peter Zijlstra
2013-07-15 13:49 ` [PATCH v2] " Vladimir Davydov
2013-07-16 15:50 ` Peter Zijlstra [this message]
2013-07-24 3:56 ` [tip:perf/core] sched: Move h_load calculation to task_h_load() tip-bot for Vladimir Davydov