From: tip-bot for Waiman Long <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: torvalds@linux-foundation.org, hpa@zytor.com, mingo@kernel.org,
pjt@google.com, peterz@infradead.org, Waiman.Long@hpe.com,
efault@gmx.de, bsegall@google.com, linux-kernel@vger.kernel.org,
doug.hatch@hpe.com, tglx@linutronix.de, yuyang.du@intel.com,
morten.rasmussen@arm.com, scott.norton@hpe.com
Subject: [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline
Date: Fri, 4 Dec 2015 03:57:42 -0800 [thread overview]
Message-ID: <tip-b0367629acf62a78404c467cd09df447c2fea804@git.kernel.org> (raw)
In-Reply-To: <1449081710-20185-3-git-send-email-Waiman.Long@hpe.com>
Commit-ID: b0367629acf62a78404c467cd09df447c2fea804
Gitweb: http://git.kernel.org/tip/b0367629acf62a78404c467cd09df447c2fea804
Author: Waiman Long <Waiman.Long@hpe.com>
AuthorDate: Wed, 2 Dec 2015 13:41:49 -0500
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 4 Dec 2015 10:34:48 +0100
sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline
If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.
Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:
10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contains variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.
This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.
By doing so, the perf profile became:
9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch was as follows:
Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 10 +++++++---
kernel/sched/sched.h | 7 ++++++-
2 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d591db1..aa3f978 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7370,6 +7370,9 @@ int in_sched_functions(unsigned long addr)
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
#endif
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7427,11 +7430,12 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */
#ifdef CONFIG_CGROUP_SCHED
+ task_group_cache = KMEM_CACHE(task_group, 0);
+
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
autogroup_init(&init_task);
-
#endif /* CONFIG_CGROUP_SCHED */
for_each_possible_cpu(i) {
@@ -7712,7 +7716,7 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
- kfree(tg);
+ kmem_cache_free(task_group_cache, tg);
}
/* allocate runqueue etc for a new task group */
@@ -7720,7 +7724,7 @@ struct task_group *sched_create_group(struct task_group *parent)
{
struct task_group *tg;
- tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+ tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
if (!tg)
return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 472cd14..a5a6b3e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;
#ifdef CONFIG_SMP
- atomic_long_t load_avg;
+ /*
+ * load_avg can be heavily contended at clock tick time, so put
+ * it in its own cacheline separated from the fields above which
+ * will also be accessed at each tick.
+ */
+ atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif
next prev parent reply other threads:[~2015-12-04 11:58 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
2015-12-02 20:02 ` bsegall
2015-12-03 19:26 ` Waiman Long
2015-12-03 19:41 ` bsegall
2015-12-03 4:32 ` Mike Galbraith
2015-12-03 19:34 ` Waiman Long
2015-12-04 2:07 ` Mike Galbraith
2015-12-04 20:19 ` Waiman Long
2015-12-03 10:56 ` Peter Zijlstra
2015-12-03 19:38 ` Waiman Long
2015-12-03 11:12 ` Peter Zijlstra
2015-12-03 17:56 ` bsegall
2015-12-03 18:17 ` Peter Zijlstra
2015-12-03 18:23 ` bsegall
2015-12-03 19:56 ` Waiman Long
2015-12-03 20:03 ` Peter Zijlstra
2015-12-04 11:57 ` tip-bot for Waiman Long [this message]
2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
2015-12-02 19:55 ` bsegall
2015-12-04 11:58 ` [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group tip-bot for Waiman Long
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=tip-b0367629acf62a78404c467cd09df447c2fea804@git.kernel.org \
--to=tipbot@zytor.com \
--cc=Waiman.Long@hpe.com \
--cc=bsegall@google.com \
--cc=doug.hatch@hpe.com \
--cc=efault@gmx.de \
--cc=hpa@zytor.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-tip-commits@vger.kernel.org \
--cc=mingo@kernel.org \
--cc=morten.rasmussen@arm.com \
--cc=peterz@infradead.org \
--cc=pjt@google.com \
--cc=scott.norton@hpe.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=yuyang.du@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.