From: Waiman Long <waiman.long@hpe.com>
To: bsegall@google.com
Cc: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, Yuyang Du <yuyang.du@intel.com>,
Paul Turner <pjt@google.com>,
Morten Rasmussen <morten.rasmussen@arm.com>,
Scott J Norton <scott.norton@hpe.com>,
Douglas Hatch <doug.hatch@hpe.com>
Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Date: Thu, 03 Dec 2015 14:26:05 -0500 [thread overview]
Message-ID: <5660974D.7050006@hpe.com> (raw)
In-Reply-To: <xm26zixs4o13.fsf@sword-of-the-dawn.mtv.corp.google.com>
On 12/02/2015 03:02 PM, bsegall@google.com wrote:
> Waiman Long<Waiman.Long@hpe.com> writes:
>
>> If a system with large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>> 10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
>> 5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contains variables like shares, cfs_rq& se which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline aligned kmemcache for task_group to make sure
>> that all the allocated task_group's are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
>> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before. The
>> benchmark results before and after the patch was as follows:
>>
>> Before patch - Max-jOPs: 907533 Critical-jOps: 134877
>> After patch - Max-jOPs: 916011 Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>> kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
>> kernel/sched/sched.h | 7 ++++++-
>> 2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>> */
>> struct task_group root_task_group;
>> LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>> #endif
>>
>> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>> ptr += nr_cpu_ids * sizeof(void **);
>>
>> + task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
> The KMEM_CACHE macro suggests instead adding
> ____cacheline_aligned_in_smp to the struct definition instead.
The main goal is to have the load_avg placed in a new cacheline
separated from the read-only fields above. That is why I placed
____cacheline_aligned after load_avg. I omitted the in_smp part because
it is in the SMP block already. Putting ____cacheline_aligned_in_smp
won't guarantee alignment of any field within the structure.
I have done some test and having ____cacheline_aligned inside the
structure has the same effect of forcing the whole structure in the
cacheline aligned boundary.
>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>> #ifdef CONFIG_RT_GROUP_SCHED
>> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>> /* task_group_lock serializes the addition/removal of task groups */
>> static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> + kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> + kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>> static void free_sched_group(struct task_group *tg)
>> {
>> free_fair_sched_group(tg);
>> free_rt_sched_group(tg);
>> autogroup_free(tg);
>> - kfree(tg);
>> + free_task_group(tg);
>> }
>>
>> /* allocate runqueue etc for a new task group */
>> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>> {
>> struct task_group *tg;
>>
>> - tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>> + tg = alloc_task_group();
>> if (!tg)
>> return ERR_PTR(-ENOMEM);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index efd3bfc..e679895 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -248,7 +248,12 @@ struct task_group {
>> unsigned long shares;
>>
>> #ifdef CONFIG_SMP
>> - atomic_long_t load_avg;
>> + /*
>> + * load_avg can be heavily contended at clock tick time, so put
>> + * it in its own cacheline separated from the fields above which
>> + * will also be accessed at each tick.
>> + */
>> + atomic_long_t load_avg ____cacheline_aligned;
>> #endif
>> #endif
> I suppose the question is if it would be better to just move this to
> wind up on a separate cacheline without the extra empty space, though it
> would likely be more fragile and unclear.
I have been thinking about that too. The problem is anything that will
be in the same cacheline as load_avg and have to be accessed at clock
click time will cause the same contention problem. In the current
layout, the fields after load_avg are the rt stuff as well some list
head structure and pointers. The rt stuff should be kind of mutually
exclusive of the CFS load_avg in term of usage. The list head structure
and pointers don't seem to be that frequently accessed. So it is the
right place to start a new cacheline boundary.
Cheers,
Longman
next prev parent reply other threads:[~2015-12-03 19:26 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
2015-12-02 20:02 ` bsegall
2015-12-03 19:26 ` Waiman Long [this message]
2015-12-03 19:41 ` bsegall
2015-12-03 4:32 ` Mike Galbraith
2015-12-03 19:34 ` Waiman Long
2015-12-04 2:07 ` Mike Galbraith
2015-12-04 20:19 ` Waiman Long
2015-12-03 10:56 ` Peter Zijlstra
2015-12-03 19:38 ` Waiman Long
2015-12-03 11:12 ` Peter Zijlstra
2015-12-03 17:56 ` bsegall
2015-12-03 18:17 ` Peter Zijlstra
2015-12-03 18:23 ` bsegall
2015-12-03 19:56 ` Waiman Long
2015-12-03 20:03 ` Peter Zijlstra
2015-12-04 11:57 ` [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable " tip-bot for Waiman Long
2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
2015-12-02 19:55 ` bsegall
2015-12-04 11:58 ` [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group tip-bot for Waiman Long
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5660974D.7050006@hpe.com \
--to=waiman.long@hpe.com \
--cc=bsegall@google.com \
--cc=doug.hatch@hpe.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=morten.rasmussen@arm.com \
--cc=peterz@infradead.org \
--cc=pjt@google.com \
--cc=scott.norton@hpe.com \
--cc=yuyang.du@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.