Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Waiman Long <waiman.long@hpe.com>
To: bsegall@google.com
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, Yuyang Du <yuyang.du@intel.com>,
	Paul Turner <pjt@google.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>
Subject: Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Date: Thu, 03 Dec 2015 14:26:05 -0500	[thread overview]
Message-ID: <5660974D.7050006@hpe.com> (raw)
In-Reply-To: <xm26zixs4o13.fsf@sword-of-the-dawn.mtv.corp.google.com>

On 12/02/2015 03:02 PM, bsegall@google.com wrote:
> Waiman Long<Waiman.Long@hpe.com>  writes:
>
>> If a system with large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>>    10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>     8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>     8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>     8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>     6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>>     5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contains variables like shares, cfs_rq&  se which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline aligned kmemcache for task_group to make sure
>> that all the allocated task_group's are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>>     9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>     8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>     7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>     7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>     7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>     5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>     4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before. The
>> benchmark results before and after the patch was as follows:
>>
>>    Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>>    After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hpe.com>
>> ---
>>   kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
>>   kernel/sched/sched.h |    7 ++++++-
>>   2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>>    */
>>   struct task_group root_task_group;
>>   LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>>   #endif
>>
>>   DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>>   		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>   		ptr += nr_cpu_ids * sizeof(void **);
>>
>> +		task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
> The KMEM_CACHE macro suggests instead adding
> ____cacheline_aligned_in_smp to the struct definition instead.

The main goal is to have the load_avg placed in a new cacheline 
separated from the read-only fields above. That is why I placed 
____cacheline_aligned after load_avg. I omitted the in_smp part because 
it is in the SMP block already. Putting ____cacheline_aligned_in_smp 
won't guarantee alignment of any field within the structure.

I have done some test and having ____cacheline_aligned inside the 
structure has the same effect of forcing the whole structure in the 
cacheline aligned boundary.

>>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>>   #ifdef CONFIG_RT_GROUP_SCHED
>>   		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>>   /* task_group_lock serializes the addition/removal of task groups */
>>   static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>   static void free_sched_group(struct task_group *tg)
>>   {
>>   	free_fair_sched_group(tg);
>>   	free_rt_sched_group(tg);
>>   	autogroup_free(tg);
>> -	kfree(tg);
>> +	free_task_group(tg);
>>   }
>>
>>   /* allocate runqueue etc for a new task group */
>> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>>   {
>>   	struct task_group *tg;
>>
>> -	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>> +	tg = alloc_task_group();
>>   	if (!tg)
>>   		return ERR_PTR(-ENOMEM);
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index efd3bfc..e679895 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -248,7 +248,12 @@ struct task_group {
>>   	unsigned long shares;
>>
>>   #ifdef	CONFIG_SMP
>> -	atomic_long_t load_avg;
>> +	/*
>> +	 * load_avg can be heavily contended at clock tick time, so put
>> +	 * it in its own cacheline separated from the fields above which
>> +	 * will also be accessed at each tick.
>> +	 */
>> +	atomic_long_t load_avg ____cacheline_aligned;
>>   #endif
>>   #endif
> I suppose the question is if it would be better to just move this to
> wind up on a separate cacheline without the extra empty space, though it
> would likely be more fragile and unclear.

I have been thinking about that too. The problem is anything that will 
be in the same cacheline as load_avg and have to be accessed at clock 
click time will cause the same contention problem. In the current 
layout, the fields after load_avg are the rt stuff as well some list 
head structure and pointers. The rt stuff should be kind of mutually 
exclusive of the CFS load_avg in term of usage. The list head structure 
and pointers don't seem to be that frequently accessed. So it is the 
right place to start a new cacheline boundary.

Cheers,
Longman

next prev parent reply	other threads:[~2015-12-03 19:26 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-12-02 18:41 [PATCH v2 0/3] sched/fair: Reduce contention on tg's load_avg Waiman Long
2015-12-02 18:41 ` [PATCH v2 1/3] sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats() Waiman Long
2015-12-02 18:41 ` [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline Waiman Long
2015-12-02 20:02   ` bsegall
2015-12-03 19:26     ` Waiman Long [this message]
2015-12-03 19:41       ` bsegall
2015-12-03  4:32   ` Mike Galbraith
2015-12-03 19:34     ` Waiman Long
2015-12-04  2:07       ` Mike Galbraith
2015-12-04 20:19         ` Waiman Long
2015-12-03 10:56   ` Peter Zijlstra
2015-12-03 19:38     ` Waiman Long
2015-12-03 11:12   ` Peter Zijlstra
2015-12-03 17:56     ` bsegall
2015-12-03 18:17       ` Peter Zijlstra
2015-12-03 18:23         ` bsegall
2015-12-03 19:56     ` Waiman Long
2015-12-03 20:03       ` Peter Zijlstra
2015-12-04 11:57   ` [tip:sched/core] sched/fair: Move the cache-hot 'load_avg' variable " tip-bot for Waiman Long
2015-12-02 18:41 ` [PATCH v2 3/3] sched/fair: Disable tg load_avg update for root_task_group Waiman Long
2015-12-02 19:55   ` bsegall
2015-12-04 11:58   ` [tip:sched/core] sched/fair: Disable the task group load_avg update for the root_task_group tip-bot for Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5660974D.7050006@hpe.com \
    --to=waiman.long@hpe.com \
    --cc=bsegall@google.com \
    --cc=doug.hatch@hpe.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=morten.rasmussen@arm.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=scott.norton@hpe.com \
    --cc=yuyang.du@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox