From: Waiman Long <waiman.long@hpe.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>,
linux-kernel@vger.kernel.org,
Scott J Norton <scott.norton@hpe.com>,
Douglas Hatch <doug.hatch@hpe.com>, Paul Turner <pjt@google.com>,
Ben Segall <bsegall@google.com>,
Morten Rasmussen <morten.rasmussen@arm.com>,
Yuyang Du <yuyang.du@intel.com>
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
Date: Mon, 30 Nov 2015 14:13:32 -0500 [thread overview]
Message-ID: <565C9FDC.9020603@hpe.com> (raw)
In-Reply-To: <20151130102240.GH17308@twins.programming.kicks-ass.net>
On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
> Please always Cc the people who wrote the code.
>
> +CC pjt, ben, morten, yuyang
Sorry about that. Their names didn't show up when I ran get_maintainer.pl.
> On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
>> The load_avg statistical counter is only changed if the load on a CPU
>> deviates significantly from the previous tick. So there are usually
>> more readers than writers of load_avg. Still, on a large system,
>> the cacheline contention can cause significant slowdown and impact
>> performance.
>>
>> This patch attempts to separate those load_avg readers
>> (update_cfs_shares) and writers (task_tick_fair) to use different
>> cachelines instead. Writers of load_avg will now accumulate the
>> load delta into load_avg_delta, which sits in a different cacheline.
>> If load_avg_delta is sufficiently large (> load_avg/64), it will then
>> be added back to load_avg.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
>> 480 threads), the perf profile before the patch was:
>>
>> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
>> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
>> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>>
>> After the patch, it became:
>>
>> 2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
>> 2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
>> 2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
>> 2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
>> 1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
>> 0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
>> 0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares
> This begs the question though; why are you running a global load in a
> cgroup; and do we really need to update this for the root cgroup? It
> seems to me we don't need calc_tg_weight() for the root cgroup, it
> doesn't need to normalize its weight numbers.
>
> That is; isn't this simply a problem we should avoid?
I didn't use any cgroup in my test setup. Autogroup was enabled, though.
Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56
times.
>> The benchmark results before and after the patch were as follows:
>>
>> Before patch - Max-jOPs: 916011 Critical-jOps: 142366
>> After patch  - Max-jOPs: 939130 Critical-jOps: 211937
>>
>> There was a significant improvement in Critical-jOps, which is
>> latency-sensitive.
>>
>> This patch does introduce additional delay in getting the real load
>> average reflected in load_avg. It may also incur additional overhead
>> if the number of CPUs in a task group is small. As a result, this
>> change is only activated when running on 4-socket or larger systems,
>> which can get the most benefit from it.
> So I'm not particularly charmed by this; it rather makes a mess of
> things. Also this really wants a run of the cgroup fairness test thingy
> pjt/ben have somewhere.
I will be glad to run any additional tests, if necessary. I do need
pointers to those tests, though.
>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
>> ---
>> kernel/sched/core.c | 9 +++++++++
>> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
>> kernel/sched/sched.h | 8 ++++++++
>> 3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..f3075da 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
>> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>> ptr += nr_cpu_ids * sizeof(void **);
>>
>> +#ifdef CONFIG_SMP
>> + /*
>> + * Use load_avg_delta only on systems larger than 2P (more than 2 nodes)
>> + */
>> + root_task_group.use_la_delta = (num_possible_nodes() > 2);
>> +#endif /* CONFIG_SMP */
>> #endif /* CONFIG_FAIR_GROUP_SCHED */
>> #ifdef CONFIG_RT_GROUP_SCHED
>> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
>> if (!alloc_rt_sched_group(tg, parent))
>> goto err;
>>
>> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
>> + tg->use_la_delta = root_task_group.use_la_delta;
>> +#endif
>> return tg;
>>
>> err:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8f1eccc..44732cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>> #ifdef CONFIG_FAIR_GROUP_SCHED
>> /*
>> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
>> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
>> * and effective_load (which is not done because it is too costly).
>> + *
>> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
>> + * accumulated into the load_avg_delta variable instead, reducing cacheline
>> + * contention on load_avg at the expense of more delay in reflecting the real
>> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
>> + * cachelines. With that flag set, load_avg will be read-mostly whereas
>> + * load_avg_delta will be write-mostly.
>> */
>> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
>> {
>> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>>
>> if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>> - atomic_long_add(delta, &cfs_rq->tg->load_avg);
>> + struct task_group *tg = cfs_rq->tg;
>> + long load_avg, tot_delta;
>> +
>> + if (!tg->use_la_delta) {
>> + /*
>> + * If the use_la_delta isn't set, just add the
>> + * delta directly into load_avg.
>> + */
>> + atomic_long_add(delta, &tg->load_avg);
>> + goto set_contrib;
>> + }
>> +
>> + tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
>> + load_avg = atomic_long_read(&tg->load_avg);
>> + if (abs(tot_delta) > load_avg / 64) {
>> + tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
>> + if (tot_delta)
>> + atomic_long_add(tot_delta, &tg->load_avg);
>> + }
>> +set_contrib:
>> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>> }
>> }
> I'm thinking that it's now far too big to retain the inline qualifier.
I can take the inline keyword out.
>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e679895..aef4e4e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -252,8 +252,16 @@ struct task_group {
>> * load_avg can be heavily contended at clock tick time, so put
>> * it in its own cacheline separated from the fields above which
>> * will also be accessed at each tick.
>> + *
>> + * The use_la_delta flag, if set, will enable the use of load_avg_delta
>> + * to accumulate the delta and only change load_avg when the delta
>> + * is big enough. This reduces the cacheline contention on load_avg.
>> + * This flag will be set at allocation time depending on the system
>> + * configuration.
>> */
>> + int use_la_delta;
>> atomic_long_t load_avg ____cacheline_aligned;
>> + atomic_long_t load_avg_delta ____cacheline_aligned;
> This would only work if the structure itself is allocated with cacheline
> alignment, and looking at sched_create_group(), we use a plain kzalloc()
> for this, which doesn't guarantee any sort of alignment beyond machine
> word size IIRC.
With a RHEL 6-derived .config file, the size of the task_group structure
was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag
increased the size to 512 bytes. So it did make the structure a multiple
of the cacheline size. With both slub and slab, the allocated task group
pointers from kzalloc() in sched_create_group() were all multiples of
0x200. So they were properly aligned for the ____cacheline_aligned tag
to work.
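To illustrate the layout, here is a minimal userspace sketch (the
atomic_long_t stand-in and the 64-byte line size are assumptions for the
example, not the kernel definitions):

#include <stdio.h>
#include <stddef.h>

/* stand-ins so the sketch compiles outside the kernel */
typedef struct { long counter; } atomic_long_t;
#define ____cacheline_aligned __attribute__((__aligned__(64)))

struct tg_like {
        int use_la_delta;
        atomic_long_t load_avg ____cacheline_aligned;        /* read-mostly line */
        atomic_long_t load_avg_delta ____cacheline_aligned;  /* write-mostly line */
};

int main(void)
{
        /*
         * The struct size is a multiple of 64 and the two counters sit
         * 64 bytes apart, so they never share a cacheline as long as the
         * object itself is allocated with at least 64-byte alignment.
         */
        printf("size=%zu load_avg@%zu load_avg_delta@%zu\n",
               sizeof(struct tg_like),
               offsetof(struct tg_like, load_avg),
               offsetof(struct tg_like, load_avg_delta));
        return 0;
}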
> Also, you unconditionally grow the structure by a whole cacheline.
I know it is a drawback of using the ____cacheline_aligned tag. However, we
probably won't create too many task groups in normal use; with 64-byte
cachelines the growth is roughly one cacheline per task_group, so even the
56 groups created at boot in my test add only a few kilobytes. The increase
in memory consumption shouldn't be noticeable.
Cheers,
Longman