From: Peter Zijlstra <peterz@infradead.org>
To: Waiman Long <Waiman.Long@hpe.com>
Cc: Ingo Molnar <mingo@redhat.com>,
	linux-kernel@vger.kernel.org,
	Scott J Norton <scott.norton@hpe.com>,
	Douglas Hatch <doug.hatch@hpe.com>, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Yuyang Du <yuyang.du@intel.com>
Subject: Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg
Date: Mon, 30 Nov 2015 11:22:40 +0100
Message-ID: <20151130102240.GH17308@twins.programming.kicks-ass.net>
In-Reply-To: <1448478580-26467-4-git-send-email-Waiman.Long@hpe.com>

Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
> The load_avg statistical counter is only changed if the load on a CPU
> deviates significantly from the previous tick. So there are usually
> more readers than writers of load_avg. Still, on a large system,
> the cacheline contention can cause significant slowdown and impact
> performance.
>
> This patch attempts to separate those load_avg readers
> (update_cfs_shares) and writers (task_tick_fair) to use different
> cachelines instead. Writers of load_avg will now accumulate the
> load delta into load_avg_delta which sits in a different cacheline.
> If load_avg_delta is sufficiently large (> load_avg/64), it will then
> be added back to load_avg.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
> 480 threads), the perf profile before the patch was:
>
> 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
> 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
> 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
> 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
>
> After the patch, it became:
>
> 2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
> 2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
> 2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
> 2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
> 1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
> 0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
> 0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares

This begs the question though; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.

That is; isn't this simply a problem we should avoid?

> The benchmark results before and after the patch were as follows:
>
> Before patch - Max-jOPs: 916011 Critical-jOps: 142366
> After patch  - Max-jOPs: 939130 Critical-jOps: 211937
>
> There was significant improvement in Critical-jOps which was latency
> sensitive.
>
> This patch does introduce additional delay in getting the real load
> average reflected in load_avg. It may also incur additional overhead
> if the number of CPUs in a task group is small. As a result, this
> change is only activated when running on 4-socket or larger systems,
> which can get the most benefit from it.

So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
> ---
> kernel/sched/core.c | 9 +++++++++
> kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
> kernel/sched/sched.h | 8 ++++++++
> 3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..f3075da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7356,6 +7356,12 @@ void __init sched_init(void)
> root_task_group.cfs_rq = (struct cfs_rq **)ptr;
> ptr += nr_cpu_ids * sizeof(void **);
>
> +#ifdef CONFIG_SMP
> + /*
> + * Use load_avg_delta if not 2P or less
> + */
> + root_task_group.use_la_delta = (num_possible_nodes() > 2);
> +#endif /* CONFIG_SMP */
> #endif /* CONFIG_FAIR_GROUP_SCHED */
> #ifdef CONFIG_RT_GROUP_SCHED
> root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
> if (!alloc_rt_sched_group(tg, parent))
> goto err;
>
> +#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
> + tg->use_la_delta = root_task_group.use_la_delta;
> +#endif
> return tg;
>
> err:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8f1eccc..44732cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> /*
> - * Updating tg's load_avg is necessary before update_cfs_share (which is done)
> + * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
> * and effective_load (which is not done because it is too costly).
> + *
> + * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
> + * accumulated into the load_avg_delta variable instead to reduce cacheline
> + * contention on load_avg at the expense of more delay in reflecting the real
> + * load_avg. The tg's load_avg and load_avg_delta variables are in separate
> + * cachelines. With that flag set, load_avg will be read mostly whereas
> + * load_avg_delta will be write mostly.
> */
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> {
> long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>
> if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> - atomic_long_add(delta, &cfs_rq->tg->load_avg);
> + struct task_group *tg = cfs_rq->tg;
> + long load_avg, tot_delta;
> +
> + if (!tg->use_la_delta) {
> + /*
> + * If the use_la_delta isn't set, just add the
> + * delta directly into load_avg.
> + */
> + atomic_long_add(delta, &tg->load_avg);
> + goto set_contrib;
> + }
> +
> + tot_delta = atomic_long_add_return(delta, &tg->load_avg_delta);
> + load_avg = atomic_long_read(&tg->load_avg);
> + if (abs(tot_delta) > load_avg / 64) {
> + tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
> + if (tot_delta)
> + atomic_long_add(tot_delta, &tg->load_avg);
> + }
> +set_contrib:
> cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> }
> }

I'm thinking that it's now far too big to retain the inline qualifier.

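
For readers following the hunk above, the accumulate-then-flush control flow can be sketched in userspace with C11 atomics. This is only an illustration, not kernel code: the names tg_sketch and tg_sketch_add are hypothetical, the 64-byte line size is an assumption, and atomic_fetch_add()/atomic_exchange() merely stand in for the kernel's atomic_long_add_return()/atomic_long_xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two task_group counters (64-byte lines assumed). */
struct tg_sketch {
	_Alignas(64) atomic_long load_avg;		/* read-mostly */
	_Alignas(64) atomic_long load_avg_delta;	/* write-mostly */
};

/* The member offsets, at least, must be a full line apart. */
static_assert(offsetof(struct tg_sketch, load_avg_delta) -
	      offsetof(struct tg_sketch, load_avg) >= 64,
	      "counters must not share a cacheline");

/*
 * Mirror of the patched path with use_la_delta set: accumulate into
 * load_avg_delta, and only write load_avg once the accumulated total
 * exceeds 1/64 of the load_avg value that was just read.
 */
static void tg_sketch_add(struct tg_sketch *tg, long delta)
{
	/* atomic_fetch_add() returns the old value; add delta for the new total. */
	long tot = atomic_fetch_add(&tg->load_avg_delta, delta) + delta;
	long avg = atomic_load(&tg->load_avg);

	if (labs(tot) > avg / 64) {
		/* Claim whatever has accumulated; a racing caller may already have taken it. */
		tot = atomic_exchange(&tg->load_avg_delta, 0);
		if (tot)
			atomic_fetch_add(&tg->load_avg, tot);
	}
}
```

With load_avg at 6400 the flush threshold is 100, so a delta of 50 stays parked in load_avg_delta and a following delta of 60 pushes the combined 110 into load_avg. Until that flush happens, readers of load_avg see a value that lags the true sum, which is the extra delay the changelog mentions.
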
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e679895..aef4e4e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -252,8 +252,16 @@ struct task_group {
> * load_avg can be heavily contended at clock tick time, so put
> * it in its own cacheline separated from the fields above which
> * will also be accessed at each tick.
> + *
> + * The use_la_delta flag, if set, will enable the use of load_avg_delta
> + * to accumulate the delta and only change load_avg when the delta
> + * is big enough. This reduces the cacheline contention on load_avg.
> + * This flag will be set at allocation time depending on the system
> + * configuration.
> */
> + int use_la_delta;
> atomic_long_t load_avg ____cacheline_aligned;
> + atomic_long_t load_avg_delta ____cacheline_aligned;

This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.

Also, you unconditionally grow the structure by a whole cacheline.

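
The alignment point can be illustrated with a small userspace analogue. This is a sketch under stated assumptions, not the kernel's allocator behaviour: struct padded, on_line_boundary() and padded_alloc() are hypothetical names, CACHELINE is assumed to be 64, and C11 aligned_alloc() stands in for whatever aligned allocation the kernel side would use. A member alignment attribute only fixes the member's offset within the structure; the member lands on a line boundary only if the base address is aligned as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE 64	/* assumed line size for this sketch */

/* Hypothetical layout echoing the patched task_group fields. */
struct padded {
	int flag;
	_Alignas(CACHELINE) long hot;	/* offset is a multiple of 64 ... */
	_Alignas(CACHELINE) long delta;	/* ... and so is this one */
};

/* Offsets are guaranteed by the compiler; addresses are not. */
static_assert(offsetof(struct padded, hot) % CACHELINE == 0,
	      "member offset must be line aligned");

/* Return nonzero if p sits exactly on a cacheline boundary. */
static int on_line_boundary(const void *p)
{
	return ((uintptr_t)p % CACHELINE) == 0;
}

/*
 * Allocate with an aligned base so the member offsets turn into real
 * address alignment. sizeof(struct padded) is a multiple of CACHELINE
 * because the struct inherits its strictest member alignment.
 */
static struct padded *padded_alloc(void)
{
	return aligned_alloc(CACHELINE, sizeof(struct padded));
}
```

A plain malloc() of the same structure may or may not return a 64-byte-aligned base, in which case hot and delta can straddle or share lines with neighbouring data. The sketch also makes the size cost visible: each aligned member rounds the structure up by a full line whether or not the extra counter is ever used.
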
> #endif
> #endif
>
> --
> 1.7.1
>